Now it is time to spell out exactly what to look for when creating good raw data (or repairing bad input data). Use this as a checklist to quickly classify data into GOOD or BAD (requires clean-up).
Photo credit: Flyinace2000 / Foter / CC BY-SA
Good input data must satisfy these criteria
- Each column must have a heading (not data)
- No blank headings
- No duplicate headings
- No formulas in headings
- No merged cells
- Only one meaning in each column
- Same data type in a column
- No Horizontal Calculations
Creating good data
When you are about to capture new data, we decide the columns we need and put headings for them. Often, we start with few headings and then type / gather data. During the process, we may realize that we need some more columns. We then add more columns.
This process is fairly ad-hoc. At this stage lot of mistakes can happen which will lead to poor quality raw data. This will haunt you throughout the data analytics phase.
Here are important precautions you must take while deciding how to capture data.
The key thing to remember is that one column must have one meaning.
One Column = One Meaning
We often forget this requirement. Therefore, we have columns like this:
Obviously, there are two things being described, the material and the size. So we need two columns.
Sometimes, it is so subtle that we don’t realize it.
That * is additional meaning. Give it another column called Items in Stock.
Hierarchy in one column
Another common mistake – especially with financial data.
Income and Expenses are categories. They deserve separate columns. March 12 and Mar 11 are DATA – NOT HEADINGS. So GOOD data should look like this.
Yes, you will see lot of repetition in the Category column.
How to avoid repetition?
Repetition in the Category column above indicates that we need two tables.
One table should be actual data from Subcategory onwards. And another table which contains just the Sub-category and Category Mapping.
Now you will say, what is the point? In the Pivot Table, we have to show Category anyway. Pivot Table cannot pick up data from two separate tables like this. Therefore, you are forced to put another VLOOKUP column in the green table to get the Category from the orange table – which finally gives you the original table we started with.
Agreed. But that was with older version of Pivot Table. With Power Pivot Data model, such VLOOKUP is no longer required. Why? Because Power Pivot CAN use multiple tables.
With Power Pivot Data Model, we can create relationship between two tables and use the Category column from a separate table. This is how we eliminate duplication.
Read this article for details : Using Power Pivot instead of VLOOKUP
Ideally, to make the data smaller, each subcategory can be given a smaller numeric ID column. And the orange table should have ID, Subcategory and Category columns.
When there is repetition,
think whether you need to split data into multiple tables
This is an important point to keep in mind.
In the next article, we will explore various advantages of converting GOOD RAW DATA into Excel Tables.
This article is a part of a series: Knowledge Pack – Excel Tables