|Reports generated from business applications often become inputs to new reports.
You end up spending too much time cleaning them up. Here is a new and unimaginably faster way of doing it – using Power Query.
Estimated reading time 15 min
Often we use one report output to create another report by combining it with more data. Export to Excel option usually dumps a CSV file. But the data in it is not tabular. It is still like a printout. It contains many pages with headers, footers and totals. We usually DO NOT want those extra things. So we end up spending too much time removing them manually. Here is a sample showing what we remove manually and repeatedly.
Remember that anything repetitive indicates that your method is inefficient. There must be another smarter and faster way.
We saw one method which works across all Excel versions in an earlier article:
Data Clean-up: System Reports
In this article we will see a similar method using Power Query – the new data gathering and clean up tool from Microsoft.
If you follow this method and adopt it across the organization, you can literally save thousands of days of manual effort.
Why use Power Query when older method is still working?
Good question. The reason is simple but fundamental.
Traditionally we worked like this…
We first imported data into Excel and then thought about cleaning it up.
Now Power Query says do it differently…
This is beneficial because unwanted data is never imported. Power Query shows only part of the data so that we can remove unwanted rows and columns, change data types, correct mistakes BEFORE importing the data. When fully clean, import the data.
This is very important if data is large because data beyond ONE MILLION rows it is impossible to import it into an Excel sheet. Power Query can put the data directly into Power Pivot Data Model – which has no row limit. This way we can analyze millions of rows of data using Excel.
So now let us use Power Query to view, clean up and then finally import the data from this badly formatted report dump.
Install Power Query
Power Query requires Office 2010 / 2013 Professional. Download Power Query from here and install it.
Download and follow along
Unzip, Save the TXT file and follow instructions below.
Open the file in Power Query
Open Excel. Create a new blank workbook. Click on Power Query tab and study all available options. It can import data from many sources. Choose the From File option and choose Text file.
Choose the Text file. This file is TAB delimited. Power Query understands this automatically and imports the file. Of course it is in a bad format right now. We will clean it up step by step.
Usually we would have used AutoFilter in Excel to remove the blank rows. But wait. Which blank rows? To understand that it is important to scroll the data and find a column which is USEFUL.
Find a USEFUL column
We want to remove unwanted rows – those which contain the page header / footer.
But how to decide which row to keep and which one to remove?
Look at the data… Observe the pattern of data in each column. Find a column where an empty cell identifies an unwanted row. In this case Column4 is a good column for this purpose. If a cell in Column4 is empty, we don’t want the entire row. That is why I call this column – a USEFUL column.
Remember: For your data, you need to find the USEFUL column. If you cannot find a useful column, this approach will not work. In that case we have to find another method to clean up the data.
You DO NOT need to highlight the column. Just identify it in your mind. Now we are ready to remove unwanted rows.
Remove unwanted rows
Now this is like AutoFilter. All rows which have an empty cell in Column4 need to be removed. Right click on any empty cell in Column4 and choose – Text Filters – Does Not Equal.
It is not necessary to say Does not Equal WHAT? We right clicked inside an empty cell. Therefore the value of that cell is automatically picked up.
Therefore, right now, it means:
Filter all the rows where the cell value in Column4 Does Not Equal (empty).
This is what you get. All header/ footers are removed.
Notice that there are empty columns as well which we don’t want. Usually we delete such columns one by one (or by selecting all of them). It can be done in Power Query also.
However, usually, the columns we need are fewer than all the available columns. Therefore Power Query gives us ability to SELECT what we need and remove the unwanted ones.
Remove Unwanted Columns
Select the first column by clicking on the header – Column 4 in this case. Now keep pressing the CTRL key and click on the headers of all required columns.
Finally, right click on the header of any selected column and choose Remove Other Columns.
Now the data is almost clean. But not fully clean.
First Row as header
Columns are still named as Column7, Column8, etc. However, the first row of the data contains real column names.
Open the Home tab in Power Query and choose Use First Row As Headers option.
So far so good. Proper headers are now in place.
One final problem remains…
Removing repeat headers
Remember that this data was like a printout with multiple pages. The header – Vendor, PBK, Doc No, etc. will repeat for each page. First row we have converted to header. But remaining repeated header rows are still lying around.
We know what to do now. Scroll down and find one such row. Right click on any header cell and choose Text Filters – Does Not Equal.
Now the data is really clean.
All the data is still not visible
Remember that Power Query is NOT showing all the data. It is showing first few rows which are enough to understand the pattern and perform clean up operations. It will fetch more data if you scroll down. But why do that? Now that the data is clean, it is time to import it into Excel.
But wait… one more decision needs to be taken.
Import into Excel sheet or Data Model?
This is a crucial decision. If data is small and regular pivot table, sorting, filtering is to be done, import it into Excel worksheet.
If data is large, it needs to be combined with other tables, needs special functions for complex calculations it is better to import it directly into the more powerful Data Model.
How to decide this? Open the Close and Load option dropdown…
Finally import the data and analyze it.
Bonus: These steps are remembered for future use
What happens next month? Similar dump will arrive and you will have to read this article again and follow these steps…. NO.
In Power Query, look at the right side. All our steps are being remembered. In fact it is automatically generating the code required to perform these steps. It is NOT VBA. It is a new language called M.
You need not learn the language if the job is doable using the Power Query menus and options. More complex operations may require manual coding in M.
For the sake of completeness, here are the steps and the associated code.
Refreshing the data next time
Next time you may receive incremental update or new data altogether. In any case, as long as the file name and path remains same, it is possible to refresh the query. Of course the query will re-import all the data (it is NOT incremental update). But all the manual work we did first time to define the clean up steps is now automated.
If data is in Worksheet, Click the data – you see Table Tools menu on top. Choose Query tab and click Refresh.
If data is in Data Model, choose Workbook Queries, right click on the query and choose Refresh.
Try it out with your data
This is a new concept. You may understand its power but that is not enough. You need the confidence to change from the older inefficient method and adopt this method.
How to build that confidence? Simple! Try this approach with YOUR data a few times and then you will become an expert it.
Next step is to teach it to your team and friends save them loads of time.
Finally, show this to your Learning & Development as well as Process Optimization team and ask them to make this a standard operating procedure as well as a part of induction program for new employees.