Using Microsoft Excel for data science is like using a hammer to put screws in. It works, but it's not what it's meant for. You have to know when to apply the right tool(s) for the job.
Microsoft Excel is the Marmite of the software world: there are those who love it, and those who loathe it. Some people will avoid using Excel at all costs, while others show you their pivot charts and VLOOKUP formulas with the same expectant look as a proud parent showing you a picture of their newborn child. What is it, then, that makes it so divisive but also so ubiquitous? And is Excel a valid tool for data analysis, or is it really like using a hammer to put screws in? In this blog I will discuss the pros and cons of using Excel for data analysis, as well as alternative options to consider when you start your next data analysis journey.
Firstly, I believe that Excel's biggest strength is also its biggest weakness: its ease of use and flexibility. How many times have you needed to calculate something and, instead of reaching for a calculator, instantly opened a new workbook and started putting data and formulas together? Then, from this relatively basic beginning, the same workbook gets expanded to include other data for analysis, before eventually becoming so cumbersome and confusing you can't remember its original purpose. This same flexibility is what makes Excel a useful but troublesome tool for data analysis.
If you want to quickly start tracking and analysing data, then I believe Excel is a good place to start: you can build a very useful tracker which will take a decent amount of data and still perform reasonably well. Just make sure you're using the most up-to-date version of Excel, and don't get your rows and columns mixed up, to avoid a repeat of the track and trace fiasco! I have worked with Excel trackers and Power BI reports, and if the tracker's users are disciplined then you can absolutely get some useful analytics and insights from your data. The problems start when users don't understand the relationship between Excel and Power BI: in trying to be helpful by exploiting Excel's flexibility, they add or delete columns, or rename sheets or files, and in doing so completely break the refresh on your Power BI dataset. If I had £1 for every time I had to fix a Power BI query because of this, I'd be writing this blog from a tropical beach somewhere.
Additionally, as you input more and more data into your workbook, the file will grow in size, as will the amount of your computer's RAM it consumes. The workbook's performance will suffer, using it will slow down your computer, and the file may crash intermittently, meaning you could lose any unsaved data. Obviously, this isn't much good if you want to collect and analyse large amounts of data.
Data integrity is also a big problem with Excel trackers: letting people free-type into cells risks ending up with lots of inconsistent values, which makes data analysis much more difficult. It also leads to unexpected errors in Power BI that the person using the Excel workbook probably never envisaged. For example, people will sometimes put N/A in a date column, not realising their report refresh will fail. Even worse is someone entering 31/11/2020 and Power BI not having the faintest idea what's going on.
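To see how a single bad entry derails a downstream refresh, here is a minimal sketch in Python (pandas assumed, and the date values are made up for illustration) showing strict date parsing choking on exactly these free-typed values:

```python
import pandas as pd

# A date column as it might come out of a free-typed tracker,
# including "N/A" and the impossible 31st of November
raw_dates = pd.Series(["15/11/2020", "N/A", "31/11/2020"])

# Strict parsing, much like a Power BI refresh, simply fails on bad rows:
# pd.to_datetime(raw_dates, format="%d/%m/%Y")  # raises ValueError

# Coercing instead marks the bad rows as NaT so you can find and fix them
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")
print(parsed.isna().sum())  # 2 rows failed to parse
```

The same principle applies whatever your tooling: reject or flag bad values at the point of entry, rather than letting them surface as a failed refresh later.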
Finally, when you keep your data in Excel you also lose the ability to easily integrate and utilise data from multiple sources. This leads to more cumbersome analytics, with data spread across multiple sheets in the same file and then all brought together into some sort of master sheet.
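For contrast, here is roughly what that manual "master sheet" consolidation looks like when done in code instead; a sketch using pandas, with made-up regional data standing in for the separate sheets:

```python
import pandas as pd

# Two sheets' worth of made-up data that would otherwise be
# copy-pasted into a master sheet by hand
north = pd.DataFrame({"region": ["North", "North"], "sales": [100, 150]})
south = pd.DataFrame({"region": ["South", "South"], "sales": [80, 120]})

# In code, the consolidation is a single concat
combined = pd.concat([north, south], ignore_index=True)
print(combined["sales"].sum())  # 450
```

Each source stays untouched, and the combined view is rebuilt on demand rather than maintained by hand.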
If you do want to use Excel for your data, then following some simple rules will let you undertake some decent analytics, though be mindful that you will be somewhat limited.
Those basic rules are:
- Restrict changes to the file except where data must be input
- Don't let users change the name or location of the file
- Don't change worksheet or column names; these are how Power BI locates the data tables and field names
- Don't allow the deletion or creation of columns; this confuses Power BI when the query looks for a column that isn't there any more
- Use data validation as much as possible; this will improve data integrity and reduce the likelihood of the Power BI refresh failing
- Be mindful that Excel's performance will degrade as your dataset grows
- Use Excel as a tracker, rather than a data management tool. It's great as a tracker and very easy to implement in a project; just don't expect it to become some sort of commercial management tool.
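The data validation rule above can even be set up programmatically when you generate the tracker. Here is a sketch using openpyxl; the sheet name, columns and allowed status values are all made up for illustration:

```python
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

wb = Workbook()
ws = wb.active
ws.title = "Tracker"  # Power BI references this sheet name, so don't rename it later
ws.append(["ID", "Status", "DueDate"])

# Restrict Status to a fixed list so free-typed values can't creep in
status_dv = DataValidation(type="list", formula1='"Open,In Progress,Closed"',
                           allow_blank=True)
status_dv.error = "Please pick a status from the list."
ws.add_data_validation(status_dv)
status_dv.add("B2:B1000")

# Only accept real dates, blocking entries like N/A or 31/11/2020
date_dv = DataValidation(type="date", operator="greaterThan",
                         formula1="DATE(2000,1,1)")
date_dv.error = "Please enter a valid date."
ws.add_data_validation(date_dv)
date_dv.add("C2:C1000")

wb.save("tracker.xlsx")
```

Users can still only break what you leave unlocked, so combining validation like this with sheet protection covers most of the rules above.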
What are your options, then, when embarking on your data analysis journey? Firstly, any solution in my opinion needs to be scalable: it's no good having something which fails under a large volume of data, and equally you don't want something which fails to scale down when data volume is reduced, leaving costs high. Therefore, I believe the Microsoft Azure suite is worth utilising for useful and effective analysis, whether that means a database, data warehouse, data lake or whatever other data dwelling you wish your data to reside in. Azure will let you create a resource easily, scale up or down instantly, and helpfully gives you a cost breakdown so you can control your spend. You can also use Logic Apps to interact with other systems and software, creating flows that add another layer to your data management. Azure Data Factory provides powerful pipelines to integrate data from various source systems into one place; you can then use its data flows to clean and optimise your dataset, allowing for much more efficient and effective analysis.
To return to the hammer and screw analogy from the start: I believe Azure is the electric drill of the data world. You have access to a lot of different 'bits' depending on what you are trying to achieve with your data, you can change the speed if you need more or less power, and for those instances where you need to get through some heavy datasets, you can just put it in hammer mode and let it do the hard work.