In this blog series, we’ll explore some easy ways to reduce the size of a Power BI model.
Our previous blog posts:
The impact of duplicates in a column
In our previous blog posts, we showed you how to resize the model to some extent with a few easy fixes. Removing unneeded columns is an important step in this process, because Power BI mainly works with columns.
Sometimes, though, you need to do the exact opposite: instead of reducing the number of columns, it can be more efficient to split one column into several. At first glance, this seems to contradict the last blog post, but it doesn’t
Let’s take another look at the result in our last post:
In that post, we mainly focused on the total size and didn’t pay much attention to the size of each individual column. The screenshot above shows that the total size of all columns is about 16.8 MB, of which 14 MB are squeezed into a single column: SalesDate.
To understand this, we need to look at the first value in PivotTable: the ‘cardinality’ or amount of unique values in this column. You’ll see a correlation between the number of unique values and the size of a column. In the background, Power BI records the number of unique values in most columns in a table. In other words: the more duplicate values, the smaller the underlying table, and the less space the relevant column takes up.
Even though the field is named SalesDate, it contains two values: the date and the time of a sale.
If it contains data from the last ten years, this theoretically means that the column contains 10 * 365 (# days in a year) * 86400 (# years * # days per year * # seconds per day) different values. If we were to split this column into two (SalesDate and SalesTime), these two columns would have a much smaller impact because they’d contain more duplicate values.
Additionally, you’d need to consider whether sales time is a useful metric at all. If not, you could just remove this column. If you do need to keep track of sales time, you could ask yourself if you really need to track it down to seconds. If it’d be sufficient to know if a sale took place between 10-11 am or between 11-12 am, this column will only contain 24 unique values. If we use Power Query to split these data, the result will look like this:
If we use Power Query to split these data, the result will look like this: Wze could change the time from 17 to 18 If you check the memory in Excel now, you’ll be amazed by the result:
If you check the memory in Excel now, you’ll be amazed by the result: the size of our data model has been reduced from 16.8 MB to a mere 4 MB.
The SalesID column
In our last post, we discussed removing unneeded columns. I should’ve removed the SalesID column then, but I kept it around for this post.
The screenshot shows that the size of the SalesID column is about 1.7 MB out of a total of 3.9 MB. This is because SalesID is the primary key in this table – in other words: every value in the SalesID column is unique. This makes its cardinality equal to the number of rows in the entire table.
Ask yourself: “What do I need SalesID for in my model?” If you can’t come up with an adequate answer to this question, it’s best to remove this column from the model.
So, do not think : “this might come in handy later on, i will leave him in my model” When thinking like this you should better remove the column the less columns there are, the better Power Bi can do it’s job.