Preamble
A few days ago, I started writing this blog post because of a problem I encountered using a big data set in Tableau. The following day, after I had written most of this blog post, Tableau released version 8.2. In the 8.2 release, the problem I describe herein is supposed to be solved by a new type of database driver that replaces the Jet driver. In Part 1 of this post, I talk about how I handled a big data set, a CSV file in excess of 4 Gb, prior to version 8.2. In Part 2, I investigate the performance of the new version 8.2 database driver.
If you like this article and would like to see more of what I write, please subscribe to my blog by taking 5 seconds to enter your email address below. It is free and it motivates me to continue writing, so thanks!
Introduction (Original Post)
I need to offer you a quick apology. This blog post is not yet finished. It will take me a while to complete, so you might have to check back later for updated information. Normally I don't write posts this way, but I'm about to embark on a mission to document the methods I am using and creating to handle some pretty big data in Tableau. By big data, I mean text-based data in CSV format that exceeds a file size of 4 gigabytes (Gb). I knew this day was coming and I have written some previous posts in this regard (click here to review some of these older posts). Now that this day has arrived, I'm in problem-solving mode, looking for solutions to real-world business problems. Since I don't have an IT department at my disposal or database gurus sitting next to me, I get to resolve any issues I face mostly by myself (except for the support I get from the excellent Tableau community). Finding these types of solutions is not always easy, and it is made more difficult by concurrent project deliverables that have to be met.
Why is a File Size of 4 Gb Significant?
In previous years of using the 32-bit version of Tableau Desktop, the size of the data sets I could send to Tableau was limited by the 4 Gb of addressable memory space. I wrote that I was comfortable using “typical business data sets” of up to about 20 million records (with 10 to 15 fields per record). In 64-bit Tableau, you can address more memory than you can physically put in a computer, so we are no longer limited by random access memory (RAM). However, if you use the Jet database driver to load data into Tableau like I commonly do (prior to Tableau version 8.2, circa June 2014), you run into problems once your CSV file goes beyond 4 Gb. I recently resolved a problem in this regard (click here for that post), but now I've hit a barrier that cannot be overcome because of the Jet driver limitations. For more information on these limitations, (click here) or (click here) or (click here). Tableau is great at giving us these technical notes to help us quickly resolve problems we encounter. However, even if you know about these limitations, creating a workable solution for your big data set may not be as easy as you think.
At this time, I have not been able to find a step-by-step guide for making the transition from loading data into Tableau from CSV files to loading it from SQL databases. This transition is necessary because the Jet database driver cannot reliably load data from CSV files in excess of 4 Gb. For people who typically use SQL databases in their day-to-day operations, this type of issue has probably never occurred. For people like me who receive a wide range of data from a diverse client base (many times in CSV format), loading data into SQL databases is typically not necessary since I'm not managing that data; I'm only reading and interpreting it. As time goes by, however, clients are asking me to analyze larger and more diverse data sets, which will likely push me into using more SQL databases.
The Size of this Business Case
In my current business case, I have about 8 years of department-level sales data from a few hundred stores, spanning about 20 million records. By itself, the sales file has about 25 fields per record. In a separate file, I also have store attribute data such as location information, size, operational settings, etc., with another 25 fields per record for the several hundred stores. Normally I just join these in Tableau and am able to do geographical analysis as well as time-series analysis without any trouble. In this business case, however, the plan did not work as usual for the reasons explained below. To get this data loaded into Tableau, I had to learn how to do the things documented in this post.
Beginning Work With the Sales Data
I was able to load the 20-million-record sales data set by itself into Tableau and work with it without any trouble. As usual, Tableau flew through creating the extract, and I was merrily working with the data in a few minutes. The trouble began when I wanted to add a second table to the analysis that included the store location information so that I could do some geographical analysis. Tableau was unable to complete the data join due to the Jet database limitations, and it failed to create a Tableau data extract. At this point, I could have broken the sales data into two parts, done two separate joins, and moved on with the analysis. Although this approach would be feasible, it wouldn't be optimal and would require more bookkeeping than I would prefer to do. When the data join failed, I did not realize that the reason for the failure was related to the Jet driver limitations. Now that I know this, I would approach the next step differently than I describe below.
Joining the Store Geographical Data with Sales Data
I wanted to join the store geographical data (i.e., location information) with the sales data to be able to do my normal work in Tableau. My first attempt was to join the two CSV files in Tableau. This didn't work due to the Jet driver limitations, so I chose to join the data sets together using Alteryx. This worked perfectly but created a new CSV file with about 50 fields per record across 20M records, which was greater than 4 Gb in size. With the sales and store attribute data combined into one file, the store attributes were repeated on every sales record, which added a lot of extra overhead to the problem. Now this file was too large for Tableau to create an extract, so I was back to being unable to perform my work.
The next time I encounter this type of situation, I will not do the join in Alteryx. I'll simply load the store data and the sales data into two separate tables in the SQL database and let Tableau do the join for me, avoiding the extra overhead created by joining the two files in advance. In other words, there is no reason to carry the 25 fields of store attribute data, like latitude and longitude, across 20M records since the attribute data does not change over time.
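To make that concrete, here is a minimal sketch of the kind of query Tableau would effectively issue when the two tables are joined in the data connection. The table and column names below (sales, stores, store_id, and so on) are placeholders for illustration, not my actual schema:

```sql
-- Keep the 20M-row sales table narrow; store the ~25 store attributes once per store.
-- A Tableau multi-table connection effectively runs a join like this at query time.
SELECT s.store_id,
       s.sale_date,
       s.department,
       s.sales_amount,
       st.store_name,
       st.latitude,
       st.longitude
FROM   dbo.sales AS s
       INNER JOIN dbo.stores AS st
               ON st.store_id = s.store_id;
```

The point is that the store attributes live only once per store in the small stores table, instead of being duplicated across 20M sales rows.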
Getting the Data Into SQL
I am not an SQL guru, so the steps explained in this section are from the perspective of a relative novice. I have done work on various SQL platforms such as PostgreSQL, but I haven't had the need to do much programming in SQL because the data I use typically gets shipped to me in flat-file formats, and Tableau has been able to handle just about anything I have thrown at it. Until now, that is. So to get this data set into Tableau, I had to install an SQL platform, load the data into a database, and link Tableau to that database. This took me a couple of hours to do.
First, I had to download the 1 GB+ Microsoft SQL Server 2014 behemoth. Comprehending this system is a bit like learning to speak Icelandic within a week. If you don't already know about the Icelandic language, it is purportedly one of the most difficult languages to comprehend and speak, unless you are an autistic savant like Daniel Tammet. Learning the Microsoft SQL parlance might be easier than what Daniel had to do, but probably not by much. Installing SQL Server 2014 is an adventure, as explained in this blog post. Luckily, I have done enough installations through the years to get it done, and I had an operating SQL platform within an hour.
The second step involved using the SQL Server 2014 Import and Export Data utility. Using this program, I was able to load the 20M-record file into an SQL Server database for use with Tableau. The import speed was decent, as it took only a few minutes to load the file. The utility was not very flexible in specifying the incoming field data types, however, and this shortcoming became a clear limitation in step three of my process.
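For anyone more comfortable in T-SQL than in the wizard, a rough sketch of an alternative is to create the destination table with explicit column types and then load the file with BULK INSERT. The table layout, file path, and delimiters below are illustrative placeholders, not my actual data:

```sql
-- Define the destination table with explicit types up front
-- (placeholder columns; the real sales file has ~25 fields).
CREATE TABLE dbo.sales (
    store_id     INT           NOT NULL,
    sale_date    DATE          NOT NULL,
    department   VARCHAR(50)   NULL,
    sales_amount DECIMAL(12,2) NULL
);

-- Load the CSV straight into the typed table.
BULK INSERT dbo.sales
FROM 'C:\data\sales.csv'
WITH (
    FIELDTERMINATOR = ',',   -- comma-delimited file
    ROWTERMINATOR   = '\n',  -- one record per line
    FIRSTROW        = 2      -- skip the header row
);
```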
In the third step, I connected Tableau to the SQL database. This was relatively easy to do, and I only had to make a couple of guesses to get it done. The primary problem became very clear once Tableau created the extract. Tableau saw every field in the SQL database as a string, and every one of them was placed on the dimensions shelf. Why was this the case? Well, apparently, the Import and Export Data utility does not do any automatic data type recognition during the CSV import. Every field in my example was treated as a varchar string of length 50. Not only is this handling of the data inefficient and inaccurate, it is highly wasteful of disk space! To be able to use the data in Tableau, I had to use the Tableau data conversion functions and place the fields on the proper dimension and measure shelves. This was not a lot of fun to do, and it adds significant overhead to the Tableau workbook.
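One way to avoid doing those conversions field by field inside Tableau would be to layer a typed view over the raw varchar table and point Tableau at the view instead. This is only a sketch with placeholder names, and it assumes TRY_CONVERT, which is available in SQL Server 2012 and later:

```sql
-- Expose the all-varchar(50) import as properly typed columns.
-- dbo.sales_raw stands for the table the import wizard created.
-- TRY_CONVERT returns NULL instead of failing when a value will not convert.
CREATE VIEW dbo.v_sales_typed
AS
SELECT TRY_CONVERT(INT, store_id)               AS store_id,
       TRY_CONVERT(DATE, sale_date)             AS sale_date,
       department,                              -- genuinely a string, left as-is
       TRY_CONVERT(DECIMAL(12,2), sales_amount) AS sales_amount
FROM   dbo.sales_raw;
```

With a view like this in place, Tableau should pick up the numbers and dates directly and drop the fields onto the proper shelves on its own.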
Upon further reading and a tip from Zen Master Jen Underwood, it became clear that the MS Import and Export Data utility can be configured to import data more efficiently by using a “schema.ini” file. In this file, each data field is specified prior to import so that the SQL database is loaded with the proper data types. When searching for a worked example of this file, however, I drew a blank and had to pass on this step because I had real work to do. I will figure out how to do this and document the process in this blog post at a later date.
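For reference, based on Microsoft's documentation of the Text driver that reads this file, a schema.ini for a comma-delimited file appears to look roughly like the following. The file name, field names, and types shown here are placeholders for illustration rather than my actual layout, and I have not yet verified this against my own import:

```ini
[sales.csv]
Format=CSVDelimited
ColNameHeader=True
Col1=store_id Long
Col2=sale_date DateTime
Col3=department Text Width 50
Col4=sales_amount Currency
```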
Conclusion
I was able to get my big data example loaded into Tableau by using the MS SQL Server 2014 platform. This platform will allow me to continue using large data sets with Tableau. I still have a lot to learn to become more efficient with this workflow, but for now I'm moving on to testing the new Tableau version 8.2 database driver.
As a sneak peek at part two of this blog post, I just finished loading this business example into Tableau version 8.2 and comparing it to version 8.1. This example is the pre-blended >4 Gb data file generated from Alteryx, with over 0.75 billion pieces of data included in the file. Tableau 8.1 created the extract in 13:08 and Tableau 8.2 created it in 6:49, a wonderful speed improvement of nearly 50% in creating the data extract. Additional testing on data joins is ongoing and will be reported in part 2 of this post.