Introduction
When you have used Tableau as much as I have, there are many funny things that you can discover about the behavior of the software. In this article, I show how the sort order of incoming data can confuse Tableau such that a data column is improperly interpreted and handled.
This example is not a fatal flaw, but it can drive you crazy if it ever happens to you. This is just one of those little idiosyncrasies that can happen with software. The program developers implement an assumption which is good for most cases, but may not be valid for all cases.
In this case, the assumption I am referring to is related to how far down into a file Tableau will read the data at load time. If the initial read is not complete, then the data interpretation can be flawed. The good news is that it is easy to fix the problem once you know it exists.
Background
I did this work a couple of months ago, documented it in a couple of videos, and then quickly forgot about it. The testing for this work was completed initially in Tableau Version 9.2. Tonight I stumbled upon the videos I made and decided to write this article in 30 minutes or less! So here it goes.
When you first introduce data to Tableau in the data source editor, a portion of the data file will be interrogated and an intelligent decision will be made by the software regarding the data types found in each column of the file. [Update the next day: Tableau reads the first 1,024 lines of a csv file to make those determinations, and 10,000 lines of an Excel file – thanks to Tableau’s Dan Cory for this insight]
Typically (probably >99.9% of the time) the data determination is perfectly accurate. There are times, however, when Tableau can be fooled. In this article I will show you an example of how this can happen.
The Sort Order Example
As shown in Video 1, I used Alteryx to sort the 2.3 million record data file in two ways. I did a two-level sort, with the first variable being sorted being a date field (column V in Figure 1). In column V, of the 2.3 million records, about 1.4 million of them contain Null dates and 900,000 of them contain actual dates. The reason for this is related to the type of record being stored in the file – not every record has this type of date.
The two sorting methods used are descending order and then ascending order for the date field shown in column V (Figure 1.).
Both of the sorted files have identical data. When the data is first introduced to Tableau, there are two different data interpretations made for the date field in Column V. This was the field that was used to do the sorting. The thing to realize is that both files have exactly the same data. The only difference is that they are sorted differently.
Influence of the Sort Order
As shown in Video 2, the sort order has a profound influence on whether Tableau makes the proper determination of the data type in the date field that was used to perform the sort.
Case 1 – Ascending Sort Order
In one case (ascending sort, with Null dates at the top of the file), the date field is determined by Tableau to be a text field. Even with typecasting the field to a date, Tableau does not recognize the dates that are stored in the lower 900,000 records in the file.
The only way you can see the dates in this case is to draw them as text strings. You will never be able to achieve date functionality from this column of data, even though there are over 900K records of valid dates!
Case 2 – Descending Sort Order
In the second case, with a descending sort order used on the date field, Tableau properly identifies the field as dates. in this case, Tableau functions perfectly with the data in column V.
Verifying The Behavior in Tableau Version 10
As I mentioned above, I did this testing a couple of months ago and forgot about it. To verify whether this phenomenon still occurs in Version 10, I did a quick test and the results are shown in video 3. The results in Version 10 are the same as they were in version 9.2, with Tableau being fooled by the sort order used.
I even spent some time to see if I could use data source filters at load time to correct the mis-behavior. I was not able to create a data source filter that solved the problem. Although it may be possible to resolve this problem in some other way like using data source filters or some other ingenious approach, having to do this at the time you are first loading your data into Tableau is not ideal. The easiest solution is to resort the data as I have explained.
Final Thoughts
Although this finding may not ever impact your work, if it does it can cause you to lose some time and possibly a little of your sanity. This issue might also be more likely to occur in larger data sets, like the one I used in this example. This mis-interpretation of the data type is not a big deal because it is easy to fix.
You need to make sure that when you introduce your data to Tableau for the first time, make sure you don’t have a large number of Null values at the top of the file. Sort the file in such a way that actual data values are near the top of the file. This will assure that Tableau properly interprets your data.
In discussing this with my co-worker Nate, he suggested that Tableau could implement some sort of random record reading (sample every N-records) for very large files. If this were implemented, the dates that were stored in the lower 900K records would be found and properly interpreted by Tableau at load time.
Additional Testing the Next Day
In a comment on this article, I learned today that Tableau reads the first 1,024 lines of a csv file when it first loads the information (10,000 lines in an Excel file). Based on this insight, I did additional testing of my file to see when the number of Nulls at the beginning of a file cause the malfunction.
I learned that Tableau worked fine if 999, 1001, and even 1010 nulls were placed in the beginning of the file. When I skipped up to 1100 Nulls, however, the malfunction was in play. Based on this, just make sure you do not have more than 1,024 nulls at the beginning of your csv file in any of your data columns. I have not yet tested this behavior in an Excel data source.
This was great, Thank you Ken.
Hi Joni,
I hesitated to publish this because it is such an obscure issue that I just happened to stumble upon one day. However, I learned a long time ago, that if I don’t write these things down, they will return and bite me with a vengeance at the least opportune time. When this originally happened, Nate and I spent a little while trying to figure out the cause, so usually that means the topics is a good one for an article.
Ken
Were you able to test whether the Alteryx Auto Field tool properly identifies and casts the field as a date? If so, then there wouldn’t be a need to sort the data, would there?
Pingback: 3danim8's Blog - How To Use Hadoop With #Tableau and #Alteryx