My First Guest Blogger
This article features the first guest blogger on 3danim8's blog. If you watched Part 1 of this series on processing global travel data, you were introduced to Chas and his new Camaro. Without further delay, here is Chas with his first technical article.
Introduction
Hello World! I have never written a blog post before, but this seemed like a good way to get started. This post is the third (and final) installment in a series focusing on global travel data. In Part 1, Ken explained the specifics of our problem and developed an Alteryx workflow to solve it effectively.
However, as Ken mentioned in Part 2, this workflow is not the only method that could be used to tackle the problem. We thought it would be interesting to compare and contrast the Alteryx method we used on Saturday with the methodology I used in my initial attempts to explore and solve the problem.
Who Am I?
Before going further, I will reintroduce myself. My name is Chas Latch. A little over two months ago I graduated from Baylor University (sic ’em bears) with a degree in Computer Science.
I will also take the time to issue a small disclaimer: I am very much a newcomer to all of this. The first time I heard the word Tableau was when I started my new (and first) permanent job at General Motors. In fact, I opened the program for the very first time only three weeks ago. In that short time span, I have not even scratched the surface of what this tool can do.
However, it is immediately apparent that this tool is something extraordinary. I also want to thank Ken for letting me write here. I look forward to learning even more from this blog and am grateful for the opportunity to contribute!
The Task
As Ken mentioned in Part 2, this task came to Sylvia, Roy, and me in a relatively high-level email. The task we were interested in was processing the itineraries in a set of travel records. In particular, we wanted to parse each itinerary string into its component parts.
So if a particular travel record were a trip from Austin to New York, its corresponding itinerary string would be "AUS/JFK". We want a process that takes this string and extracts the two sub-strings "AUS" and "JFK". This may not seem like a big deal: just split on the character "/". Duh. Unfortunately, there is more complexity here. For example, if a record were a trip from Austin to Miami with a connecting flight in Dallas, the itinerary would be "AUS-DFW/MIA" and the parsed components would be "AUS", "DFW", and "MIA". For more information about this, see the whiteboard session Ken did in Part 1 of this series.
Whenever I see the word "parse", I immediately think of regular expressions (regex). Many readers may already be familiar with regular expressions and can skim the remainder of this paragraph. So what is a regular expression? There are many ways to answer that, but the simplest way to think of one is this: a regular expression is a way to represent patterns within a string of characters, using a specific set of symbols and syntax. The particulars of regex syntax are beyond the scope of this post, but more information can be found here.
While the particulars of regex syntax are not essential here, what matters is what regular expressions allow us to do. Because they let us efficiently recognize certain kinds of patterns in strings, we can define operations around those patterns. For example, we could create a regex_replace(pattern, target, substitute) function that takes the first occurrence of a pattern in the target string and replaces it with a substitute. We can also define a regex_extract(pattern, target) function that simply returns the first occurrence of a pattern in a target string.
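To make that concrete, here is a minimal sketch of how such functions behave when called from a query. Note that regex_extract and regex_replace are not built-in SQL; they are user-defined helpers (Access lets you call public VBA functions from queries, which is how I implemented them, as described below), and the table and column names here are made up for illustration:

```sql
-- Illustrative only: regex_extract and regex_replace are assumed
-- user-defined helper functions (e.g., written in VBA), not built-in SQL.
-- TravelRecords and Itinerary are hypothetical names.
-- The pattern '[A-Z]{3}' matches three consecutive upper-case letters,
-- i.e., one IATA code.
SELECT
    Itinerary,                                             -- 'AUS-DFW/MIA'
    regex_extract('[A-Z]{3}', Itinerary)     AS FirstCode, -- 'AUS'
    regex_replace('[A-Z]{3}', Itinerary, '') AS Stripped   -- '-DFW/MIA'
FROM TravelRecords;
```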
Ok, so what? Is it possible to craft a regex pattern that would extract the component parts of an itinerary string? I was confident the answer was yes. First, however, I needed a way to run regex against the data. I quickly started looking into ways to do this and was excited to see that Tableau does have some support for regex in calculated fields. I was then disheartened to learn that this was a new feature shipping with Tableau 9, and I only had access to Tableau 8. I would have to leave Tableau to do the regular expression work.
My Solution
I feverishly started searching for alternative ways to apply regex to my data, even as a proof of concept for later refinement. Ultimately, the quickest solution I could come up with was to dump the entire data set into an Access database. Unfortunately, I immediately ran into problems.
Access seems to limit how many rows of data may be imported from an Excel spreadsheet. From what I found, this limit is around 65,500 rows (likely the old 65,536-row Excel worksheet limit), while our data set had around 100,000 rows. So I begrudgingly partitioned my data into chunks of around 60,000 rows and imported the two sets.
Once the records were in an Access table, I used SQL and some regex functions written in VBA (Access's native scripting language) to do an initial parse of the itineraries. Figure 1 shows the initially imported table, and Figure 2 shows the table resulting from the initial parse. We were on the right track!
Clearly, this was a step in the right direction. What is not so clear is the complexity of the regex at the core of this query. Figure 3 shows the SQL query used to do this parse. Yuck. Yes, this query is UGLY.
It may not be obvious what is going on with the regex expressions shown in Figure 3. Within the sub-query, the first column we select is simply the unprocessed itinerary string. The next column is the first IATA code found in the itinerary. Next, we need the second IATA code. To get it, we remove the first code (by replacing it with the empty string) and repeat the method used to get the first one. We can keep repeating this process, stripping off the first N codes to extract code N+1, for any N we choose.
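Here is a stripped-down sketch of that idea (this is not the actual Figure 3 query; the names are illustrative, and the real query carries the nesting out much further):

```sql
-- Sketch of the Figure 3 approach, showing the first three legs only
-- (the real query repeats this nesting out to ten columns).
-- regex_extract/regex_replace are the assumed VBA-backed helpers above.
SELECT
    Itinerary,
    regex_extract('[A-Z]{3}', Itinerary) AS Leg1,
    regex_extract('[A-Z]{3}',
        regex_replace('[A-Z]{3}', Itinerary, '')) AS Leg2,
    regex_extract('[A-Z]{3}',
        regex_replace('[A-Z]{3}',
            regex_replace('[A-Z]{3}', Itinerary, ''), '')) AS Leg3
FROM TravelRecords;
```

You can see how quickly the nesting piles up, which is exactly why the full ten-column version in Figure 3 is so ugly.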
For this data, I happened to choose a value of 10: for each itinerary, ten columns are created, each containing the corresponding leg of the trip. I was thrilled! Just like that, I had a lovely reference table.
I powered up Tableau, connected to the new database, and joined the table containing the original travel data with this new table. The problem was, this didn't work. The visualization that I think everyone had in mind was a world map with different trips represented as paths, and unfortunately, a path map requires the path points to be on different rows.
The path points correspond to the component IATA codes making up the itinerary. So we not only needed to parse these itineraries into their own columns, we then needed to transpose each of those columns onto its own row so Tableau could create the path map we wanted. This may sound complicated (and it is), but as shown in Figure 4, the SQL required to do it doesn't wind up being too bad (note that the figure shows only a portion of the query, but you get the idea).
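Conceptually, the transpose is just a stack of UNION ALL selects, one per leg column. Here is a hedged sketch of the Figure 4 idea, with illustrative table and column names:

```sql
-- Sketch of the transpose: each LegN column becomes its own row,
-- tagged with its position in the itinerary (names are illustrative).
SELECT Itinerary, 1 AS LegNumber, Leg1 AS IATA FROM ParsedItineraries
UNION ALL
SELECT Itinerary, 2, Leg2 FROM ParsedItineraries
UNION ALL
SELECT Itinerary, 3, Leg3 FROM ParsedItineraries
-- ...and so on, up through Leg10.
```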
So what are the costs of an operation like this? The time cost seemed reasonable: on my laptop, this query completes consistently in about 2 minutes. What about space? Some expansion of the data is expected; after all, we are parsing each record and expanding it into its parts.
However, the above query is guaranteed to increase our data to ten times its original size, which seemed gratuitous to me. Sure, some complicated itineraries might use all ten rows allocated to them, but many itineraries are made up of only 2 or 3 parts. These shorter itineraries would be carrying around 7 or 8 unneeded empty rows (note the empty columns in Figure 2). Again, SQL comes to the rescue, as shown in Figure 5.
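The fix amounts to filtering out the padding rows. A minimal sketch of the Figure 5 idea (again with illustrative names):

```sql
-- Sketch of the Figure 5 idea: keep only the rows whose leg actually
-- exists, discarding the padding generated by short itineraries.
SELECT Itinerary, LegNumber, IATA
FROM TransposedItineraries
WHERE IATA IS NOT NULL AND IATA <> '';
```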
So now we could go into Tableau and make the visualization we had in mind; a path map was possible at this point. However, there was one more little wrinkle: what if we want to calculate the distance of one of the legs of a flight? I thought for sure that this would not be an issue with Tableau. I was confident that even if Tableau could not calculate the distance between two points on the earth out of the box, I would be able to create the necessary calculated field to do the math on my own.
This, however, is where the snag occurs. We were using the custom geocode method described in one of Ken’s earlier blog posts. By utilizing this method, we were able to have Tableau translate the IATA codes in our data into their corresponding latitude/longitude pairs, which is exactly what this technique is supposed to accomplish.
However, Tableau does not currently allow a generated measure to be used within a calculated field. Because we were using one of the convenience functions built into Tableau, our latitude and longitude values would be inaccessible to any calculation we wanted to perform.
Without access to latitude and longitude, calculating the distance between two IATA codes would be impossible. So, what does this mean? We must create another reference table containing IATA codes and their corresponding coordinates, as in Figure 6.
We can then specify a join to this table when we connect in Tableau. Note in Figure 7 that, because we need the locations of both the origin and destination IATA codes, we need to join this table a second time. Our data now contains all the information we need to create a calculated field that finds the distance between the two points of a leg.
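Here is a sketch of what that double join looks like in SQL (the table and column names are illustrative, not the actual ones from Figure 7):

```sql
-- Sketch of the double join (illustrative names): the coordinate
-- reference table is joined once per endpoint of a leg.
SELECT t.Itinerary, t.LegNumber,
       o.Latitude AS OriginLat, o.Longitude AS OriginLon,
       d.Latitude AS DestLat,   d.Longitude AS DestLon
FROM (Legs AS t
      INNER JOIN IATACoordinates AS o ON t.OriginIATA = o.IATA)
INNER JOIN IATACoordinates AS d ON t.DestIATA = d.IATA;
```

With both endpoints' coordinates on a single row, a calculated field can apply a great-circle (haversine-style) formula to those four values to compute the leg distance.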
An example visualization showing this is shown in Figure 8.
Completed Solution
So there you have it. Using the above methodology, we can parse travel itineraries the way we want to. However, this elaborate solution placed a pretty heavy load on my Tableau application. Was Tableau doing what I asked? Absolutely.
However, multiple “Out of Memory” errors and a few other stability issues (Figure 9) were a pretty clear indicator that Tableau was having to fight with some aspects of this approach to give us what we wanted.
It is highly likely that the data joins used in Tableau were the culprit. Nevertheless, I am sure there are many aspects of this process that could be optimized, and the methodology might even be considered a viable way to do this kind of parsing. Keep in mind, though, that such optimizations take time to implement, as did the initial development.
It took three days of pretty solid work to create this solution, and one can imagine it would take many more hours before it was ready for "prime time." Further, can you imagine trying to maintain this process? It is hard enough for me to keep it straight in my head, and I am the one who wrote it. I can't imagine being handed this process and having to work out what it does, and why, on my own.
Comparison to Alteryx
When Ken and I sat down on Saturday to recreate this process as an Alteryx workflow, I was amazed. It took us less than 2 hours to recreate what had taken me three days. That still shocks me. I realized that I could either spend an entire day manually working through this process or accomplish the same thing with an hour of tinkering in Alteryx. Figure 10 shows the final workflow we developed to solve the same problem.
And that's not even the whole story. The Alteryx workflow was not only faster to develop, it was also far faster to run. If you watched the videos in Part 1, you saw just how quickly Alteryx completed what took my query minutes to perform. On top of all of this, the icing on the cake is that the workflow is much easier to maintain than the approach detailed above (at least in my opinion). The whole point of this post is to show just how much work it took to accomplish this task by hand versus how easily Alteryx completed it.
Wrap Up
All the gains seen by using Alteryx in this situation were unreal. By using this tool, we eliminated the need to write a single one of the SQL statements used by this method. What's more, the regular expressions needed by Alteryx are GREATLY simplified from what we had to use here. I believe the only real regex pattern we used was "[A-Z]{3}", which matches the first run of three consecutive upper-case letters it finds (IATA codes happen to be formatted this way). Clearly, this is an incredible simplification of the regex used in Figure 3.
Summing up, in less than 2 hours in Alteryx we accomplished what had taken me three days to do manually. Perhaps the most impressive measure of relative performance is query execution time. My solution takes around 2 minutes to parse and format the data. Compare this to the 10-12 seconds it takes Alteryx to do the entire job, and you see that Alteryx completes the task in roughly 8-10% of the time. I know that I was utterly shocked by this.
If you want to get close to the same effect we observed, I encourage you to skim over this post again very quickly and then go re-watch the Part 1 videos. Tell me, which process would you rather use?
For me, there is no question. By using Alteryx, we could focus on what we wanted to do to our data rather than how we would do it. Under the hood, the methods Alteryx uses may not be all that dissimilar from what we did in this post; that is entirely speculative, since I have no idea how the black magic inside Alteryx works. But not needing to know is precisely the point I am trying to drive home.
One of the biggest strengths you get from using a tool like this is that you don’t need to be concerned with the how. We tell the tool what operation we would like done, and Alteryx takes it from there. This frees us from the many hours it would take to design, implement, debug, and maintain the “how” parts of the solution. Instead, we focus on interpreting the results of our analysis and determining how we wish to proceed next.
Alteryx boosts your productivity more than any other tool I have been exposed to thus far. That's not to say Tableau can't accomplish this on its own; it can (as the methodology above proves). But when we offload the non-visualization tasks from Tableau to a more suitable back end, we start to see what a crazy powerful combination Tableau and Alteryx can be.
If you managed to stay with my post long enough to reach this point, thanks for reading. I hope it was interesting. I look forward to continuing along this journey of Tableau goodness and hope that this post is the first of many more to come as I slowly discover the awesomeness that is data analytics.
Great post, Chas! Very good explanations. Yes, I read it all 🙂
Thanks Jorge!
Fantastic post Chas! This is just what I needed to show my students! Sic ’em!
Thanks for reading, Dean. Glad it was helpful for you and your students!
Chas