Introduction
Sorting data is a fundamental technique and a skill that anyone who frequently works with data needs to possess. I have sorted a lot of data through the years, for many different reasons. When working with a wide variety of data sources that might include periodic data updates and restatements, sorting is sometimes the only way to get the job done. As data sets continue to get bigger, the challenges associated with sorting continue to increase.
In this article, I describe four sorting methods I have used to handle data of various sizes, formats, and types. These methods are MS Excel, a custom program called “Dsort”, Vedit, and Alteryx. Benchmark results for sorting big data examples are included, with one example having over 2 billion records.
Motivation For This Post
In my previous and current lines of work, sorting data is a common practice. There are so many reasons that I have had to sort data that I could write pages describing them. If you do a Google search on sorting large files, however, the results are less than informative: they are not exactly up-to-date (Figure 0) and give you no insight into how to sort big data! There is a big hole in the available information on sorting big files, and that is why I am writing this post.
To explain the need for information on this topic, I offer a recent real-world business example that underscores the necessity of sorting data.
The Need For Sorting
For a nationwide process-improvement experiment, I received from a client a large number of files representing about three years of historical data from over 2,000 store locations. This data was transaction-level information: about 13 million records of credit card charges from the stores. New stores were coming online all the time, and the database was growing monthly.
During the multi-month study, I was to receive incremental weekly updates of the data rather than the entire data dump week after week. Incremental updates are common in consulting work because too much time is needed to write and transfer the entire database every week. Usually an automated process is set up to create the weekly update and transfer the new file to me via FTP.
Each week, tens of thousands of new records were sent to me to extend the data set as we conducted the experiment. This basic setup was fine because, theoretically, all I had to do was append the records to the big file.
The problem was that I received not only new weekly data but also restated data records from previous time periods. The restatements were needed because customers had returned items or various other data corrections had been applied. The restated records had to replace the existing records in the big data set, which meant the potential for duplicate records existed every time I appended data. To eliminate this possibility, sorting became a necessary part of the workflow I developed to update the big data set every time new data arrived; a sketch of the idea is shown below.
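To make the workflow concrete, here is a minimal sketch of the append-then-deduplicate step in Python. The file names, the key columns, and the use of pandas are all my own assumptions for illustration; the actual workflow used the tools discussed in this post.

```python
import pandas as pd

# Hypothetical file names and key columns -- assumptions for illustration.
MASTER_FILE = "master.csv"
WEEKLY_FILE = "weekly_update.csv"
KEY_COLUMNS = ["store_id", "transaction_id"]  # uniquely identify a charge

master = pd.read_csv(MASTER_FILE)
update = pd.read_csv(WEEKLY_FILE)  # new records plus restated old ones

# Append the weekly file after the master, so for any duplicated key the
# restated record is the last occurrence and keep="last" makes it win.
combined = pd.concat([master, update], ignore_index=True)
deduped = combined.drop_duplicates(subset=KEY_COLUMNS, keep="last")

# Sort so the data set stays in a predictable order for the next append.
deduped = deduped.sort_values(KEY_COLUMNS, kind="stable")
deduped.to_csv(MASTER_FILE, index=False)
```

The key step is the sort/deduplicate pair: once restated records land next to the originals they replace, removing the stale copies is trivial.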
Four Sorting Methods
I only want to spend a short amount of time discussing three of the four sorting methods included in this post. Although I use Excel, “Dsort”, and Vedit frequently, all three have shortcomings that will keep you from using them on big data sets. They have their purposes, but if you are only interested in sorting big data, jump directly down to method #4.
1. MS Excel Sorting
Current versions of Excel can sort up to essentially 1 million rows of data. It can perform multiple-level sorting and can sort left to right or top to bottom. For small data files, Excel sorting is great. Excel even has a “Remove Duplicates” feature that is handy in many instances. Excel is also good at multithreaded calculation, which allows it to do big operations fairly quickly by employing all the computational cores on a machine. The improved performance is noticeable if you do a large number of VLOOKUPs or perform big sorts.
I have encountered one strange issue with the 64-bit version of Excel. The issue has occurred often enough for me to start a blog post called “Excel 64-bit – is it a good boy or bad boy?”. I haven’t completed that post because I cannot reliably replicate the behavior. Sometimes 64-bit Excel simply will not sort moderately big files. One example had about 500K rows and 200 columns, and Excel repeatedly failed to sort it, while my other computer, running a 32-bit version of Excel, sorted the same file just fine. It is just one of those weird behaviors you should realize might happen to you one day if you are using 64-bit Excel.
2. Dsort Sorting
Back in the mid-1980s to 1990s, sorting big files was really troublesome. There weren’t many tools available to do general sorting operations on the numerical data sets we created as part of numerical modeling studies. Since I had to sort a lot of different numerical modeling files, I worked with a friend of mine to develop a general sorting program in the C language. We called the program “Dsort”, for “data sorting”. With this code, we were able to sort very big numerical modeling files using multi-level sorting. It was principally intended for files containing floating-point numbers, such as numerical model output files with data structures like x (coord), y (coord), z (coord), data value, and time.
Click here if you want to download a copy of the source code. I had to place the source code in an Excel workbook to get it into this content management system. The code is written for 32-bit operating systems because that is all that existed at the time, but it could easily be modified for 64-bit systems for greater sorting capacity. If recompiled for 64-bit operation, the sorting capacity of this code would be limited only by the addressable memory on your computer.
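To illustrate the kind of multi-level sort Dsort performs on records like these, here is a short sketch, written in Python rather than C for brevity. The file name and the particular key order (time first, then x, y, z) are my own assumptions for illustration:

```python
# Multi-level sort of whitespace-delimited numeric records shaped like
# x, y, z, value, time -- mirroring the Dsort use case described above.
rows = []
with open("model_output.dat") as f:  # hypothetical file name
    for line in f:
        rows.append(tuple(float(v) for v in line.split()))

# Tuples compare element by element, so building a key tuple gives a
# multi-level sort in one pass: by time, then x, then y, then z.
rows.sort(key=lambda r: (r[4], r[0], r[1], r[2]))

with open("model_output_sorted.dat", "w") as out:
    for r in rows:
        out.write(" ".join(f"{v:g}" for v in r) + "\n")
```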
3. Vedit Sorting
I have written a lot about Vedit in this blog over the past couple of years. Vedit is a programmer’s editor extraordinaire. I have used it tens of thousands of times for so many different things that it is indispensable to me. Although Vedit will let you sort ASCII files (*.csv, *.txt, *.dat) spanning millions of rows, there are limits to what it can sort because it writes temporary files as part of its sorting process. These temporary files get written in subdirectories, and those subdirectories can fill up when you are trying to sort files with tens of millions of lines.
Vedit allows you to sort in multiple ways and also allows duplicate records to be removed. Although Vedit isn’t designed as a specialty sorting program, its brilliance shines through: it has allowed me to do some massive sorting operations through the years. However, I have pushed Vedit beyond its limits a few times over the past couple of years, and I am now thankful for the next weapon in my sorting arsenal – Alteryx.
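Those temporary files hint at how tools sort files that do not fit in memory: sort manageable chunks, write each sorted run to disk, then merge the runs. Here is a minimal external merge sort sketch in Python; this is my own illustration of the general technique, not Vedit’s actual implementation:

```python
import heapq
import os
import tempfile

def external_sort(in_path, out_path, lines_per_chunk=1_000_000):
    """Sort a text file too big for RAM: sorted runs on disk, then a merge."""
    run_paths = []
    with open(in_path) as f:
        while True:
            chunk = [line for _, line in zip(range(lines_per_chunk), f)]
            if not chunk:
                break
            chunk.sort()  # in-memory sort of one manageable piece
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)  # one sorted "run" on disk
            run_paths.append(path)

    # heapq.merge lazily merges the sorted runs without loading them whole,
    # so peak memory stays near one chunk regardless of total file size.
    runs = [open(p) for p in run_paths]
    with open(out_path, "w") as out:
        out.writelines(heapq.merge(*runs))
    for r in runs:
        r.close()
    for p in run_paths:
        os.remove(p)
```

The sorted runs are exactly the kind of temporary files that can fill up a subdirectory once the input reaches tens of millions of lines.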
If you are serious about sorting big data files, go directly to Alteryx.
4. Alteryx Sorting
I have three examples to demonstrate the power and speed of Alteryx when it comes to sorting data. The first example has 62 million records and 20 fields per record, with a csv file size of 8.62 gigabytes. The second example has over 411 million records and 8 fields per record, with a csv file size of 19.3 gigabytes. The third example has over 2.08 billion records and 8 fields per record, with a csv file size of 75.4 gigabytes. Alteryx is the only tool of the four listed above that would be able to sort these examples.
Figures 1a and 1b show the very simple workflow that was used to sort the first example. As you can see, the setup is very easy, and you can tell Alteryx to generate whatever type of output you would like to receive. Figure 1a is a three-level sort and Figure 1b is a four-level sort using the same data. You can see from the figures that the additional sorting level required only 25 more seconds.
Figure 2 shows the workflow that was used to sort the second example. Figure 3 shows the CPU and memory usage of my computer during the sorting test. The CPU usage was spread out over multiple cores, and RAM consumption was above 40 GB (not all of which was related to Alteryx). Figure 4 shows the memory usage during the Example 1 trial as reported by Alteryx while the sorting test ran.
Figure 5 shows the workflow that was used to sort the third example. Figure 6 shows the CPU and memory usage of my computer during the sorting test for Example 3. The CPU usage was spread out over multiple cores, and RAM consumption was above 45 GB (not all of which was related to Alteryx).
Figure 7 gives some performance benchmarks for each of the three tests. The first two tests (1a, 1b) each completed in less than 7 minutes, Example 2 took over 18 minutes, and the biggest test, Example 3, took just over 2 hours to finish. Can you imagine how long it would take a person to sort over 2 billion records?
One interesting result is shown in the “records per second” column of Table 1. Since Example 2 had 8 columns of data compared to 20 in Examples 1a and 1b, there was more than a 2x increase in processing speed for this example. These examples were also entirely different data files, with big differences in the data types and field lengths included. Also notice that the sorting performance does not degrade significantly as the file size rises into the billions of records. That is a primary indicator of superb software.
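As a rough cross-check on that column, the throughput can be reconstructed from the approximate run times quoted above (the exact figures are in Figure 7 and Table 1; these are my ballpark numbers):

```python
# Back-of-envelope records per second from the approximate run times
# quoted in the text; Figure 7 / Table 1 hold the exact numbers.
benchmarks = {
    "Example 1 (62M records, 20 fields)":  (62_000_000, 7 * 60),
    "Example 2 (411M records, 8 fields)":  (411_000_000, 18 * 60),
    "Example 3 (2.08B records, 8 fields)": (2_080_000_000, 2 * 3600),
}
for name, (records, seconds) in benchmarks.items():
    print(f"{name}: ~{records / seconds:,.0f} records/second")
```

Even at this ballpark level, Example 2 runs at roughly 2.5 times the rate of Example 1, and Example 3 holds a comparable rate despite being five times bigger than Example 2.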
Final Thoughts
I have been working with many software tools for over 30 years. I have written hundreds of computer codes in as many as 10 computer languages to accomplish many advanced computations. I have studied math and computer science theory deeply through the years – I have even read Donald Knuth’s work.
I understand the advantages and disadvantages of different approaches when it comes to things like sorting. I also know that to get a job done, I do not have unlimited time to research which program is more optimized than another for an operation like sorting. I do not have time to be a “purist” or someone who gets buried in the details of a particular method. I need things that work reliably, like Tableau and Alteryx.
I do not work for any software companies, nor do I write this blog to please anyone in particular. I am not biased in any way when I report results. I just “tell it like it is”. I even try new things when someone informs me of a new technology.
Such was the case a week or two ago, when I received a blog post comment from someone who said something to the effect of: “You should try this program (to remain un-named), it kicks Alteryx in the naughty!”. I laughed so hard that I felt compelled to give the program a try, so today I did. I thought it would be great to compare this new tool to Alteryx in a sorting competition.
Unfortunately, the competition was over before it began. I registered, got an account, and then realized that this program is a cloud-based analytics platform.
If I were patient enough to take the time to learn how to build the sorting workflow, we could get some insight into how good this program might be. However, can you imagine how long it would take me to upload the 19.3 GB Example 2 data set or the 75.4 GB Example 3 data set? It would take hours! I do not have time for that, so the competition did not take place.
Although this program is definitely interesting and something I might investigate further, it is not going to be a day-to-day solution for me because of the way I work. The responsiveness of software running over the cloud, and the bandwidth available for moving big data files, are not sufficient for the types of things I do.
For industrial applications like the ones I have shown in these examples, you need software that is resident on your computer, and you need great computing resources. You cannot get away with playing on laptops or small desktops. You cannot hope to “work through the cloud”. You need professional-grade software, designed and built by people who are the best at what they do.
Alteryx is that company. If you are trying to work with big files and make sense of what the information is telling you, quit fooling around. Get Alteryx, learn how to use it, and kick your data in the naughty! You can thank me later for this advice.