Test 2 – A Big Data Benchmark Study
In my experience, the best way to test a new computational platform is to smash it upside the head a few times. In the case of the new Alteryx AMP engine, there are a whole series of new algorithms to test because each tool had to be redesigned to take advantage of the available cores.
To effectively test something like this, you must design a workflow that uses a lot of different tool combinations and settings, and you have to hit the engine hard. You also have to be willing to commit yourself to doing some serious work.
In late 2019 and through the first quarter of 2020, I made a commitment to find out for myself whether the AMP engine was for real or not. I had previously done a study like this for the Tableau Hyper engine, so I knew that this would be a labor-intensive and time-consuming project.
Over the past few years, I had heard rumors of the AMP engine. Once the Beta version was made available for testing, I knew the time was right for me to get to work. As it turns out, this testing was good for both me and for the Alteryx AMP development team. I showed them a few things, and they taught me a few things in return. I would say the time spent doing this work was mutually beneficial.
The first thing I had to do was to assemble a credible workflow that would put AMP through its paces. Figure 4 shows the anonymized workflow that I created.
Although this batch macro looks somewhat innocuous, trust me when I tell you that there is a lot of computational activity within that workflow. Additionally, there are a couple of years of serious intellectual development and research that went into developing that algorithm. I was tempted to count the number of computations occurring within this one, but there simply are too many to make an accurate assessment.
The next thing I had to do was assemble a wide range of data input to run through the workflow. Once again, two and a half years of data were assembled into 38 data sets that contained over 8.6 billion records. The total volume of data was in excess of 540 GB in yxdb format.
Although I ran all 38 test cases as part of my research, I didn’t record the run-time metrics for each example the first time I ran them in the E1 engine. Since they were run in a batch macro, the only recorded runtime I received is the total shown in Figure 5 (>35 hours!). After I received a beta copy of the AMP engine (aka the E2 engine), I decided to pick seven test cases for comparison, which amounted to over 50% of the data volume at 4.37 billion records.
As shown in Figure 6, the test cases ranged from a low of 378 MB at 5.8 million records (Example 4) to a high of 108 GB with 1.7 billion records (Example 38). As can be seen in Example 4, the computational speed nearly doubled, from 6.1 to 12.1 million records per minute. This means that for that test case, we can say that the AMP engine is about twice as fast as the E1 engine, or that it has shown nearly a 100% improvement in computational speed.
Furthermore, these results show that the AMP engine consistently outperformed the E1 engine, with improvements ranging from 33% for the biggest test case to nearly 100% for the smallest test case. The peak processing speed jumped from 6.1 million records per minute to over 12.1 million records per minute.
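For readers who want to check the arithmetic behind these percentages, here is a minimal sketch. The function names are my own for illustration and are not part of Alteryx; the only inputs taken from the study are the two Example 4 throughput figures (6.1 and 12.1 million records per minute) and the 33% figure for the biggest test case.

```python
def speedup(old_rate: float, new_rate: float) -> float:
    """How many times faster the new throughput is than the old one."""
    return new_rate / old_rate


def percent_improvement(old_rate: float, new_rate: float) -> float:
    """Relative throughput gain of new_rate over old_rate, in percent."""
    return (new_rate - old_rate) / old_rate * 100.0


def runtime_fraction(improvement_pct: float) -> float:
    """Fraction of the original runtime left after a given throughput gain.

    Throughput and runtime are inversely related for a fixed record count,
    so a 100% throughput gain halves the runtime.
    """
    return 1.0 / (1.0 + improvement_pct / 100.0)


if __name__ == "__main__":
    e1_rate = 6.1    # million records per minute, E1 engine, Example 4
    amp_rate = 12.1  # million records per minute, AMP engine, Example 4

    print(f"speedup: {speedup(e1_rate, amp_rate):.2f}x")                   # 1.98x
    print(f"improvement: {percent_improvement(e1_rate, amp_rate):.1f}%")   # 98.4%
    # The biggest test case improved by about 33% in throughput.
    print(f"runtime left at +33%: {runtime_fraction(33.0) * 100:.1f}%")    # 75.2%
```

Note the difference between a throughput gain and a runtime saving: a 33% faster engine does not cut the runtime by 33%, it cuts it to roughly three quarters of what it was.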
This wide-ranging test gives me confidence that when I turn on the AMP engine, I’m going to be getting some fast performance even with very large data sets.
What I am learning through additional testing, however, is that results will vary depending upon the problem size, the tools used in the workflow, and the strategic design of tool ordering. This is an Alteryx study area that is ripe for additional testing and research. You can bet I’ll be writing another article or two on this topic in the future.
Final Thoughts
Speaking of turning on the AMP engine, you might be wondering how that is done. Well, Figure 7 shows you how to do it. To learn more about AMP, click this link.
All you have to do is click the lowermost check box, called “Use AMP Engine”, on the Runtime tab of the Workflow Configuration pane. This is akin to flipping the nitrous oxide switch in a ’70s-era muscle car.
Just as you probably wouldn’t use nitrous oxide every time you drove that car, you might have workflows where using the E1 engine is just fine. That is why AMP is an option rather than a standard setting. As more tool compatibility development is completed over time, the AMP engine might become the default engine. For now, its usage is optional, but for me it is already a standard setting.
There might be conditions where the AMP engine can cause your workflow to become non-responsive. If that happens to you and you cannot get the workflow to run after clicking the check box in Figure 7, you will have to undo the usage of the AMP engine.
To do this, you will have to edit the *.yxmd file (the file that contains the XML for the workflow). Make a backup copy first, then use a plain-text editor like Notepad to search within this file for the term RunWithE2 and set its value to false. Figure 8 shows an example of this term.
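If you have many workflows to fix, the same edit can be scripted. The following is a rough sketch, not an official Alteryx utility. It assumes the setting is stored as an XML element of the form `<RunWithE2 value="True" />` and that the file is UTF-8 encoded; check your own file against Figure 8 before relying on it. The function names and the `.bak` backup suffix are hypothetical choices of mine.

```python
import re
import shutil
from typing import Tuple

# Assumption: the flag looks like <RunWithE2 value="True" /> in the .yxmd XML.
_RUN_WITH_E2 = re.compile(r'(<RunWithE2\s+value=")[^"]*(")')


def disable_amp_text(xml_text: str) -> Tuple[str, int]:
    """Return the XML with every RunWithE2 value set to False, plus the match count."""
    return _RUN_WITH_E2.subn(r"\g<1>False\g<2>", xml_text)


def disable_amp_file(workflow_path: str) -> int:
    """Turn off AMP in one workflow file, keeping a .bak copy of the original.

    Returns the number of RunWithE2 settings found; 0 means the file was left alone.
    """
    # newline="" keeps the file's original line endings untouched.
    with open(workflow_path, "r", encoding="utf-8", newline="") as handle:
        original = handle.read()

    updated, count = disable_amp_text(original)
    if count:
        shutil.copy2(workflow_path, workflow_path + ".bak")
        with open(workflow_path, "w", encoding="utf-8", newline="") as handle:
            handle.write(updated)
    return count


if __name__ == "__main__":
    # In-memory demonstration on a made-up fragment, not a real workflow.
    sample = '<Properties><RunWithE2 value="True" /></Properties>'
    fixed, found = disable_amp_text(sample)
    print(found, fixed)
```

A regular expression is used here instead of an XML parser so that the rest of the file is left byte-for-byte as Designer wrote it. If the script reports zero matches, the setting is stored differently in your version and the manual edit described above is the safer route.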
Upcoming in Part 2
In Part 2, I will show how Tableau was able to consume some really large data sets produced by this testing. Tableau has helped me understand this data better than I ever thought I would be able to. Stay tuned.