How To Achieve Better Data Comprehension, Part 3

People speak of finding balance. If you have balance, you do everything OK. But to excel at your craft, you need obsessive, unbridled fanaticism. Not only does excellence require such commitment, it demands it. A life worth living is frenetic, disjointed, break-neck and quite fantastic. Balance doesn’t lead to happiness, impassioned dedication to one’s life purpose does.

Dean Karnazes – The Road to Sparta, 2017


Prelude – The Explorer versus The Exploiter

Throughout your career, you will acquire new skills for processing data as you gain knowledge and experience. There will be times when you learn new concepts, techniques, and practices. I will refer to these times as explorer phases. During an explorer phase, you will be focused on learning, and you will apply new techniques to problems that range from simple to moderately complex.

There will also be times when you heavily utilize your most refined skills. I will refer to these times as exploiter phases. During an exploiter phase, you will attempt to solve difficult problems to extract the maximum value from the data you are investigating. This is the time when all your hard work will pay off in big savings for your employer. As an exploiter, you will be able to tell data stories, uncover hidden truths in the data, and help your company make sound data-driven decisions. In other words, you will achieve great data comprehension.

Ideally, you will experience multiple iterations of each of these phases throughout your career because technologies change. If you are willing to continuously learn new techniques and approaches, you will be able to solve increasingly difficult challenges as you move from explorer to exploiter.

As you age and your experience base grows, you will naturally migrate from being an explorer to being an exploiter. As shown in Figure 1, my path to becoming a data exploiter was built on continuous learning in math, science, programming, Tableau, and Alteryx. I have studied most of these topics over my lifetime, with Tableau arriving more than 9 years ago and Alteryx about 4 years ago. I suppose that 40+ years of computer programming has also helped make this transition possible.


Figure 1 – How I moved from a data explorer to a data exploiter over time. In this case, time represents more than 30 years of applied math, science, and advanced analytics, coupled with over 40 years of computer programming.


During this process, you will transform from the new kid on the block to the wise and experienced leader. Jobs that were difficult early in your career become more manageable as you learn to navigate the data world in which you work, whether it be in engineering, science, math, business, or a whole host of other fields.

Based on my observations, it is now rare that a person has the luxury of learning a select set of data skills and using them throughout an entire working career. The pace of technological development across the world is simply too staggering. This is especially true in information technology, where workers need to be data agnostic and comfortable working with a wide range of databases and computing/data storage platforms. To do this effectively, you need to learn to use mature and powerful computer software.

For this reason, I continue to work as both an explorer (see my Power BI series) and as an exploiter. I learned a long time ago that I can never stop learning because as soon as I choose to rest, I’ll get left in the dust. I also decided to never let that happen.


Introduction

This is my final article about achieving better data comprehension. In parts 1 and 2, I was a bit philosophical when discussing how I have learned to achieve better data comprehension. I didn't mention the tools I use; I only compared my older work methods to the new methods I now use. If you want to read all three parts of this series, please start by reading Part 1 by clicking here. Part 2 of the series can be accessed by clicking here. Fair warning: this series took me 30 years to understand and 8 months to write, so it is long, but it is insightful.

This article explains and demonstrates how I have learned to improve my data comprehension in modern-day analytics projects in the scientific, medical, business, and other fields. The last five years of my 30-year career have been pivotal in allowing me to develop, practice, and comprehend the techniques I explain in this article. As with anything that requires multiple skills, you must study, practice, and be dedicated if you want to earn the title of data exploiter.

I specifically explain how the combination of Alteryx and Tableau allows me to accomplish wonderful things with data in a fraction of the time required by the more traditional methods I have used, such as writing custom computer programs. I understand this difference completely because I have written hundreds of thousands of lines of computer code in as many as 10 different languages. I have worked with a wide array of data, which helps me fully appreciate what can now be done so easily in Alteryx and Tableau compared to more labor-intensive, traditional programming methods.

I specifically discuss how I consistently achieve insightful data comprehension by using Alteryx and Tableau together in a cycle of data exploration followed by insight exploitation. I explain how this is possible by illustrating my methods with a real-world case study. This study is an example of something I have been working on for nearly three years, and it should be of interest to most readers who happen to live on planet Earth.

The Example: Comprehending Global Climate Change

I address the topic of global climate change to determine whether global warming is a real or hyped phenomenon. I chose this topic of study back in 2014 because I have been a student of the earth since my undergraduate years in the mid-1980's, which is when I completed my geological studies.

Since that time, we have been inundated with evidence that global warming is occurring. Every day, it seems, we have to forge our way through the hype to determine whether or not global warming is a real phenomenon, and whether it is being caused by human activities. In this article, I will show how I acquired, processed and visualized large quantities of weather data to gain additional insights on climate change and global warming.

I decided that I had to do my own investigation of global warming because I was not able to find accurate and insightful global warming visualizations, nor any simple explanations of how temperatures have varied over space and time. The professionally published papers are too myopic and detailed to offer true insight into the spatial and temporal climate changes we have experienced, so I had to do this work myself.

I started my quest by finding a reliable source of data that spanned a large time period. After studying the project that manages the data, I obtained the data and the real work began.

Early on in this study, I decided that I had to quantitatively process the data at multiple levels of detail to answer strategic questions. In the purest sense, I wanted to know whether or not global warming is a real phenomenon. Once I decided to do that, I had to be prepared for a long journey, one in which the data had to be consistent and accurate enough to answer the question conclusively and tell the complete story.

In the Beginning (2014)

I started the work in late 2014, and I first documented the preliminary work in a series of five articles in early 2015. These articles explain in detail how components of daily weather data can be accessed, processed in Alteryx, and visualized in Tableau. With easily over 100 hours of work invested, this series became a great reference that allowed me to continue the work a couple of years later. You can access the articles by clicking the blue text.

  1. Part 1 – Project Introduction
  2. Part 2 – The Source of Climate Data
  3. Part 3 – Reading Weather Station Data
  4. Part 4 – Alteryx Workflow Details For Reading Data
  5. Part 5 – Using Tableau To Examine Texas Temperature and Precipitation Data

Since the Global Historical Climatology Network (GHCN) stores daily weather information going back to the 1700s, there is a lot of data to process. There are billions of records of data measured at over 100,000 monitoring stations across the globe.

Not every monitoring station continues to be an active data collection site, but historical data from defunct stations are still included in the data set. There are many different types of weather measurements included in the data, so I had to be purposeful in choosing what I wanted to process and analyze.

Exploiting the Data Via Alteryx

I also had to be selective in choosing which stations I wanted to include in my analysis. These two choices, which measurements to keep and which stations to include, impose constraints on your data processing methodology. Your tools need to be flexible so that you can pick and choose what data to process and visualize, to avoid producing extremely large data volumes that take a long time to create and a long time to use. Learning how to choose which data to interrogate is a necessary skill as you move toward becoming a data exploiter.

For these reasons, I chose to study daily minimum and maximum temperatures (and daily range), precipitation, snowfall, and snow depth. These variables address two of the three primary components of weather: temperature and precipitation. At this time, I have not chosen to study the third component, wind. I have processed data from just over 12,000 monitoring stations, although much of my analysis is based on about 5,400 monitoring stations around the world. I routinely use over 126 million records of daily temperature data.
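This selection step can be scripted. Below is a minimal Python sketch of it (my production work was done in Alteryx, so this is a stand-in, not my actual workflow). It assumes the fixed-width layout documented for the GHCN-Daily inventory file, ghcnd-inventory.txt; the function name and the min_years threshold are illustrative choices.

```python
# Sketch: pick stations from the GHCN-Daily inventory (ghcnd-inventory.txt)
# that report every element of interest for a long period of record.
# Column positions follow the published GHCN-Daily documentation.
ELEMENTS = {"TMAX", "TMIN", "PRCP", "SNOW", "SNWD"}

def stations_with_elements(inventory_path, elements=ELEMENTS, min_years=50):
    """Return station IDs reporting all requested elements for at
    least `min_years` years of record (illustrative threshold)."""
    coverage = {}  # station ID -> {element: years of record}
    with open(inventory_path) as f:
        for line in f:
            element = line[31:35].strip()
            if element not in elements:
                continue
            station = line[0:11]
            first, last = int(line[36:40]), int(line[41:45])
            coverage.setdefault(station, {})[element] = last - first + 1
    return sorted(
        station for station, elems in coverage.items()
        if elems.keys() == set(elements)
        and all(years >= min_years for years in elems.values())
    )
```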

To even begin this type of study, you have to be able to read the data that is stored in the GHCN. This is not a trivial task, since the data is stored in an arcane flat-file format. I believe that this format was chosen to minimize storage requirements, but as shown in Figure 2 below, it does not lend itself to immediate data comprehension. A lot of work is needed to uncover the actual data stored in the files, and many different types of files have to be read to construct the global data set. Operations are needed to parse and blend the data, to perform unit conversions, and to create the dates associated with each measurement.


Figure 2 – An example daily data flat file for station USW00093822. Each of these files is 269 columns wide and thousands of rows deep.
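To make that parsing work concrete, here is a minimal Python sketch of a *.dly reader (again, my production workflows are in Alteryx). It assumes the documented GHCN-Daily fixed-width layout: an 11-character station ID, year, month, and element code, followed by 31 eight-character day slots whose first five characters hold the value, which is why each line is 269 characters wide. The function name and output shape are illustrative.

```python
from datetime import date

TENTHS = {"TMAX", "TMIN", "PRCP"}  # elements stored as tenths of a unit

def parse_dly(path, elements=("TMAX", "TMIN")):
    """Parse one GHCN-Daily *.dly file into (station, date, element, value)
    rows. Each 269-character line holds one station/month/element record:
    an 11-char ID, 4-char year, 2-char month, 4-char element code, then
    31 day slots of 8 chars (5-char value + 3 flag chars)."""
    rows = []
    with open(path) as f:
        for line in f:
            element = line[17:21]
            if element not in elements:
                continue
            station = line[0:11]
            year, month = int(line[11:15]), int(line[15:17])
            for day in range(31):
                raw = int(line[21 + day * 8 : 26 + day * 8])
                if raw == -9999:          # missing observation
                    continue
                try:
                    when = date(year, month, day + 1)
                except ValueError:        # e.g., day 31 in a 30-day month
                    continue
                value = raw / 10.0 if element in TENTHS else float(raw)
                rows.append((station, when, element, value))
    return rows
```

Each parsed row carries a real calendar date and a value in natural units, which is exactly the kind of flat, tidy shape that Tableau can consume directly.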


You have to use software like Alteryx to create the data files that can be visualized in Tableau. Three years ago, when I began the project, I had to learn how to read the data from these flat files and then produce the Tableau-compatible files for visualization. This process is critical in achieving data comprehension. If you cannot successfully complete this step, which is turning arcane data into forms that can be visualized and comprehended, you cannot complete your mission.

To date, I have written more than 10 articles that describe how this work was completed. If you want to learn how to do this, you can read those articles for all of the details. The articles start with reading data files and then progress to writing strategic types of output files in different blended and aggregated forms. Figures 3 through 11 show a few of the Alteryx workflows that I wrote to read and process the more than 100,000 daily data files like the one shown in Figure 2.


Figure 3: Step 1 – Process the Monitoring Station Data.



Figure 4: Step 2 – Process a Selection of Monitoring Station Data Having Max/Min Temps.



Figure 5: Step 2a – Write Daily Max/Min Temp Records.



Figure 6: Step 2b – Write Monthly Max/Min Temp Records.



Figure 7: Step 2c – Write Decade Max/Min Temp Records.



Figure 8: Step 3 – Batch macro that processes the *.dly (daily data) files. It receives a list of stations to process.
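In Python terms, the batch macro in Figure 8 behaves like a loop over the station list, parsing each station's *.dly file and appending the results to a single output. A minimal sketch, reusing the hypothetical parse_dly function from earlier and assuming the daily files sit in one directory named after their station IDs:

```python
import csv

def process_stations(station_ids, data_dir, out_path):
    """Loop over stations, parse each one's .dly file, and stream the
    parsed rows into one Tableau-friendly CSV (illustrative layout)."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["station", "date", "element", "value"])
        for station in station_ids:
            for sid, when, element, value in parse_dly(f"{data_dir}/{station}.dly"):
                writer.writerow([sid, when.isoformat(), element, value])
```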



Figure 9: Step 4 – Process Daily TMax/TMin Data To Generate Monthly Averages.
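The same aggregation is a short pandas operation. A sketch, assuming the hypothetical CSV produced by the batch sketch above:

```python
import pandas as pd

# Average the daily TMAX/TMIN values up to station-month level.
daily = pd.read_csv("daily_records.csv", parse_dates=["date"])
daily["year"] = daily["date"].dt.year
daily["month"] = daily["date"].dt.month
monthly = (
    daily.pivot_table(index=["station", "year", "month"],
                      columns="element", values="value", aggfunc="mean")
         .rename(columns={"TMAX": "avg_tmax", "TMIN": "avg_tmin"})
         .reset_index()
)
monthly.to_csv("monthly_averages.csv", index=False)
```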



Figure 10: Step 5 – Process Data from the 2010's and 1960's to Compute Temp Changes.
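A pandas sketch of the same comparison, computing the per-station, per-calendar-month change in average maximum temperature between the two decades (file names carried over from the previous hypothetical sketch):

```python
import pandas as pd

monthly = pd.read_csv("monthly_averages.csv")
monthly["decade"] = (monthly["year"] // 10) * 10
subset = monthly[monthly["decade"].isin([1960, 2010])]

# Mean of the monthly averages within each decade, per station and month.
decade_means = (subset.groupby(["station", "month", "decade"])["avg_tmax"]
                      .mean()
                      .unstack("decade"))
decade_means["tmax_change"] = decade_means[2010] - decade_means[1960]
changes = decade_means.dropna(subset=["tmax_change"]).reset_index()
changes.to_csv("tmax_change_1960s_to_2010s.csv", index=False)
```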



Figure 11: Step 6 – Compute Country Averages By Month. A cool 176 million daily records were read from the input file!
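Country-level rollups are possible because, in the GHCN station naming scheme, the first two characters of a station ID identify the country. A pandas sketch of this step, again using the hypothetical monthly_averages.csv:

```python
import pandas as pd

monthly = pd.read_csv("monthly_averages.csv")
# In GHCN-Daily, the first two characters of the station ID are a country code.
monthly["country"] = monthly["station"].str[:2]
country_by_month = (monthly.groupby(["country", "year", "month"])
                           [["avg_tmax", "avg_tmin"]]
                           .mean()
                           .reset_index())
country_by_month.to_csv("country_monthly_averages.csv", index=False)
```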


The reason I chose to work at multiple levels of detail is that there is a large amount of spatial and temporal data to consider when deciding whether global warming is happening. In my case, I worked my way through the data by starting at a coarse level of detail (decade aggregation) and working down to yearly, monthly, and daily levels of detail. I found that time series plots at different levels of aggregation were very helpful for identifying climate changes over time.

I also considered aggregating the data by country and by state, and plotting data at all the individual monitoring stations. By examining a fine level of detail, all the way down to daily maximum and minimum temperature data at individual monitoring stations, I was able to see how much temperature variability occurs over time.

The sparring I did with the data took a long time since I had to do this work in my spare time and it was computationally intensive. It was not initially obvious to me which method of data preparation would lead to the best insights for this type of data.

By starting with monthly aggregated maximum and minimum temperatures at each monitoring station, I began to see patterns emerge in the data that totally surprised me, as shown in Figures 12 (March) and 13 (May). I was able to simplify the approach by looking at temperature changes over 50 years, from the decade of the 1960's to the decade of the 2010's. By computing these 50-year differences for each month of the year, I achieved a fascinating clarity in the data, which I described in this presentation.


Figure 12 – Computing the temperature differences in March between the 1960's and the 2010's reveals a large pattern of heating in the central US. Spring is occurring sooner now than when I was young.



Figure 13 – Conversely, May temperatures are cooler in the 2010's compared to the 1960's. What would happen to crops in May if farmers were tempted to start planting earlier in the year because of the March warming?


By using an iterative approach to processing and visualizing the data, I achieved comprehension at multiple levels of spatial and temporal aggregation. With each successive iteration (there have been three), I was able to ask additional questions of the data and to check the accuracy of my previous work. I was pleased to find that each iteration produced unique results. This method of work allows data comprehension to be achieved because you learn something each time you process and visualize the data.

This comprehension has allowed me to understand the variability of weather and climate that occurs on earth. This method of using descriptive analytics examines the history of what has happened, but it doesn’t explain why these changes have occurred. Since I am a trained mathematical modeler, I even tested whether the data could be used to do some forward forecasting of temperature change. When you try to do this, you quickly realize how difficult it can be to make predictions of future weather conditions!

To summarize this work, I would say that my comprehension of global warming improved dramatically. I uncovered climate changes that have been happening over my lifetime (50+ years) that surprised me and that corroborate the large-scale changes we have observed, including permafrost melting, shrinking glaciers, loss of arctic sea ice, and seasonality changes. Many of these findings are discussed in the 45-minute video embedded in this article. I also identified that certain regions have cooled in certain months over the past 50 years, which was another very surprising result.

The General Business Case

If I switch over to discussing a business case, for example, one where I'm looking at a new data source from a client, I begin my exploration by visualizing the data at the highest level of aggregation. Once I get a feel for the information, I quickly go back to the data to view another level of aggregation. The same iterative process occurs in those projects just as it did in the climate data study.

Typically, as you begin exploring the data and go through your first cycle of processing and visualization, you might talk to somebody who is knowledgeable about the topic or who works with the data on a routine basis. You might ask them for insights such as long-term trends and behaviors. You might even consider striking up a relationship like the one described in this article. These discussions can lead you to develop a new approach to processing and visualizing the data because they reveal business rules or logic that must be applied during processing.

When you are not the subject matter expert in the data you are working with, you can quickly improve your data comprehension by using a standard approach to initial data exploration. I use Tableau with its standard visualization types, such as horizontal bar charts, histograms, and maps, to interrogate the data from several angles. This approach keeps you focused on the highest levels of the data as you begin building your knowledge base about it.
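The same first-pass ritual can also be scripted. Here is a minimal pandas sketch of that initial reconnaissance, with a hypothetical file name standing in for whatever the client delivers:

```python
import pandas as pd
import matplotlib.pyplot as plt

# First look at an unfamiliar table: size, types, ranges, missing values.
df = pd.read_csv("new_client_data.csv")   # hypothetical client extract
print(df.shape)                           # how big is it?
print(df.dtypes)                          # what kinds of fields?
print(df.describe())                      # ranges and obvious outliers
print(df.isna().mean().sort_values())     # share of missing values per column

# Histograms of every numeric column: the scripted cousin of a quick
# pass through Tableau's standard chart types.
df.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()
```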

Final Thoughts

Over the past several years, I have learned to use Alteryx and Tableau in an iterative cycle of data processing and visualization to achieve a high degree of data comprehension across many disciplines. It doesn’t matter to me where the data originates. The data can be from science, business, medicine, education, telecommunications, manufacturing, or many other disciplines. It doesn’t even matter what format the data is stored in due to the multitude of data connectors available in Alteryx and Tableau.

If you use the techniques I have outlined, you will achieve better data comprehension in a very rapid and transformative way. Your evolution from data explorer to data exploiter will occur much more quickly than if you remain a jack of all trades, or sit and toil writing countless thousands of lines of computer code to process your data. Alteryx will allow you to quickly unlock the secrets in your data, while Tableau will brilliantly display what you have found. Once you become a data exploiter, you will know it, and so will the people you work with.

Disclosure

It took me over three decades to learn what I have described in this three-part article. It took me over 8 months to write and thousands of words to formulate my thoughts. I’m glad I am done with it because it has been exhausting to create.

I do not work for either Alteryx or Tableau and I do not receive any benefits from either company. I do not make any money by writing this blog. I write this blog because I want to. All statements made herein are mine, and I take full responsibility for my commentary and perspectives.

I openly share this type of information so that we, as a society, can make better decisions with the data we collect. I have been practicing data sciences for over 40 years, ever since I wrote my first computer program. I have been steadily employed my whole life as a quantitative worker and have now reached a point where I have wisdom when working with data. I share this wisdom by writing this blog.

I have tried many approaches in working with data across many computing platforms. No combination of software has ever even remotely approached the power, speed, and versatility that I achieve when using Alteryx and Tableau to process and visualize data, especially when the data originates in different systems and in different pieces. It is that simple. You can choose to believe me or not. That choice is yours to make, but in any event, thanks for reading!
