Data Science: Why Data Comes Before Science



Executive Summary

To get the science right, we have to get the data right.

Ken Black, Knoxville, TN

That is why the word “Data” comes before the word “Science” in the now popular profession of “Data Science”.

Just as we have to learn numbers and how to add, subtract, multiply and divide before we can progress to algebra, geometry, calculus and other forms of advanced math, we need to learn how to get the data right before we can solve business and science problems using advanced techniques.

In this article, I offer some perspectives on the importance of “data skills”. Workers in the field of modern analytics need a diverse data skill set that starts with being able to manipulate data through a multitude of operations such as parsing, joining, filtering, converting, reshaping, and aggregating, along with a host of other techniques. Expansive knowledge of the database systems used to store the data is also very helpful.
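As a minimal sketch of what those core operations look like in practice, here is an example using pandas (chosen purely for illustration; the file names and column names are hypothetical, and the workflows I describe later in this article use Alteryx for this kind of work):

```python
import pandas as pd

# Hypothetical donation files; the column names are assumed for illustration.
donations = pd.read_csv("donations.csv", parse_dates=["gift_date"])   # parsing
campaigns = pd.read_csv("campaigns.csv", parse_dates=["start_date"])

# Join the two sources on a shared key.
df = donations.merge(campaigns, on="campaign_id", how="left")          # joining

# Filter to the records of interest.
df = df[df["amount"] > 0]                                              # filtering

# Convert types so downstream math behaves.
df["amount"] = df["amount"].astype(float)                              # converting

# Reshape from wide to long to suit the analysis.
long_form = df.melt(id_vars=["campaign_id", "gift_date"],
                    value_vars=["amount"],
                    var_name="measure", value_name="value")            # reshaping

# Aggregate to the level the question actually needs.
by_campaign = (df.groupby("campaign_id", as_index=False)
                 .agg(total=("amount", "sum"),
                      gifts=("amount", "count")))                      # aggregating
```

None of these steps is exotic on its own; the skill lies in chaining them correctly, and in knowing which one the problem actually calls for.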

Introduction

I have been calling myself a “computational scientist” for many years, long before I ever heard the buzzword “Data Science”. I chose that title because I have specialized for over three decades in serious computer programming and numerical computing to solve scientific and business problems.

The number-crunching types of computing I used to do ranged from numerical models running on PCs to massively parallel supercomputing applications. Most of these projects required building 4D datasets (x, y, z, time) that could be used as input to simulation models. These models approximated governing equations of various types to make future predictions. This work blended advanced mathematics with computer science and required extensive data preparation and visualization techniques.
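To make that concrete, here is a minimal sketch, assuming a small hypothetical grid and a simple analytic placeholder field rather than the actual simulation tooling, of what assembling an (x, y, z, time) dataset looks like:

```python
import numpy as np

# Hypothetical grid dimensions for a small simulation domain.
nx, ny, nz, nt = 50, 40, 20, 24

x = np.linspace(0.0, 10.0, nx)      # spatial coordinates
y = np.linspace(0.0, 8.0, ny)
z = np.linspace(0.0, 2.0, nz)
t = np.arange(nt)                   # time steps

# A 4D field indexed by (x, y, z, time), e.g. an initial temperature estimate.
# Real projects would fill this from observations, not a formula.
X, Y, Z, T = np.meshgrid(x, y, z, t, indexing="ij")
field = 15.0 + 5.0 * np.sin(2 * np.pi * T / nt) * np.exp(-Z)

print(field.shape)  # (50, 40, 20, 24): one value per grid cell per time step
```

Every cell of that array has to be defensible before the model runs, which is exactly why the data work comes first.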

A long time ago, I learned that to get the science right, the data had to be right. In other words, the data had to come before the science. For this reason, I like the term “Data Science”, although many people think it is just a relatively new buzzword.

If we were interested in learning about the “science of data”, then we really would be talking about a completely different topic and we would be sitting in a computer science class focused on databases or some other closely associated topic.

On real-world “Data Science” projects with millions of dollars at stake, a data mistake meant the scientific results would not be valid and a whole lot of money could be squandered. Generally, by quantitatively analyzing and visualizing the model output, we could tell if the input data was good, bad, or ugly. We always tried to avoid creating ugly input data for the models.

In this article, I want to explain why getting the data right is so important in the field of “Data Science”, although it may not seem to be the sexiest part of the job. For me, however, it is the best and most interesting part of the job.

Background

I’m not an esoteric type of computer scientist who solves obscure problems or performs research deep down certain types of rabbit holes, although some people might think that I am. I use my education, my work experience, and the software tools at my disposal to solve real-world, data-driven problems. In other words, I work on practical problems that can be better understood or solved by analyzing and visualizing data.

I like to be able to explain to people how the data that I’ve analyzed leads to conclusions that help solve the problems. These problems are now mostly focused on business-related issues, but I periodically return to my scientific roots and do things like this.

What I have learned through all of these years of solving problems in deadline-driven situations is that getting the data right is crucial, it can be difficult, and it is generally time-consuming. The success of almost all data science projects hinges on the quality and applicability of the data being collected to help solve the problem. Before the solution can be obtained, the data has to be right.

As data sets have continued to grow in size and complexity, the questions being asked of the data have also become more intricate, and getting the answers is more challenging. Working with the data is now even more vexing unless you have discovered a secret recipe for success. I’ll explain my recipe later in this article.


A Data Science Misconception: Getting the Data Right is Pure Drudgery

I just finished reading an article that opened with this surprising statement:

[Screenshot: the opening statement from Wes’s article]

I immediately knew that I was going to enjoy what Wes had to say on this topic because I have been writing about this very issue for a few years.

Now I don’t proclaim to know anything about Wes’s “Pandas”, but I can tell you that I will be learning about this approach to see how it helps us perform “Data Science” activities compared to the way that I currently work. After all, the title of that article makes a bold claim as shown in Figure 1:

[Screenshot: the article’s title]

Figure 1 – I really want to see for myself what the most important tool in data science is!

For now, I’m going to shelve this part of the discussion until I have more time to investigate this claim. I’m sure I’ll have more to say about this in the future.


My Good Luck in Learning How To Get the Data Right

Fortunately for me, I’ve had some really good luck over the past decade. This luck relates to being able to learn and use Alteryx and Tableau, as well as learning statistically based process improvement methods.

Learning about Design of Experiments and studying the work of W. Edwards Deming can help take a quantitatively skilled scientist and turn them into an efficient problem solver. I believe that this is what happened to me, almost by accident.

The First Five Years and the Accidental Discovery of Tableau

One day in February 2008, I was standing in the office of my buddy French. We were discussing how I was processing some data for a multi-variable test we were conducting. I was explaining to him some of the details about a code I was writing to process the data. He innocently made this statement to me:

Hey, my son told me last night that they are learning about a new data visualization program called Tableau. Maybe you should take a look at it to see if it can help you.

A few minutes later, I was downloading a trial version of the Tableau software (version 3.6), and the future course of my career immediately changed. I say immediately, but it really didn’t happen that fast.

It took me some time to learn about how Tableau operated. I had to make plenty of mistakes along the way before things became clear to me. Once that happened by mid to late 2008, I began to experience the euphoria of living in a Tableau world.

During my first five years of using Tableau, I was doing my own Makeover Monday experiments during my project work. I was creating all different types of visualizations for data emerging from a multitude of scientific and business fields. When you do this type of experimentation and have project team members review the results of your testing, you quickly determine what works and what doesn’t work in data visualization.

I had a great time during these years, and I learned to truly appreciate the power and versatility of Tableau. I also began to understand the proper uses of Tableau and when I was attempting to push it too hard. I began to understand how important it was for me to properly prepare complex data for visual and quantitative analysis. I also learned to keep things simple on the visualization front.

The Next Five Years and the Emergence of Alteryx

About six years ago, I started experiencing difficulties in the Tableau-based projects I was conducting because the data I was asked to work with had become complex and voluminous. Different types of data sources were also becoming more common. I will discuss a couple of examples to help explain the issues.

Example 1

Imagine that you are given a 40-year record of donations to an institution and asked to uncover why certain donation campaigns were better than others. Within each year, there were many separate campaigns and the data that was collected during those campaigns was given to you.

Basic math tells you that you are easily dealing with data from hundreds of donation campaigns. The problem is that there was little to no consistency in the data collected during the campaigns or in the file types you were given to work with.

There were no key fields that could be used to tie together the campaigns that spanned four decades, other than time itself. You suddenly find yourself drowning in a pool of chaotic data without a life preserver.

You quickly learn about the problems that one-to-many relationships can cause when you try to tie data together based on date and time. The data explosion that occurs, in that case, makes you realize that it is going to be very hard to tie all this data together into a suitable framework for analysis.
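As a minimal, hypothetical sketch of that explosion (the table contents and column names are invented for illustration), consider joining two sources that share nothing but a date:

```python
import pandas as pd

# Two hypothetical sources that share only a date, not a real key.
gifts = pd.DataFrame({
    "gift_date": pd.to_datetime(["2001-03-01", "2001-03-01", "2001-03-02"]),
    "amount":    [25, 100, 40],
})
mailings = pd.DataFrame({
    "mail_date": pd.to_datetime(["2001-03-01", "2001-03-01", "2001-03-01"]),
    "campaign":  ["Spring A", "Spring B", "Phone drive"],
})

# Joining on date alone is a one-to-many (here, many-to-many) match:
# every gift on 2001-03-01 pairs with every mailing on that date.
joined = gifts.merge(mailings, left_on="gift_date", right_on="mail_date")
print(len(gifts), "gifts ->", len(joined), "rows after the join")  # 3 -> 6
```

Multiply that duplication across forty years of campaigns and the row counts, and the double-counted totals, get out of hand very quickly.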

This is when it occurs to you that there must be a better way to work with this type of data.

Example 2

In another case, imagine that you have sales histories from hundreds of stores, spanning several decades. The sales histories that you have been given go down to the SKU level, meaning that there are tens of millions of transactions for thousands of products.

The volume of data is immense, the questions being asked are many, and the time to do the job is limited to a few weeks. At the same time, you have a second job similar to this one, with a similar amount of data and complex questions to be answered.

In these cases, the projects should have spanned two or three times the time allocated to them. However, that wasn’t the case for us, so as we tried to jam all this data through Tableau in the pressure-cooker situations we found ourselves in, we experienced significant pain.

It was at this time that part 2 of my good luck occurred. I finally got an Alteryx license after years of asking for it!

Alteryx to the Rescue

Just as Tableau took me a while to understand, so did Alteryx. Since I was learning Alteryx during high-pressure projects, I didn’t have the luxury of going through any formal training. I learned how to use Alteryx by solving real-world problems.

This was both a blessing and a curse because I probably could have done some things more easily in the beginning if I had received formal training. However, making mistakes will sometimes help you learn concepts better than just about any other method.

After a few months, I started to see the light. I began to understand the role of Alteryx in Data Science, and I learned how to let Alteryx do the heavy data manipulations and let Tableau do the visualizations. I learned that it is best to let Tableau create clear and concise visualizations to tell the data stories.

Over time, I developed my own style of data prep in Alteryx followed by visualization in Tableau. I learned to fly back and forth between the packages at the speed of thought. I learned to give Tableau data structures that it could easily process so that it was efficient in visualizing huge amounts of data. I asked Alteryx to perform billions of calculations while it shaped and reformed the data into the structures that I needed it to be.
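As one small, hypothetical illustration of that idea (sketched here in pandas rather than in Alteryx itself, with invented file and column names), the point is to hand the visualization tool a table that is already at the grain the questions need:

```python
import pandas as pd

# Hypothetical SKU-level sales transactions (tens of millions of rows in practice).
sales = pd.read_csv("sales_transactions.csv",
                    parse_dates=["sale_date"],
                    dtype={"store_id": "category", "sku": "category"})

# Shape the data to the grain the visualization actually needs:
# one row per store, per SKU, per month, with the measures pre-computed.
extract = (sales
           .assign(month=sales["sale_date"].dt.to_period("M").dt.to_timestamp())
           .groupby(["store_id", "sku", "month"], observed=True, as_index=False)
           .agg(units=("quantity", "sum"),
                revenue=("revenue", "sum")))

# Hand the visualization tool a compact, tidy table instead of raw transactions.
extract.to_csv("sales_extract_for_tableau.csv", index=False)
```

Whether the heavy lifting happens in a workflow or in code, the principle is the same: do the shaping and the billions of calculations upstream, and let the visualization layer stay fast and simple.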

In other words, I understood how to get the best performance out of Alteryx and Tableau so that I could very quickly process very large amounts of complex data to answer the specific questions being asked of the data. By repeating and refining these work methods, I learned how to solve advanced analytics/data science problems with these two tools. This is one of my recipes for success in doing data science work.

Advanced Data Analytics

Advanced data analytics and data science are a continuum for me. I can do pure quantitative and visual analytics and I can run computational models of various types. This is true for me because of my education and work experience. For these reasons, I understand the importance of getting the data right in data science as well as advanced analytics problems.

In whatever type of work you do, correctly processing the data is the most important task. If you feed bad data into sophisticated programs, the output will not lead you to the correct conclusions. Therefore, one reason I decided to write this article is to explain how Alteryx allows you to get the data right without the drudgery!

The Role of Alteryx

Before I used Alteryx, I really had no idea what it could accomplish. It was a mysterious collection of tools that did certain things. In one sense, it seemed like it was designed to perform individual operations on data. It seemed to me to be a tool much like my programmer’s editor (Vedit) that I had been using since the mid-1980s. Luckily, I was incorrect in that view.

What I have come to learn is that Alteryx is a holistic problem-solving platform that allows you to solve simple to very complex problems with repeatability and efficiency. The repeatability exists because the workflows store the sequence of operations used to process the data in the ways that need to occur. The efficiency occurs because Alteryx does so much behind-the-scenes work for you that you can accomplish huge amounts of work in very little time. All of this is possible because the software is optimized for speed, it is scalable, and it is brilliantly designed and executed.

Continuing my Mission of Educating Young Workers

To help you understand how this occurs, I plan to continue writing articles that give examples of how Alteryx and Tableau can be used to solve problems in data science and advanced analytics. I’m going to dive deep into each tool to show how these two platforms can compete with any software platform in existence for doing this type of work.

I have already written about 300 articles that have set up the framework for this more advanced work. If you are so inclined, go back and read about the specific techniques and methods I explained and documented to prepare yourself for this new way of thinking. For an overview of where we are headed, consider reading my data comprehension series.

I’ll be showing how Alteryx can be used to create production data sets in an enterprise setting, where a multitude of data sets need to be brought together to solve challenging problems. I’ll explain how integral Tableau is to the processing and production of complex data during the data processing phase of a “data science” project, as well as for interrogating and interpreting the final results. As always, this will be done without using proprietary data, but the concepts and techniques will be made clear. My goal will be to explain and demonstrate some of my specific recipes for success in doing this type of work.

Finally, I sometimes have mixed feelings about writing these types of articles because they lack specificity and are devoid of actual techniques that people can use. These articles are also not easy to write because they force me to put into words the things I have learned over the past decade about doing incredible work in short amounts of time. That isn’t necessarily easy to do, either.

However, the value of articles like this one is that readers can learn about what is possible with these tools, and how these tools can be used to solve very challenging problems. Alteryx and Tableau are not just suitable for processing and visualizing simple data. These tools are capable of so much more, and for that reason, I believe articles like this one need to be written.


Figure 2 – This is intentionally unreadable, but it demonstrates what it takes to get the data right before the science can be determined. Logic, business rules, and data manipulation techniques are put together in a sequence to blend and move a series of raw data files from a state of near uselessness to a state of value.

 

Update on 2/12/18 From the BBC!

Today I saw an interesting article from John Larder about work that the BBC is doing to restructure their extensive dataset. It had a great quote that is shown in Figure 3.


Figure 3 – For those of you unfamiliar with Q.E.D., let me explain. Q.E.D. is not being used here for quantum electrodynamics, although that is one usage! Rather, Q.E.D. stands for quod erat demonstrandum (“that which was to be demonstrated”), the phrase that marks the end of a mathematical proof. I thought that this quote would be a perfect way to end this article.


 

Updated on 7/1/2019 From Scientific American

Today I saw an interesting article that indicates adult brains can grow new neurons. I’m not too surprised by this, considering how I have progressed in my abilities to work with data and solve challenging problems.

As I mentioned in this article, I feel like I have gotten better with age.

The Adult Brain Does Grow New Neurons After All

The Final Thought

If you use Alteryx, you will optimize your data cleansing and preparation, giving you more time to run your models. I know this to be true, and one day you will, too!

 
