Introduction
Rarely, if ever, does data arrive to me in the exact format that I need it to be in for efficient analysis in Tableau. Â This means something has to be done to the data before I can use it. Â In Part 1 of this series, I provide specific examples of real-world data format changes that had to be completed to get the data ready for Tableau. Â In Part 2 of this series, I discuss additional tools and techniques that I use to accomplish related operations.
To get started, click here to download this wonderful regex cheat sheet.
I almost always have to fix formatting issues, remove incorrect data or remove extraneous information to extract the information that I want to analyze. Â This is nothing new to me since I spent over 20 years as a high-performance numerical modeler in which the model input data sources were a wide range of databases of various quality levels.
Fixing, restructuring and reformatting data is an exercise that requires a “toolbox” of techniques, much like a carpenter needs a toolbox full of tools to do his job. Â When you are used to working with data files that can be billions of lines long, working with typical business data files seems relatively easy at times, but sometimes there are a fair share of challenges that arise. Â This post gives some solutions to common problems that I have encountered through the years.
If you like this article and would like to see more of what I write, please subscribe to my blog by taking 5 seconds to enter your email address below. It is free and it motivates me to continue writing, so thanks!
Motivation for this post
I have used Tableau for over seven years on a daily basis, with over 1,300 distinct sources of data. Â These data sources have come from a wide range of industries including scientific disciplines, education, healthcare, finance, retail, manufacturing, telecommunications, and many other businesses.
These data files have been structured and unstructured, ASCII flat files and delimited files, as well as a variety of database source files. Some of the data has been through extensive review and other sources contain a lot of information of questionable value.
Some of the data I have captured using web-based technologies such as API calls, which can create files having a lot of variation in the data formats captured. Â Capturing Twitter data, for instance, produces data files having different languages, differences in control characters such as line feeds vs carriage return line feeds, etc.
To address the multitude of issues, you have to build your data manipulation toolbox to be able to format the source data appropriately for usage in Tableau. Additionally, you have to build a solid foundation of Tableau experience to recognize why data you have sent to Tableau is not doing what you think it should be doing. Â Sometimes these types of problems can simply be traced back to how the data is formatted in the file you are sending to Tableau.
Techniques for fixing data formats
When it comes to changing data formats, I principally use a 64-bit programmer’s editor named Vedit (www.vedit.com) to operate on ASCII data files. Â I use Vedit if it is a one-time manipulation. If I have to do the work repeatedly, I use the regex parsing tool in Alteryx.
I have used Vedit for my entire professional career, and I’m still amazed on how I’m learning new techniques all the time. This editor is the only tool I have found that can perform blazingly fast operations across multiple files and files having millions or billions of lines (it can essentially handle unlimited file sizes).
Vedit has a macro language that is a bit cryptic and hard to learn but is extremely powerful and configurable. There are a wide range of pre-programmed macros that come with Vedit that allow you do some amazing things. Â Searching and replacing can be performed using pattern matching techniques, regular expressions, or programmed any way you want using the macro language. Â I use Vedit many times every day because it simply gets the job done better than anything else I have seen on the Windows platform, other than Alteryx, of course.
Examples of Data format changes/fixes using Vedit
What follows are a number of real-world, “data-format fixing” examples compiled from the past couple of years that were conducted in Vedit. Some the examples using pattern matching techniques and some use regular expression techniques. Â These examples give some insight into the types of changes that need to be made to data files originating from a variety of sources.
********************************************************************
Example 1: Replacing html tags with quotes to get the content between the tags. Three lines become one quote delimited line.
Source data before the change:
<title>
RRP #56: Chef AJ: From Junk Food Vegan to Whole Food Plant-Based Chef
</title>
Data after the change:
“RRP #56: Chef AJ: From Junk Food Vegan to Whole Food Plant-Based Chef”
Pattern match search and replace
Part 1: find <title>|013|010, replace with ”
Part 2: find |013|010</title> replace with “
********************************************************************
Example 2: Removing html tags to get the text label that resides within the <a> </a> block.
Source data before the change:
<a href=”http://www.apple.com” rel=”nofollow”>Safari on iOS</a>
Data after the change:
Safari on iOS
Regular expression search and replace
Search for <{[^>]+}>
Replace with nothing
********************************************************************
Example 3: Removing spurious <LF> characters that show up in the Twitter tweet field when capturing Twitter data with their API. These arise because people use many different computing platforms, apps and operating systems when submitting Tweets to Twitter.
Data Before Cleaning (3 lines with two spurious <LF> characters):
“Hyper realism illustration for the Wall (2011) by COCOCHOE<LF>
<LF>
#drawing #dessin #? #?? #tableau #painting… http://t.co/92wIoNB3qm”
Data After Cleaning (1 line, no extra <LF> characters):
“Hyper realism illustration for the Wall (2011) by COCOCHOE #drawing #dessin #? #?? #tableau #painting… http://t.co/92wIoNB3qm”
Notes:
- This operation will take two passes to fully clean the file;
- Reg expression search (finds anything before a <LF> other than a <CR>, which keeps it off the end of the line sequence)
Part 1 Using Reg Expression Search/Replace: Â Clean <LF> anywhere, but leaves <LF> at end of lines
Search {[^\h0D]}{\h0A}
Replace \1
Part 2 – get rid of <LF> remaining at the end of lines once Part 1 is completed
Search {\h0A}$
Replace Nothing
********************************************************************
Example 4: Cleaning the Twitter API statuses field to gain access to the Twitter profile
Data Before Cleaning:
Data After Cleaning:
Regular expression search and replace
Search /statuses/[0-9]+
Replace nothing
********************************************************************
Example 5: Swap columns 1 and 3 in a comma delimited ASCII file and reformat the date field that is going to column 1.
Data Before Swap:
DMA,STN,DATE,F_DOLLARS,GALLONS,COUNT
“ALBUQUERQUE-SANTA FE”,0000857996,11-OCT-12,11216.61,2790.2,80
” “,0000811843,01-APR-12,3580.63,730.9,3
Data After Swap:
Date,STN,DMA,F_DOLLARS,GALLONS,COUNT
OCT/11/2012,0100807996,”ALBUQUERQUE-SANTA FE”,11216.61,2790.2,80
APR/01/2012,0000811843,” “,3580.63,730.9,3
Regular Expression Search  and Replace
Search for ^{“}{[^”]+}”,{[0-9]+},{[0-9]+}-{[A-Z]+}-{[0-9]+}
1 2 3 4 5 6
where
1 is first quote
2 is anthing up to the next quote
3 is the 10 digit integer
4 is the DD (DAY)
5 is the MMM (MONTH)
6 is the YY (YEAR)
Replace with
\5/\4/20\6,\3,”\2″
********************************************************************
Example 6: Reformat dates in a very big, multi-million line file
Data Before Change:
“03-JAN-12”
Data After Change;
01/03/2012
Here are all 12 month replacement codes (for dates beyond the year 2000):
{“}{[0-9]+}-{JAN}-{[0-9]+}{“}
01/\2/20\4
{“}{[0-9]+}-{FEB}-{[0-9]+}{“}
02/\2/20\4
{“}{[0-9]+}-{MAR}-{[0-9]+}{“}
03/\2/20\4
{“}{[0-9]+}-{APR}-{[0-9]+}{“}
04/\2/20\4
{“}{[0-9]+}-{MAY}-{[0-9]+}{“}
05/\2/20\4
{“}{[0-9]+}-{JUN}-{[0-9]+}{“}
06/\2/20\4
{“}{[0-9]+}-{JUL}-{[0-9]+}{“}
07/\2/20\4
{“}{[0-9]+}-{AUG}-{[0-9]+}{“}
08/\2/20\4
{“}{[0-9]+}-{SEP}-{[0-9]+}{“}
09/\2/20\4
{“}{[0-9]+}-{OCT}-{[0-9]+}{“}
10/\2/20\4
{“}{[0-9]+}-{NOV}-{[0-9]+}{“}
11/\2/20\4
{“}{[0-9]+}-{DEC}-{[0-9]+}{“}
12/\2/20\4
********************************************************************
Example 7: Get rid of ” ” around 5 comma delimited numbers in a row, that have various formats
Data Before Change
3,”988″,”988″,”1000″,”2961.95″,”-69.81″,”09″
2,1000,988,1000,3035.1,3.34,”09″
6,1000,984,1000,3035.1,15.62,”09″
7,”7700″,”7643″,”7700″,”21790.23″,”-498.9″,”09″
9,”7000″,”6934″,”7000″,”19809.3″,”-412.19″,”09″
0,”6000″,”5946″,”6000″,”16979.4″,”-360.8″,”09″
5,”6700″,”6606″,”6700″,”19075.84″,”-177.88″,”09″
2,”7000″,”6922″,”7000″,”19809.3″,”-377.19″,”09″
3,”7712″,”7655″,”7712″,”21824.19″,”-499.93″,”09″
4,”5500″,”5463″,”5500″,”15564.45″,”-367.19″,”09″
8,”8500″,”8431″,”8500″,”24054.15″,”-533.01″,”09″
1,”8000″,”7912″,”8000″,”22639.2″,”-434.41″,”09″
8,”7703″,”7639″,”7703″,”21798.72″,”-478.74″,”09″
Data After Cleaning:
3,988,988,1000,2961.95,-69.81,”09″
2,1000,988,1000,3035.1,3.34,”09″
6,1000,984,1000,3035.1,15.62,”09″
7,7700,7643,7700,21790.23,-498.9,”09″
9,7000,6934,7000,19809.3,-412.19,”09″
0,6000,5946,6000,16979.4,-360.8,”09″
5,6700,6606,6700,19075.84,-177.88,”09″
2,7000,6922,7000,19809.3,-377.19,”09″
3,7712,7655,7712,21824.19,-499.93,”09″
4,5500,5463,5500,15564.45,-367.19,”09″
8,8500,8431,8500,24054.15,-533.01,”09″
1,8000,7912,8000,22639.2,-434.41,”09″
8,7703,7639,7703,21798.72,-478.74,”09″
Regular Expression Search and Replacements
Note: Multiple Passes Required to Handle Multitude of Formats
Start with
Search “{[0-9]+}”,”{[0-9]+}”,”{[0-9]+}”,”{[0-9]+.[0-9]+}”,”{[0-9]+.[0-9]+}”
Replace \1,\2,\3,\4,\5 (used with all the remaining examples)
For the fifth number with only two decimal digits
Search “{[0-9]+}”,”{[0-9]+}”,”{[0-9]+}”,”{[0-9]+.[0-9]+}”,”{.[0-9]+}”
For the fourth number as int and fifth number with only two decimal digits
Search “{[0-9]+}”,”{[0-9]+}”,”{[0-9]+}”,”{[0-9]+}”,”{.[0-9]+}”
For negatives in the fifth number
Search “{[0-9]+}”,”{[0-9]+}”,”{[0-9]+}”,”{[0-9]+.[0-9]+}”,”{-[0-9]+.[0-9]+}”
For negatives in the fifth number with only two decimal digits
“{[0-9]+}”,”{[0-9]+}”,”{[0-9]+}”,”{[0-9]+.[0-9]+}”,”{-.[0-9]+}”
For integers in the fifth number
Search “{[0-9]+}”,”{[0-9]+}”,”{[0-9]+}”,”{[0-9]+.[0-9]+}”,”{[0-9]+}”
********************************************************************
Example 8: Change date format at the end of a line from yyyymmdd to mm/dd/yyyy
Data Before the Change:
“0003405855”,”S10422″,”2004 Fall-2005 Calendar”,”Mail”,”20040903″
“0025005555”,”S09904″,”2008 Fall-2009 Calendar Sale”,”Calendar”,”20081015″
Data After the Change:
“0003405855”,”S10422″,”2004 Fall-2005 Calendar”,”Mail”,09/03/2004
“0025005555”,”S09904″,”2008 Fall-2009 Calendar Sale”,”Calendar”,10/15/2008
Regular Expression Search and Replace
Search {“}{[0-9][0-9][0-9][0-9]}{[0-9][0-9]}{[0-9][0-9]}{“}$
Replace \3/\4/\2
********************************************************************
Example 9: Change date format in the middle of a comma separated line from yyyymmdd to mm/dd/yyyy
Data Before Change:
0000005855,0000515971,19880926,1000.00
Data After Change:
0000005855,0000515971,09/26/1988,1000.00
Regular Expression Search and Replace
Search  {,}{[0-9][0-9][0-9][0-9]}{[0-9][0-9]}{[0-9][0-9]}{,}
Replace \1\3/\4/\2\5
Final Thoughts
The examples shown above illustrate how a powerful programmer’s editor can be used to change data formats, swap columns, extract information from unstructured files, and convert date formats. There are a lot of other examples I could provide, but these may give you ideas of how to attack problems you may be having.
In part 2 of this series, I’ll discuss how I use Alteryx to accomplish some tasks that make working with data in Tableau much easier on a daily basis.
Pingback: How to Extract Hyperlinks From Excel Formulas To Use In #Tableau Dashboards | 3danim8's Blog
Pingback: Workout Wednesday – Week 2 – YetterDataViz