Many people may not have heard of Tableau’s new support for regular expressions, and in fact many people may not even know what regular expressions are. I want to fix both of those in this blog post, covering:

If you’ve never used these suckers before, prepare for your mind to be blown.

Many people may not have heard of Tableau’s support for regular expressions, and in fact many people may not even know what regular expressions are. I want to fix both of those in this blog post, covering: What regular expressions are, and why they’re so powerful,

How to use regular expressions in Tableau, and Some neat things you can do with RegEx’s in Tableau. If you’ve never used these suckers before, prepare for your mind to be blown. Introduction to Regular Expressions

Let’s say you need to need to find all the phone numbers in some large text documents. You know they’re all formatted like “###-###-####”, but you have no idea where they’re all located, and there’s no way you have time to manually find them all. Regular expressions to the rescue!

You can use regular expressions to find strings of characters that follow some explicit pattern, even if that pattern is complicated or has a lot of potential variation. In this case, our pattern is very straight-forward: we want three numbers (specifically, numeric characters zero to nine), then a hyphen, three more numbers, another hyphen, then four more numbers. This is a piece of cake to do with regular expressions:

[0-9]{3}-[0-9]{3}-[0-9]{4}

Looks a bit confusing at first, but let’s break it down:

[0-9] means find any numeric character, and the {3} after it means find three in a row The hyphen “-” simply means find a hyphen.

We have [0-9]{3} again, which means find three numeric characters in a row. Find another hyphen.

Finally, we have [0-9]{4} which means find four numeric characters in a row.

Since we’ve put all of these together into a single regular expression, it will only find strings that fit all of these properties, all in a row. So it would match “000-000-0000” but not “000.000.0000” because it wouldn’t find the necessary hyphen after the first three numbers. For a great interactive tutorial on regular expressions, check out RegexOne’s site.

Regular Expressions in Tableau

Support for regular expressions was just released in Tableau, which vastly increases your string processing power inside of the program. What used to take endless nesting of FIND, LEFT, MID, and RIGHT function calls can now be done easily with regular expressions.

Let’s say, for example, that we have a dataset that includes an Address field that includes city, state, and postal code, and we want to parse out at least the state and postal code. (We could potentially also do this using the SPLIT function, or Tableau 9.0’s ETL functionality, but here we’ll use regular expressions.) Maybe the data isn’t formatted completely uniformly, with commas and spaces randomly appearing here and there.

Here is some mock data for us to play with these conditions:

123 North Pole, North Pole AK 12345314 Pie Street, Tastyville SD 31415Something badly formatted 91919$15 is what I have in MY pocket$10000 is what I wish I had in my bank account

Here I have a few lines of standard addresses, as well as some extra lines to show some characteristics of regular expressions. I’ve saved the lines above to a plain text file (.txt), and connected to it using Tableau. As you can see below, we’ll need to tinker with a few options in the Data Source window before starting work.

It looks like Tableau is splitting up our lines by spaces, and also assuming the first line to be column names. By clicking the gear symbol on the pill, we can adjust those settings to be what we want.

We don’t have field names in our text file, so we’ll have those automatically generated for us. Also, we don’t want to separate the lines by anything since we only have one field here (“Address”), so we’ll choose “Tab” as our field separator so it doesn’t do anything. Finally, we’ll rename our field from “F1” to “Address”, then go to the worksheet.

By dragging our Address field onto rows, we can see that the data is just how we typed it in, everything all in one row.

Parsing Zip Codes

Our first task is going to be to parse out the Zip Codes (or at least things that look like zip codes), so we’ll start by creating a calculated field off of Address.

We’re going to be using the new REGEXP functions under the “String” category, and specifically the REGEXP_EXTRACT function. There are slight differences between all four of these regular expression functions, but the basis of all of them is using regular expressions to find some pattern of characters in a string. The “Extract” function returns whatever you’ve found and ‘captured’ (something we’ll talk about in a minute), “Extract Nth” returns the nth ‘capturing group’ you’ve defined, “Match” returns True or False based on whether or not the expression was found, and “Replace” replaces the expression with another string.

For now, we’ll be using the Extract function to pull out the Zip Codes from our Addresses. So let’s take a look at what that regular expression looks like:

The first argument is the string that you’re wanting to search, which is Address in our case, and the second argument is the regular expression you want to look for. A couple of notes about format:

The entire regular expression needs to be surrounded by quotation marks.

Inside the quotation marks, the part of the regular expression you want to capture (i.e., return) needs to be enclosed in parentheses. This is called the ‘capturing group’.

The core of this regular expression is just ‘d{5}’, which says to match exactly five digits (numeric characters) in a row. The ‘d’ is an escape character which means ‘match any digit’, and remember that brackets around a number mean ‘match exactly this many of the previous term’. Finally, we put parentheses around the whole thing because we want to return exactly that, and then quotes around all of that to make the argument a string. Clicking ‘OK’ and then dragging our new Zip Code calculated field onto the view gives us this:

And there we go! All of our zip codes are now parsed into their own field, with no trouble at all. Notice that our regular expression matches any string of five digits, though, including the ‘10000’ following the dollar sign. You can create expressions that are more complicated in order handle ambiguity or special situations like we have here, but for data that’s fairly clean and formatted it shouldn’t be much more complicated than what we had. I simply wanted to introduce some variation into our data to show how regular expressions handle the differences.

Parsing States

Very similarly, let’s write a short regular expression to match our States. The simplest way to do this is to just match two capital letters in a row, like this:

The [A-Z] matches any capital letter, and {2} means we want exactly two capital letters in a row. Dragging State into the view as well gives us:

Notice that once again, our regular expression can capture things we don’t intend, such as “MY” in row three. Of course, with adequately cleaned data this shouldn’t be too much of an issue.

Parsing Street Names

Let’s do one last example that’s a little more complicated. Let’s say we want to capture just the street name without any of the preceding numbers. So for row two, we would just want to return “Pie Street”. First, we’ll break down what we want our regular expression to match and return. We’ll want to find some numbers in a row, one or more: d+

Then, we’ll want a space: s

Next will be the street name, and this is what we’ll want to return. It could be multiple words: ([ws]+)

Finally, we’ll know we’ve hit the end of the street name when we see a comma: ,

Let’s put this all together into our regular expression:

Notice that by matching the full longer expression, but only returning the capturing group of ‘([ws]+)’, we’re able to more specifically control which part of the string we get back. If we had just tried using ‘([ws]+)’ without the larger context, we could have gotten something completely different back. By dragging the Street Name field onto the view, we can see that it worked:

Conclusion and Resources

All of this is just the beginning! Regular expressions are an extremely powerful tool for parsing out information from text. You can use them to find and return phone numbers, email addresses, IP addresses, and much more. As an example, I use regular expressions in my 10 Million Passwords Analysis to determine whether usernames and passwords are alpha-only, numeric-only, or alpha-numeric. Before regular expressions, that calculation would have been a total nightmare, now it’s just a few simple lines.

If anyone has any questions or comments, please feel free to leave them. If people want a more in-depth discussion of a certain aspect of regular expressions, I would love to write another blog post on it. As a final resource, I use this regex cheat sheet ALL the time – it’s a great resource if you can’t remember how to represent some kind of character in an expression.

Happy RegEx-ing!

Steven Rouk, Visualization Scientist

2 Responses

Leave a Reply

Your email address will not be published. Required fields are marked *