10 Million Passwords
When Mark Burnett released a dataset of 10 million usernames and passwords scraped from online (don’t worry – they’re probably dead, they’ve been online for a while anyway, and he even did some additional cleaning), some things weren’t much of a surprise: “123456” is still the most common password (with “password” close behind), and the strength of the average user’s password is still pretty abysmal. What definitely wasn’t expected was the amount of profanity and religious references – although some of my numbers are a little skewed from counting the “god” in “godzilla” and the “hell” is “michelle” (more on that later).
Enough talking for now, get exploring!
In this post I’m going to talk about the trickier and more interesting aspects of developing this visualization, such as data management for larger datasets, the new support for regular expressions in Tableau 9.0 that I used to determine whether usernames and passwords had numbers or symbols in them, how I worked around Tableau Public’s 1,000,000 row limit to publish the aggregated results for all 10 million rows, and a little discussion on the joys and difficulties of the word analysis.
Data Management: Millions and Millions of Rows
Ten million usernames and passwords didn’t seem like that much data when I started (~185 MB), but I quickly realized the toll my poor laptop was going to take if I tried returning millions of marks in Tableau. My co-workers Chris and William helped me learn how to develop worksheets more effectively when dealing with large datasets.
The primary method used to control the data was to create a much smaller extract from a subset of the data. This was easy to do in Tableau, as there is an option to only include the “Top N” rows when creating an extract – an option that you can change later, once you want to include all the rows. By developing with a smaller subset and then testing with larger numbers of rows, it was easier to see where performance would suffer.
Essentially, the views took the longest to load when I was trying to look at millions of rows at once. For example, creating a text table of all of the usernames and passwords would be a bad idea. On a faster machine, it may not matter – on my laptop, it definitely did.
However, if you’re applying an aggregation to millions of rows – for example, returning how many usernames start with the letter ‘s’ (983,011) – that is displayed much more quickly because Tableau doesn’t have to render all of those usernames. It just counts them up (which is much faster than rendering images on the screen) and returns a single number.
When in doubt, use a small extract of the data and build with that instead. When you’re not sure if some intermediate operation is going to try returning billions of things to the view, click that handy little “pause” button at the top of the worksheet to keep the view from refreshing with each new change you make. When you’re ready, you can click the button again and allow the view to generate.
Regular Expressions: The Greatest Things Ever
One of the greatest new features in Tableau 9.0 (in my humble opinion) is the support for regular expressions. For those of you who don’t know what regular expressions are, they’re basically data magic that allow you to find any string you want in some block of text based on conditions, instead of stating explicitly what you want the string to be.
For example, let’s say I want to find things formatted like email addresses in some huge book or website. I could try searching for the “@” symbol, but that’s going to take a lot of work to actually find all of the full email addresses. Here’s an example showing how to use a regular expression inside of Tableau to find an email address in a string. (The regex shown returns anything formatted as “email@example.com”.)
So the string that we pass in here is “My email address is firstname.lastname@example.org”, and the regex quickly matches and returns just the email address portion of the string. If you’re not used to regular expressions and this looks like a string of gibberish to you, don’t worry! I have limited experience with them as well, but it’s very easy to get up and running with the basic functionality you need to do cool things.
For this 10 Million Passwords analysis, I used regular expressions to divide all of the usernames and passwords into categories: “Alpha Only”, “Numeric Only”, and “Alpha-Numeric” (everything else). Although Tableau might have been capable of this in the past, it would have involved a nightmarish calculation looking something like this:
Instead, I was able to do it with this simple calculation.
The first line determines if the string is only numbers, and the third line determines if the string is only letters. Everything else is labeled “Alpha-Numeric”. Stupidly simple.
Using regular expressions also helped me fix a bug I hadn’t noticed in my original equation for finding the “Numeric Only” strings. I had originally used this calculation, which only divided up the passwords and usernames into “Numeric Only” and “Alpha-Numeric” (everything else).
Not only does this not include the “Alpha Only” category, it also had the unintended result of not including numbers with a leading zero as “Numeric Only”, since that zero wouldn’t be carried through once it was turned into an integer. So, we were ending up with things like “01234” in the “Alpha-Numeric” category, since STR(INT(“01234”)) = “1234”.
One quick note about using regular expressions in Tableau: I haven’t seen this in other uses of them before, but Tableau requires you to put parentheses around the part of the regular expression you want to return. For example, a regular expression of ‘[a-z]’ would return Null, but ‘([a-z])’ would return a lowercase letter. Just make sure to put parentheses around the part you want to return! (Which in my case was the whole thing. Notice in the above regular expressions, there are parentheses around the whole expression inside of the single quotes.)
Tableau Public Row Limit Work-Around
After attempting to publish to Tableau Public the first time, I discovered that Public has a limit of one million rows. In order to work around this, I created a new data source with aggregated data in it (about 5000 rows worth), then used this data to power the aggregated worksheets.
The aggregated data was essentially just copied from each sheet in the original workbook (CTRL-C when viewing the worksheet) and pasted into blank rows in the new datasource under the correct column headers. Maybe not the cleanest or most elegant solution, but it got the job done.
Next, I wrote a Python script to create a random sample of 900,000 of the 10 million rows, so that viewers of the viz could still see examples of usernames and passwords with certain characteristics.
Tableau can create an extract with the top N rows, but since the rows were sorted by username I wanted a selection of usernames and passwords that would be more representative of the full dataset. If anyone knows a way to easily create random subsets of data in Tableau, feel free to let me know.
There were too many words in this dataset that I wanted to look for in the passwords, so I decided to pick a few categories and see what cropped up. I ended up going with the “Emotive”, “Religious”, and “Profane” categories of words because they were fruitful and, well, interesting. On the bright side, it looks like “Love” wins. The downside is that the f-bomb comes in a close second, beating out all of the religious words combined. For now, we won’t think too much about what that tells us about society.
One difficulty in doing the word analysis, though, is that it’s sometimes difficult to tell the context that a string of letters is used in. The f-bomb is actually a good example of a unique word that doesn’t get a lot of crossover, but words like “hell” and “love” are often contained in other words with completely different meanings. We can see a few examples in the viz. For the most part, all of the words containing the string “love” are using it in that context.
The top hit for “hate”, though, is actually “whatever”, and it’s ahead of all other references by a long shot.
Finally, when we get to “hell”, most of the words found use that string in a totally different context.
Michelle probably doesn’t want to be at the top of this list. Thus, we’re left with a conundrum — how do we find these words used in the context we want, without also including all of these other words that just happen to contain the same string of letters? We could go through and manually exclude words we don’t want, but that’s not feasible if we’re dealing with a large dataset and a lot of words to check. For now, I’ve chosen to leave everything in here so that people can explore this idea a little.
The release of this dataset has helped me personally by making me more aware of how to keep personal information protected online. In order to learn more about protecting yourself and your information online, here are some resources that I found helpful:
Or you can continue being an internet noob with weak passwords. Your choice. (Don’t be a noob.)
Steven Rouk, Apprentice | Data Scientist