rafekettler.com

My thoughts on programming and technology

Analyzing the LulzSec Password Leak

06.16.2011 at 11:30 AM in security | View Comments

Maybe there's something wrong with me, but when I first heard about LulzSec releasing 62,000 passwords, I was actually pretty excited. I've always wanted to a little analysis on a big leak like this, and now I finally get to do one.

So, as a brief overview, I'm going to take a look at a few different things: password frequency, password length, and password complexity, and see how the people in the link were doing security-wise.

Getting the passwords into one place

The original text document was not perfectly formed. It took a bit of tweaking to get just the passwords. First off, there was some chatter at the top that I had to remove. Also, part of the document was formatted password | email |, part was formatted password | email, and another part was formatted number | password | email, so I had to change that as well. I first replaced all instances of | (a pipe with a space on each side) with a single space, to make the input more digestible for awk. Then I did this:

$ awk '$1 ~ /.+@.+\..+/ { print $2 } $2 ~ /.+@.+\..+/ { print $1 } $3 ~/.+@.+\..+/ { print $2 }' ~/passwords.txt > ~/justpasswords.txt

Basically, I decided which field to print based on what field looked like a password (had some text, followed by an @, followed by more text, then a ., then more text). I lost about 13 passwords in the process, but that's not too big a deal. I also got one or two lines in justpasswords.txt that were actually emails, but that's okay as well.

Anyway, there I had it: just the passwords. It was time to do some analyzing.

Looking at password frequency

First, I decided to figure out what the most common passwords were in the set. I tested this using the following Python:

#!/usr/bin/env python

freqs = {}
with open('justpasswords.txt') as f:
    for password in f:
        password = password.strip()
        try:
            freqs[password] += 1
        except KeyError:
            freqs[password] = 1
# Get the top 25 most common passwords.
for password, freq in sorted(freqs.items(), key=lambda x: x[1], reverse=True)[:25]:
    print "%s was used %d times" % (password, freq)

The output came out like this:

123456 was used 558 times
123456789 was used 181 times
password was used 132 times
romance was used 88 times
102030 was used 68 times
mystery was used 67 times
tigger was used 62 times
shadow was used 61 times
123 was used 55 times
ajcuivd289 was used 55 times
bookworm was used 54 times
dragon was used 53 times
sunshine was used 53 times
reader was used 52 times
12345 was used 50 times
purple was used 50 times
maggie was used 48 times
reading was used 47 times
 was used 43 times
1234 was used 42 times
vampire was used 34 times
peanut was used 34 times
angels was used 34 times
booklover was used 33 times
michael was used 32 times

Okay, so there's some usual suspects in there. No surprise that 1234, 12345, 123456, 123456789 or password are there. But some of them are a bit weird -- like ajcuivd289. Why does that show up? A quick grep of the original password file reveals that each of the emails belonging to this particular password is a generic female (mostly Anglo or Hispanic) name at either a variation on gmail.com, mail15.com, or mail333.com. I suspect we can chalk this up to one person or entity having 55 accounts.

Also interesting was that a blank password was used 43 times. This probably has to do with errors in my earlier processing of the data.

Basically, this reveals what we already know: a lot of passwords are very common and based on dictionary words. But only about 4000 out of 62000 passwords total had passwords used 15 or more times in the set, so perhaps these passwords aren't as weak to dictionary attacks as we thought.

Password length

Next, let's look at how long these passwords are on average. Here's another Python script to calculate that:

with open("justpasswords.txt") as f:
    lines = f.readlines()
    average = sum(len(line.strip()) for line in lines if line) / float(len(lines))
    print "Average password length is", average

The average length of a password was 7.63 characters. That's not terrible, but it's not good either. Obviously, whatever source these passwords came from was not doing a very good job of making sure that their users used good passwords.

Password complexity

Finally, let's test to see how complex the passwords are. I tested for lowercase, uppercase, digits, and symbols using this script:

upper, digits, symbols, total = 0, 0, 0, 0.0
with open("justpasswords.txt") as f:
    for password in f:
        total += 1
        password = password.strip()
        if any([str.isupper(x) for x in password]):
            upper += 1
        if any([str.isdigit(x) for x in password]):
            digits += 1
        if any([not str.isalnum(x) for x in password]):
            symbols += 1
    avg_upper, avg_digits, avg_symbols = upper / total, digits / total, symbols / total
    print "Percentage with uppercase: %2.3f" % (avg_upper * 100,)
    print "Percentage with digits: %2.3f" % (avg_digits * 100,)
    print "Percentage with symbols: %2.3f" % (avg_symbols * 100,)

We got this output:

Percentage with lowercase: 79.613%
Percentage with uppercase: 2.010%
Percentage with digits: 55.831%
Percentage with symbols: 1.565%

So, basically, most of the passwords have lowercase or numbers (but not necessarily both) and very few have uppercase or symbols. That does not bode well for password complexity. Let's look for passwords that only pick from one character set:

lower, upper, digits, symbols, total = 0, 0, 0, 0, 0.0
with open("justpasswords.txt") as f:
    for password in f:
        total += 1
        password = password.strip()
        if all([str.islower(x) for x in password]):
            lower += 1
        if all([str.isupper(x) for x in password]):
            upper += 1
        if all([str.isdigit(x) for x in password]):
            digits += 1
        if all([not str.isalnum(x) for x in password]):
            symbols += 1
    avg_lower, avg_upper, avg_digits, avg_symbols = \
               lower/ total, upper / total, digits / total, symbols / total
    print "Percentage with all lowercase: %2.3f%%" % (avg_lower * 100,)
    print "Percentage with all uppercase: %2.3f%%" % (avg_upper * 100,)    
    print "Percentage with all digits: %2.3f%%" % (avg_digits * 100,)
    print "Percentage with all symbols: %2.3f%%" % (avg_symbols * 100,)
    mixture = (1.0 - (avg_lower + avg_upper + avg_digits + avg_symbols)) * 100
    print "Percentage with a mixture: %2.3f%%" % (mixture)

And the output:

Percentage with all lowercase: 43.108%
Percentage with all uppercase: 0.364%
Percentage with all digits: 19.536%
Percentage with all symbols: 0.078%
Percentage with a mixture: 36.914%

Shucks. That's not good. 20% of the passwords are only drawing from a character set with 10 possibilities. 43% are only drawing from lowercase, which has 26 possibilities. This combined with an average password length of 7.3 characters or so is troubling. It's good to hear that 37% of the passwords are mixed, but the majority are still insecure (and remember, there are very few passwords that use uppercase or symbols, which would dramatically increase security compared to adding a number or two to a lowercase password).

Conclusion

The passwords that LulzSec gave us weren't quite as bad we'd expected, but they weren't secure either. Clearly, the source of these passwords did not enforce password security as much as they needed to, judging by the number of passwords that were all lowercase or all digits and exceedingly short. Web developers: force your users to use long and complex passwords. It's good for them. Users: use better passwords.

blog comments powered by Disqus