Wednesday, January 24, 2007

Enron Email Corpus

I've found a corpus of the Enron emails that were made public after the trial. It's a huge download, and I haven't actually managed to get in and see the emails yet, but it IS a corpus :)

link to corpus

Please let me know if y'all can actually get access to the emails.

Thanks for the link, Josh. You realize that the major collections of discourse we've linked to here have come out of scandals most dark, though ;)
Hmm. Not sure my comment actually took, the first time -- Firefox has been acting up and I have to use IE for anything Google or Blogger related. Please remove this comment if it's a duplicate!

Yes, I've been immersed in that corpus for a while -- it's one of the sources I use for my PhD dissertation :-)

It is huge to be sure. When I first downloaded it and unpacked it, before extracting a subcorpus, I had to keep it on my iPod as I didn't have enough space on my computer.

-Linnéa Anglemark, Uppsala University
In addition: the Enron email dataset is a corpus in some respects, but it isn't tagged or anything. When you download it, you get all the email "sorted" into approximately 4700 folders (directories). The email comes from 150 different people, and their mailbox directory structure has been preserved, so you get 150 first-level directories, each named after a user, and then nested folders depending on how much they sorted their email. (Which is an interesting study in itself, I can tell you!)
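
For anyone curious about that folder structure, here is a rough Python sketch that counts how many nested folders each user kept. It assumes only what is described above: a root directory holding one first-level directory per user. The function name and the "maildir" path in the usage line are my own placeholders, not something taken from the dataset's documentation.

```python
import os
from collections import Counter

def folders_per_user(root):
    """Count the nested folders under each first-level user directory."""
    counts = Counter()
    for user in sorted(os.listdir(root)):
        user_dir = os.path.join(root, user)
        if not os.path.isdir(user_dir):
            continue  # skip stray files sitting next to the user directories
        counts[user] = 0
        # os.walk visits every directory below user_dir; dirnames lists
        # the subfolders found at each level, so summing them counts all
        # nested folders at any depth.
        for _dirpath, dirnames, _filenames in os.walk(user_dir):
            counts[user] += len(dirnames)
    return counts

# Hypothetical usage; "maildir" is an assumed name for the unpacked root.
# for user, n in folders_per_user("maildir").most_common(10):
#     print(user, n)
```

Sorting the result with most_common() should show quickly who filed their mail into many folders and who left everything in a few.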

I'm not able to read the documents. How are you accessing the files?
I never had any problem opening the files, Josh - they open in a simple text editor for me. (It is possible that I had to add a .txt extension to the file names first, but I don't think so.) Once you download the dataset you need to unpack it, but after that the files were accessible to me.

There is another search tool (apart from the two that are linked from the site you linked to, of which one no longer works) at if you want to avoid downloading the whole thing.
I got them to open fine, too - no .txt adding or anything.
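
Since the files are plain text, they can also be read from a script instead of an editor. This is only a sketch, and it assumes each file is a single message with ordinary email headers (From, Subject) followed by the body, which nobody above has actually confirmed. The function name is a placeholder; the parsing uses Python's standard email module.

```python
from email import message_from_string

def read_message(path):
    """Return (sender, subject, body) for one plain-text message file."""
    # latin-1 maps every byte to a character, so decoding never fails
    # even if a file is not clean ASCII.
    with open(path, encoding="latin-1") as f:
        msg = message_from_string(f.read())
    # Assumes a single-part message; get_payload() then returns the body text.
    return msg["From"], msg["Subject"], msg.get_payload()
```

No .txt extension is needed for this, because open() does not care what the file is called.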

This is so. Cool.
