Example: ENRON
By Mathias Ball
To test our e-mail import module and the word indexer, we took the ENRON e-mail corpus and loaded it into our database.
ENRON was one of the biggest US energy companies, headquartered in Houston, Texas. In 2001 ENRON caused a major scandal through continued falsification of its balance sheets. During the criminal investigation, all e-mails of the company were confiscated. These e-mails have since been released for research in computer science and programming.
For the testing of K4 we wrote a specific import module that reads the compressed ENRON e-mail corpus (without attachments) and writes it to the database. The e-mails were extracted, their headers analyzed and their content (bodies) indexed.
- Size of compressed corpus: 423 MB; decompressed: 2.6 GB
- Size of resulting Firebird database file: 10.5 GB
- 517,401 e-mails with 8,164,014 properties extracted from headers
- 773,431 rows in object blob table “DOK_OBJ”. Each e-mail is expected to create at least 2 rows in “DOK_OBJ”: one is required for the reconstruction of the original e-mail and the other for the text body. If an e-mail is read that has the same body as one already in the database, no new “DOK_OBJ” row is created; instead, a link to the existing row is created. Since the total number of rows in “DOK_OBJ” is significantly lower than twice the number of e-mails in the set, many e-mails must contain the same body.
- 106,192 rows in table “ORG_EDI_ADR”, i.e. different e-mail addresses extracted from the headers.
- 43,546 rows in table “ORG_PERS”, i.e. recipient and sender names extracted from the headers.
- 49,083 correlations between “ORG_EDI_ADR” and “ORG_PERS” created as determined from header extractions.
- The total number of different words extracted from e-mail bodies is 1,064,722. For these words, 28,328,384 connections have been established to e-mail bodies stored in the table “DOK_OBJ”.
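The body deduplication described for “DOK_OBJ” can be sketched as follows. This is a minimal illustration, not our import module: the class `BodyStore`, its in-memory dictionaries and the use of SHA-256 to detect identical bodies are all assumptions made for the example.

```python
import hashlib


class BodyStore:
    """Hypothetical in-memory stand-in for the "DOK_OBJ" table."""

    def __init__(self):
        self.rows = {}         # row id -> stored content
        self._by_digest = {}   # SHA-256 of a body -> row id of that body
        self._next_id = 1

    def _new_row(self, content):
        row_id = self._next_id
        self._next_id += 1
        self.rows[row_id] = content
        return row_id

    def add_email(self, raw, body):
        """Store one e-mail and return (raw_row_id, body_row_id)."""
        # The original e-mail always gets its own row for reconstruction.
        raw_id = self._new_row(raw)
        # The body is stored only once; identical bodies share a row.
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        body_id = self._by_digest.get(digest)
        if body_id is None:
            body_id = self._new_row(body)
            self._by_digest[digest] = body_id
        return raw_id, body_id


store = BodyStore()
first = store.add_email("raw message 1", "Meeting at noon.")
second = store.add_email("raw message 2", "Meeting at noon.")
```

Here the two e-mails produce three rows instead of four, because the second one links to the existing body row. Under the same model, and assuming every e-mail stores its own reconstruction row, the figures above allow at most 773,431 - 517,401 = 256,030 distinct bodies.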
According to English Live, the English language has up to 170,000 common words. During our test we extracted roughly six times as many. Most likely, many of the extracted words are variants of the same root. To test this, we searched our word list for “government” as an example and found 198 different words that contain “government”. Hence, the most useful way to reduce the indexed word list is to apply a stemming algorithm that identifies the root words.
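To illustrate how stemming would shrink the word list, here is a deliberately naive suffix-stripping sketch. It is not the Porter stemmer or any other established algorithm; the suffix list, the minimum stem length and the function names `naive_stem` and `group_by_stem` are assumptions chosen for this example.

```python
from collections import defaultdict

# Hypothetical, very small suffix list; a real stemmer uses far more rules.
SUFFIXES = ("ing", "ment", "al", "ed", "s")
MIN_STEM = 4  # never shorten a word below this many characters


def naive_stem(word):
    """Repeatedly strip known suffixes from the lowercased word."""
    stem = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM:
                stem = stem[: -len(suffix)]
                changed = True
                break
    return stem


def group_by_stem(words):
    """Map each stem to the sorted list of lowercased words sharing it."""
    groups = defaultdict(set)
    for word in words:
        groups[naive_stem(word)].add(word.lower())
    return {stem: sorted(members) for stem, members in groups.items()}


groups = group_by_stem(["government", "Governments", "governmental", "energy"])
```

In this sketch, the three variants of “government” collapse into one index entry under the stem "govern", while "energy" keeps its own entry. Words that merely contain “government” in the middle, such as compounds or tokens with attached punctuation, would need additional normalization beyond suffix stripping.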
Our current word extraction algorithm has some abilities to subdivide words into several categories. Results are:
- 252,237 words were recognized as dates and times. 164,230 contain am/pm.
- 98,446 words are e-mail addresses (they match “@.*”).
- 80,613 potential “words” are longer than 50 characters yet are not URLs. They can be discarded because they contain embedded e-mails and base64-encoded objects. Such cases demonstrate that our Python-based e-mail parser can still be improved in many ways.
- 36,866 words contain only numbers and +/- (mostly phone numbers)
- 26,654 words are URLs (they include “http://”).
- 16,515 words contain HTML tags where we didn’t expect them.
- 7,889 words contain prices starting with a dollar sign “$”.
- 1,497 words start with an apostrophe.
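A categorization along the lines of the list above could look like the following sketch. It is not our actual extraction algorithm: the regular expressions, the order in which the rules are tried, the category names and the function `categorize` are assumptions for illustration, and the date/time pattern in particular covers only a few simple formats.

```python
import re
from collections import Counter

# Hypothetical patterns; real data needs broader date and tag handling.
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")
NUMERIC = re.compile(r"[0-9+\-]+")
DATETIME = re.compile(
    r"\d{1,2}:\d{2}(am|pm)?|\d{1,2}/\d{1,2}/\d{2,4}", re.IGNORECASE
)


def categorize(token):
    """Return a category name for one extracted token (first match wins)."""
    if "http://" in token.lower():
        return "url"
    if len(token) > 50:
        return "too_long"   # candidates for discarding
    if "@" in token:
        return "email"
    if HTML_TAG.search(token):
        return "html"
    if token.startswith("$"):
        return "price"
    if NUMERIC.fullmatch(token):
        return "number"     # digits with +/-, mostly phone numbers
    if DATETIME.search(token):
        return "datetime"
    if token.startswith("'"):
        return "apostrophe"
    return "word"


tokens = ["http://example.com/page", "someone@example.com", "10:30am",
          "713-555-0100", "$25.00", "<br>text", "'em", "energy"]
counts = Counter(categorize(t) for t in tokens)
```

Because the first matching rule wins, the order of the checks decides how ambiguous tokens are counted. For example, a URL longer than 50 characters is still classified as a URL here, which matches the distinction made in the list above between overlong tokens and URLs.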