Handling E-Mail
By Mathias Ball
Over the last few years, communication between people has changed substantially through the use of smartphones and messengers like WhatsApp and Telegram. Nevertheless, e-mail is still one of the most important ways of exchanging digital data and information. This is even more true for companies and authorities.
Hence, e-mail is a key element in the management of personal and private data. From a technical perspective, a single e-mail contains not only the communication content but also a lot of useful information in the header, for instance the time of receipt and the addresses and names of the communication partners.
What can we do with that? We wanted to create a K4 module for e-mail which can:
- Store all e-mail content in the database, searchable.
- Store header information, searchable too.
- Store attachments, searchable as far as possible. As a first step we do this only for plain text and HTML pages; office documents and PDFs will follow in a later step.
- Avoid redundancy in the database. Attachments are often sent several times to the same receiver or are multiplied by quoting or resending e-mails. However, any attachment can be unambiguously identified by its hash (SHA-256). If an attachment reaches you a second time, only a reference to the already existing object is stored.
- Automatically extract and assign addresses and names. Semi-automatically create contact lists.
- Remember the origin of each e-mail (account, folder).
- Read e-mails from and save to different targets: IMAP servers, maildir systems, MBOX files, Cyrus mail filesystem.
- Restore any e-mail so that it is cryptographically identical to the original, and restore it to its original source. This also means that the e-mail module of K4 is able to archive and restore any e-mail as demanded by law (the restored e-mail is 100% identical) without wasting storage space, since duplicate attachments are always stored only once.
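The attachment deduplication described in the list can be sketched in a few lines. This is a minimal illustration and not the K4 implementation: the `AttachmentStore` class, its in-memory dictionaries and the message IDs are hypothetical stand-ins for the database tables.

```python
import hashlib


class AttachmentStore:
    """Hypothetical in-memory stand-in for a content-addressed attachment table."""

    def __init__(self):
        self.blobs = {}       # SHA-256 hex digest -> attachment bytes
        self.references = []  # (message_id, digest) pairs, one per occurrence

    def add(self, message_id, payload):
        """Store the payload only once, but record a reference every time."""
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = payload
        self.references.append((message_id, digest))
        return digest


store = AttachmentStore()
first = store.add("msg-1", b"quarterly report")
second = store.add("msg-2", b"quarterly report")  # same bytes, sent again
```

Both calls return the same digest, the bytes are kept only once, and each message still has its own reference, which is what allows a later byte-identical restore.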
There are several Python modules that support parsing and creating e-mails very nicely. Therefore we are able to disassemble an e-mail into parts like bodies and header lines using only one line of code. In order to analyze header information, bodies and attachments before storing them in the database, several encodings, code pages, languages and erroneous formats - most often caused by junk mail or mailing lists - needed to be considered.
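As an illustration of how little code the parsing step needs, here is a small sketch using the `email` package from the Python standard library. Whether K4 uses this package or another module is an assumption on our side for this example; the sample message `RAW` is made up.

```python
from email import message_from_bytes, policy
from email.utils import getaddresses

# Made-up sample message with an RFC 2047 encoded subject.
RAW = (
    b"From: Alice Example <alice@example.org>\r\n"
    b"To: Bob <bob@example.org>, carol@example.org\r\n"
    b"Subject: =?utf-8?q?Gr=C3=BC=C3=9Fe?=\r\n"
    b"Date: Mon, 01 Jan 2024 10:00:00 +0100\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Hello Bob\r\n"
)

# The one line that disassembles the e-mail into headers and body parts.
msg = message_from_bytes(RAW, policy=policy.default)

# policy.default decodes encoded words in headers for us.
subject = str(msg["Subject"])

# Split the To header into (display name, address) pairs.
recipients = getaddresses([str(value) for value in msg.get_all("To", [])])

# Pick the plain-text body and decode it according to its charset.
body = msg.get_body(preferencelist=("plain",)).get_content()
```

Real-world mail is far less tidy than this sample, so each of these calls needs error handling for broken encodings and malformed headers.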
Currently, each text body part is tokenized and the tokens are passed to an index. In order to avoid indexing nonsense words, we developed a simple algorithm that quite efficiently classifies tokens into categories such as URLs, e-mail addresses, simple numbers, times and dates.
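A token classifier of this kind could look roughly like the following sketch. It is not the algorithm used in K4; the regular expressions, the category names and the `categorize` function are assumptions chosen for illustration, and they cover only a few common formats.

```python
import re

# Illustrative patterns only, checked in order from most to least specific.
PATTERNS = [
    ("url", re.compile(r"^(?:https?|ftp)://\S+$", re.IGNORECASE)),
    ("email", re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$")),
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$|^\d{1,2}[./]\d{1,2}[./]\d{2,4}$")),
    ("time", re.compile(r"^\d{1,2}:\d{2}(?::\d{2})?$")),
    ("number", re.compile(r"^[+-]?\d+(?:[.,]\d+)?$")),
]


def categorize(token):
    """Return the first matching category, or 'word' for ordinary text."""
    for label, pattern in PATTERNS:
        if pattern.match(token):
            return label
    return "word"


text = "meet at 10:30 on 12.03.2024 see https://example.org/a or mail bob@example.org"
# Naive whitespace tokenization; a real tokenizer would also strip punctuation.
categories = [(token, categorize(token)) for token in text.split()]
```

Only tokens classified as `word` would then be candidates for the full-text index, while the other categories can be stored in dedicated fields or skipped.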
For testing purposes we read in about 700,000 distinct e-mails (including the ENRON corpus, see next post). During the implementation of the e-mail module we faced a lot of special cases that occur in e-mails again and again - for instance non-standard date and time formatting, which had to be treated specially. It is safe to say that, although there are standards defined in RFCs, many mail clients and servers do not stick to them. But since K4 has learned all those particularities, it can handle any kind of e-mail quite well.
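To give an idea of what such special treatment can look like, the following sketch tries the standard parser first and then falls back to a list of other formats. The `parse_mail_date` function and the entries in `FALLBACK_FORMATS` are hypothetical examples, not the list of cases K4 actually handles, and treating dates without a time zone as UTC is an assumption made for this sketch.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical examples of non-standard formats seen in Date headers.
FALLBACK_FORMATS = (
    "%d.%m.%Y %H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%d %b %Y %H:%M",
)


def parse_mail_date(value):
    """Parse a Date header leniently; return None if nothing matches."""
    value = value.strip()
    try:
        # Standard RFC 5322 style dates are handled by the standard library.
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        parsed = None
    if parsed is None:
        for fmt in FALLBACK_FORMATS:
            try:
                parsed = datetime.strptime(value, fmt)
                break
            except ValueError:
                continue
    if parsed is None:
        return None
    if parsed.tzinfo is None:
        # Assumption for this sketch: dates without a zone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


standard = parse_mail_date("Mon, 01 Jan 2024 10:00:00 +0100")
odd = parse_mail_date("01.02.2024 10:00:00")
broken = parse_mail_date("not a date")
```

Because the original header line is stored unchanged anyway, a lenient parser like this only affects searching and sorting, not the ability to restore the e-mail exactly.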