K4 Presentation at Linux Tag 2019 in Chemnitz
In order to test our e-mail import module and the word indexer we picked up the ENRON e-mail corpus and loaded it into our database.
ENRON was one of the biggest US energy companies with headquarter in Houston, Texas. In 2001 ENRON caused a big scandal due to continued balance forgery. During crime investigations all e-mails of the company were confiscated. In the meantime these e-mails are released for computer and programming science purposes.
For the testing of K4 we wrote a specific import module that reads the compressed ENRON e-mail corpus (without attachments) and writes it to the database. The e-mails were extracted, their headers analyzed and their content (bodies) indexed.
- Size of compressed corpus: 423 MB; decompressed: 2.6 GB
- Size of resulting Firebird database file: 10.5 GB
- 517,401 e-mails with 8,164,014 properties extracted from headers
- 773,431 rows in object blob table "DOK_OBJ". Each e-mail is expected to create at least 2 rows in "DOK_OBJ": One is required for the reconstruction of the original e-mail and the other one for the text body. If an email is read that has the same body as an e-mail which is already in the database, no new "DOK_OBJ" row is created. Instead a link to the existing one is created. Since the total number of rows in "DOK_OBJ" is significantly lower than twice the number of e-mails in the set, many e-mails must contain the same body.
- 106,192 rows in table "ORG_EDI_ADR", i.e. different e-mail addresses extracted from the headers.
- 43,546 rows in table "ORG_PERS", i.e. recipient and sender names extracted from the headers.
- 49,083 correlations between "ORG_EDI_ADR" and "ORG_PERS" created as determined from header extractions.
- The total number of different words extracted from email bodies is 1,064,722. For these words 28,328,384 connections have been established to email bodies stored in the table "DOK_OBJ".
According to English Live the English language has up to 170,000 common words. During our test we extracted six times as many. Most likely, many extracted words are variants of the same word root. To test it, we searched for "government" as example in our word list and found 198 different words that contain "government" as part. Hence, the most useful way to reduce the indexed list of words is to apply a stemmer algorithm that identifies the root words.
Our current word extraction algorithm has some abilities to subdivide words into several categories. Results are:
- 252,237 words were recognized as dates and time. 164,230 contain am/pm.
- 98,446 words are e-mail addresses "@.*".
- 80,613 potential "words" are longer than 50 characters and yet no URLs. They can be discarded, because they contain embedded e-mails and base64 encoded objects. Such cases demonstrate, that our Python based e-mail parser can still be improved in many ways.
- 36,866 words contain only numbers and +/- (mostly phone numbers)
- 26,654 words are URLs ( they include "http://").
- 16,515 words contain HTML-Tags where we did'nt expected them.
- 7,889 words contain prizes starting with dollar sign "$".
- 1,497 words start with an apostrophe.
Over the last years communication between people has changed substantially by the use of smart phones and messengers like WhatsApp and Telegram. Nevertheless, communication by e-mail is still one of the most important ways of exchanging digital data and information. This is even more true for companies and authorities.
Hence, e-mail is a key element in the management of personal and private data. From a technical perspective a single e-mail contains communication content and besides of that also a lot of useful information in the header, for instance receive time, addresses and names of communication partners.
What can we do with that? We wanted to create a K4 module for e-mail which can:
- Store all e-mail content in the database, searchable.
- Store header information, searchable too.
- Store attachments, searchable as far as possible. In a first step we do that for pure text as well as for HTML pages only. In a further step for office documents and PDFs.
- Avoid redundancy in the database. Attachments are often sent several times per receiver or are multiplied by quoting or resending e-mail. But any attachment can be clearly identified by using hashes (SHA256). If an attachment reaches you a second time, only a reference is stored to the already existing object.
- Automatically extract and assign addresses and names. Semi-automatically create contact lists.
- Remember of the origin of each e-mail (account, folder).
- Read e-mails from and save to different targets: IMAP servers, maildir systems, MBOX files, Cyrus mail filesystem.
- Restore any e-mail cryptographically identical to its origin as well as to its original source. This also means that the e-mail module of K4 is able to archive and restore any e-mail as demanded by law (restoration is 100% the same) without wasting storage space, since duplicate attachments are always stored only once.
There are several Python modules that support parsing and creating e-mails with very nicely. Therefore we are able to disassemble an e-mail into parts like bodies and header lines using only one line of code. In order to analyze header information, bodies and attachments for storing it to the database, several encodings, code pages, languages and erroneous formats - most often contributed by junk mail or mailing lists - needed to be considered.
Currently, each text body part is tokenized and the tokens are provided to an index. In order to avoid the extraction of nonsense words we developed a simple algorithm that quite efficiently categorizes words as things like: URL, e-mail address, simple numbers, times and dates.
For testing purposes we read in about 700,000 distinct e-mails (including the ENRON corpus, see next post). During the implementation of the e-mail module we faced a lot of special cases which can occur in e-mails again and again - for instance non-standard date and time formatting, which had to be specially treated. It is safe to say that although there are standards defined in RFCs many mail clients and servers do not stick to them. But since K4 has learned all those particularities, it can handle any kind of e-mail quite well.
The pillars of K4 - An overview
Applications that collect searchable data (basically every messenger) usually use a SQLite database. SQLite is a very simple relational SQL database without any procedural programming capabilities. Datacentric applications, even those who use far more advanced database products, often view a database as a simple data storage container from which data can be retrieved very fast. If no attempt is made to use relational features, a database often is reduced to a centralized data store, used by some orthogonal persistence layer which is fed data by some middleware. This scenario probably is the most populare one today. A three tier application, where a middleware analyzes, organizes and manages all data, stores it into a database - which kind often is not even important - or provides it to a graphical user interface. The middleware owns the business logic.
The Database: Firebird
With K4 we use an entirely different approach. Relational database products originally aimed to provide complete solutions to any business requirements. Therefore most products provide powerful procedural programming capabilities - although in face of the market dominance of middleware applications they fell a bit into oblivion. However, since we know of the possibilities our approach is focused on using a relational database as means to organize the data and to implement business logic using relational logic and procedural language in a quite unique way. Therefore the first pillar of our solution we chose Firebird as a database using PSQL as a language respectively.
Why did we decide to go that way?
- We wanted to store all content in the database, searchable.
- A well crafted relational DB can provide really good performance when searching things. We wanted to profit from this as much as possible. And as it seems we do.
- All relations between entities are enforced by the database. The relations between entities are cast into well defined database structures too, so the database can not only enforce things, but also evaluate them. There is no external middleware. The database is the solely foundation for the entire application.
- A relational database provides a number of tools to enforce structure and rules. This is very much what we want. Based on this we designed an architecture for K4 which makes heavy use of predefined modular building blocks as well as rules allowing us to translate real world entities into data structures quite efficiently. Using relational structures and specifically indices in a smart way really helps achieve order and speed at the same time.
- A further benefit of the approach: When we are designing a structure for a new entity, we are really forced to understand the entity. We have to decide: Which data are really important for that entity? What should be searchable? What is additional information? What is needed to the things we put into the database in its original form?
- Without any question: Avoiding a three tiered is a question of choice. But one thing is for sure. With the power of a fully developed relational DB at your hands you get so many things for free that you soon realize that the K4-approach makes a lot of things a lot easier to handle.
The Process Manager: PM
PSQL delivers powerful tools to make use of enhanced database capabilities, but a database by nature of its design is a passive, standalone software component. It can not reach out and fetch e-mail from an IMAP server. To be able to do such things we created a separate tool which we call PM. The name PM originally comes from the term "process manager", since it was thought to be a tool to initiate transactions in the database. Essentially it is the external manager of all things which are prepared to go on in the database. Additionaly it analyses external entity data, breaks it down into database structures and provides it to the database. On request it is able to communicate things which are prepared by the database to the outside world. For instance, it can format and send an e-mail, if the database has prepared one. PM is entirely written in Python - a very powerful programming environment. PM is the second pillar of the K4 architecture. It handles:
- Database management functions (creation, backup, restore, migration).
- By the means of PM, the database is "surrounded" by a living cloud of predefined processes (like satellites), which perform tasks for the database. For instance, a process may be polling an IMAP mailbox for new e-mail. It fetches e-mails, parses them and calls the interface PSQL functions to put all e-mail data into the database.
- Data post-processing: When an e-mail is fetched, its attachments are analyzed for their content and all the words it contains are collected. This word list is provided to the database and logically linked to the e-mail object to allow to find this e-mail by search terms later on.
- Communication: A specialized process waits for e-mails to be sent. It extracts e-mail data from the database, creates a standardized e-mail-text-object and negotiates with an SMTP server to send the e-mail.
- Database process management: When a transaction in the database processes things, it may require to start a follow-up transaction. But this something a database can not initiate by itself. Instead it is dependent on external help. PM provides exactly that by catching a signal for a job from the database and initiating an upcoming transaction as required.
- Meta-processes: Meaning bigger tasks, as for instance reading in an entire e-mail folder from an e-mail-program like Thunderbird at once, reading all files from a specified folder, converting them to K4 database documents, importing vcards, importing book entries from a Tellico database (Tellico is a tool to manage collections), reading additional information from internet resources (for instance CDDB, Wiki), reading history files from messengers.
- The process system is fully configurable and modularized.
The Graphical User Interface: Clarissa
While having your data organized in a DB and having them fetched from, sent to and organized by a process management sytem is all well and good, as a user you will need a graphical user interface (GUI) to interact with your data. The GUI is a desktop application and represents the final pillar of the K4 architecture. It is named Clarissa and is written in Lazarus (Object-Pascal). Our requirements for Clarissa are: Easy to install - Easy to use - Easy to understand. For Clarissa we use a very generic approach. The application does not know any details about the data it presents or which you enter. All the intelligence how to handle the data are known to the database - and to some degree PM - only. Clarissa only cares about presenting data to you as well as letting you enter data comfortably and to search and link data together very easily. You as the user can further restrict results by filtering and ordering them. Furthermore Clarissa provides a powerful search system reminding of a web search, which enables you to search all kinds of things you have stored by search terms.
Technically this means that a set of data which you can see on any form are the direct result of a query to the database transformed to a visual representation. To achieve this generic datasets initiate database queries which deliver all the results that are available. The GUI is built from building blocks which are configured via JSON-files.
The Principle of Interaction
The following picture illustrates the operation of the K4 system based on how its pillar components interact. The arrows in the picture display the communication lines between the GUI and the database as well as the database and PM processes.
The K4 concept is a variant of the Model-View-Controller paradigm which is known from GUI software design. "Model" is realized in the database pillar. The Model in our case is built on "entities" or "real world things". The "Controller" is realized by PM, handling import, export and processing of entity data. The Clarissa-GUI is the K4-equivalent of the "View" on the data.
K4 in a Nutshell
- K4 is a new (as of 2018) cross-platform database application which allows it to manage all kinds of personal data
- K4 uses advanced database capabilities like procedural SQL to integrate data structures and business logic into database layer instead of creating a middleware solution that acts as an additional layer between data storage and user interface
- K4 uses a software package separate from the database called PM (written in Python 3) to harvest data and to structure it before writing it to the database as well as to extend abilities of the database to manipulate and analyze data
- K4 features a generic graphical user interface written in Object-Pascal that is responsible for searching, presenting and maintaining data
- K4 GUI aims to let the user do simple things in a simple way
- K4 promotes the idea of "Tiny Data" (as opposed to "Big Data"): Acquire as little data as necessary instead of as much as possible
- K4 currently runs on Linux and Windows with Firebird as database
- The term K4 or the K4-website are not related to other tools with similar names. The K4 package, as published on this website, focuses on providing a personal data management application.
Tiny Data - How to?
Imagine the following situations.
A few months ago, you had a great idea and wrote it down in an electronic file. Suddenly you need your notes, but how do you find them? Because somehow they are not in the place you remember them to be. And your operating system doesn't know about a thing called "note". But you could still search your entire drive...
Or how about that e-mail you received a while ago which included those important documents. Of course you just don't remember who sent it to you or when it was sent or any useful word of the context. Must it really be so difficult to find things quickly when you need them?
Not to forget this book you always wanted to buy. And after you bought it you soon realize that you already own it. If only you could have known that...
Where these examples come from there are countless more. Of course there are already applications that manage e-mails, notes, documents, inventories of books and other objects - but is there a single one which handles all of it? What's more, is there such an application, that also keeps my data private? And could it still allow me to browse my data as comfortably as if it were on the web?
Interesting, but we have not arrived yet!
So the idea unfolds
When big data is associated with the collection of huge, rather unstructured datasets, which are subsequently evaluated in an arbitrary way using automatic mechanisms, can a desktop application with a small database even provide comparable benefit? What if we trade the potential benefits of large unstructured amounts of data against the tangible benefits of manageable structured ones? What if our few, locally stored data were deprived of the access of large centralized services? Will we still be able to effectively search them, like we could with a large web-based search engine? What is it worth if we can decide ourselves with whom we share which of our private data? Is it even conceivable to decentralize the Internet more strongly without losing functionality we already love and use on a daily basis?
We think this may be possible. And if you look around, you may find others to think in similar ways. Look at this for example: Web Decentralisation Project.
So what to do instead of Big Data? We introduce as an alternate approach (drum roll): Tiny Data!
So the question arises: How to do Tiny Data?
How to organize personal data in a manner to keep them private and to extract as much benefit as possible? The idea of Tiny Data has evolved into a concept for a real application. This application will make use of an architecture based on a relational database, an accompanying process management system and a graphical user interface. Our languages of choice are PSQL, Python 3 and Object-Pascal (so far). In subsequent posts we will explain the system and its components in more detail. Furthermore, we will demonstrate the progress on the way based on concrete things we want to manage.