We created Timestream to help organize complex cases and investigations into a portable, shareable format, while also maintaining the story and chronology of the case on a timeline, keeping everyone on the same page. This blog post details why we made certain design decisions related to facilitating collaboration and outlines a SQLite model for storing case chronology (aka how to model Git in SQLite).

Note: We assume that you have some rudimentary knowledge of relational database structure and basic familiarity with Git.

Timestream Requirements

We knew we had to fulfill four distinct requirements for Timestream:

  • Our system had to be able to store the complete chronology of a case and all of the supporting documents inside a single file.
  • The application needed to be portable, requiring no installation process and compatible with both Mac and Windows systems.
  • Users had to be able to work together and share cases with others.
  • Finally, Timestream needed to work offline, completely disconnected.

Why SQLite?

SQLite support is ubiquitous across all major operating systems. This was very important given our multiple OS and no installation requirements. SQLite is also able to store large BLOBs inside of a database column, which was ideal for fulfilling our portable case file format. Given these requirements, SQLite was a no-brainer when compared with writing our own file format or using something like a zipped directory to store the Timestream case data.

The Problem With Centralized Revision Control

We knew our server had to facilitate collaboration, but users had to be able to work offline as well. Using a Google Docs-style live update system was impossible. At first we attempted to employ a centralized server that handed out revision dates for each item in the case file. Using this method:

  • A user would initiate sync with server.
  • Timestream would ask the server for everything after the maximum revision date stored in the case file.
  • Then look up each “new” object from the server and figure out if it was add or modify.
  • Deletes were really modifies, as we just used a flag to mark an object as deleted.

However, a requirement arose that a user should be able to sync the same case with different servers. We realized a centralized server approach would be problematic. If the clocks on the servers were not exactly in sync, there could be errors during the merge.

Enter The Revision Graph (aka how Git works)

We use Git as our source code management system. One of its most valuable features is how it elegantly solves the challenge of managing versions of code without the need for a centralized server. Since we were faced with a similar challenge, we learned as much as we could about the underlying concepts of acyclic directed graphs which Git implements so usefully.

Let’s take a brief look at how Git works.

To simplify things a bit, Git stores a “snapshot” of all the files in your source directory each time you make a commit. For space efficiency, Git will not store a file again if it is identical to the previous revision’s version, instead it stores a link to the previous commit’s version. Each revision stores a link to its parent as well. In the event of a merge, the merged revision stores a link to each of its parent revisions. These revisions and the links between them (defining the parent relationships) create a directed acyclic graph.

Figure 1
Figure 1

As you can see in Figure 1, revision A is the parent of revision B. At revision B, the graph diverges, which in our case, means that two different users, users 1 and 2, edited it, creating revisions C and D, respectively. At that point user 1 syncs his case to the server. When user 2 syncs his case, a merge revision E is created.

Figure 2
Figure 2

As illustrated in Figure 2, storing revisions in this way would allow users to sync with many different servers, creating entire new branches of the graph, and then merge those new branches into the case graph.

Git also extensively uses hashes to ensure that each revision has integrity. In short, Git stores the content of each file and the hash of that content. The actual filename is only metadata and each revision is stored using its revision hash. A revision hash is the hash of the string created by concatenating the hash of each file in the revision. This ensures that there can be no error in transmission when syncing in Git because Git checks the hashes when it receives the new information. Read more about how Git works here.

The SQLite Model

All of this sounds great, but Git works on files in a filesystem and creates its own database to store the revision graph; how can we use the ideas from Git to store Timestream revisions in a SQLite datastore? I’m glad you asked…

The stage Table

In Timestream, each object in a case (events, files, tags, etc.) has a GUID. This GUID is analogous to a filename in Git. As stated above, Git actually is more concerned with the content of a file and that hash of that content. So we needed to come up with a consistent way of representing and hashing object content. We chose to store all case information in the same table, called stage. The definition is below:

Stage Table

The obj_guid field is the generated GUID for each object in the case. The data field is a JSON representation of the object. An example is below:
{ “obj_type”:”event”, “start_date”:”2013-04-18 21:00:00″, “end_date”:”2013-04-18 22:00:00″, “title”:”Event title”, “summary”:”Event summary” }

When we save each object to stage, we hash the data field and store the result in obj_hash. In the same way that the obj_guid field is analogous to a filename in Git, the data field is analogous to the contents of a file in Git. This will become important later when we talk about the obj table.

stage is a working area in which the user can make changes to the case. Each time a user edits an object in the case using the Timestream Desktop app, we update the data field representing that object and the obj_hash with which it is associated.

When a user chooses to sync, we create a revision from stage by concatenating the obj_hash fields in stage into one long string and hashing that string to get the revision hash. It’s important to order the obj_hashes in some way before creating the revision hash since a different order would generate a different hash.

What do we do with the revision hash? Glad you asked…

The revision Table

The revision table stores information about each revision in the case. This table is the glue that holds the rest of our distributed revision control together. Before we get into too much detail, let’s have a look at the definition:

Revision Table

The revision_guid field is a generated GUID for the revision. The revision_hash field is the one outlined in the last section. The parents field stores a JSON array of each of the revision_guids of the parents of a revision. Using the parents field and the revision_guid field, we can construct the directed acyclic graph explained earlier. We also store the revision_date, author, and message for each revision.

Git actually uses a hash of the revision hash and some other revision metadata (author, creation date, etc.) to identify a revision. We decided that storing the revision hash and creating a GUID for the revision would be strong enough to retain revision integrity. In other words, if someone were to change the revision date, the revision would still pass integrity checks. But if any object in a revision were changed, the revision would fail. We opted to keep the revision hash as simple as possible to minimize the chances for inconsistencies when creating hashes on different systems.

So this looks pretty good, but how can we know which objects are in which revisions? I’m glad you asked…

The revision_obj Table

Revision_obj Table

The revision_obj table defines the objects in a revision. The definition is below:

The revision_guid field is a link to the revision table; and the obj_hash and obj_guid fields again come from stage. We store the obj_hash in this table in order to avoid having to look it up when calculating a revision hash to verify the integrity of a revision received from the server.

So it seems something very important is missing from our model so far… the data field! So where does the content of each object get stored? As I’m sure you’ve guessed by now, I’m glad you asked…

The obj Table

The obj table keeps track of all the objects that have ever been in any revision of a case. Like in Git, an object is only stored when it is created or changed. In other words, if 5 revisions have the same copy of an object, only one copy is stored in the obj table. The definition for obj is about as simple as it gets:

Obj Table

Since we only care about the contents of the objects, there is no reason to store the GUIDs in the obj table. The obj_hash and data fields in this table are exactly the same values as they were in stage when the first revision that contained this object was created.

Conclusion

Wonderful but how does this work in the real world?

Let’s say a user is offline and modifies a Timestream case.

  • Timestream stores the changes in the stage table of the respective case file.

User gets online and chooses to sync with the server.

  • Timestream creates a revision from the information in stage. Timestream asks the server for the revision_guid of its most recent revision.
  • If the server’s most recent revision is not an ancestor of the revision the client just created (see Fig. 3), Timestream performs a merge of the server’s most recent revision and its new revision. This new merged revision is then pushed up to the server and becomes the server’s most recent revision. See Figure 3 for a full illustration of a situation under which a merge would be necessary.
  • If the server’s most recent revision is an ancestor of the new revision, the new revision is simply pushed up to the server and the server updates its most recent revision accordingly.
Figure 3
Figure 3