SQLite CVSTrac

Introduction

The module fts2 is a new version of the SQLite FullTextIndex module. At this time, usage is identical to Fts1. The data storage has been radically changed in a way which greatly improves insertion speed, at minor cost to query performance (which may yet become "at no cost"!). fts2 is in the SQLite CVS tree at ext/fts2.

In general, building, linking, and otherwise using fts2 is identical to fts1, and will not be covered, here. Both fts1 and fts2 can safely be built/loaded into the same binary.

Caveats

There is no provision for backwards compatible with fts1 data. Data can be transferred from an fts1 table to an fts2 table by creating a new table using the new module, and doing an insert...select statement.

The fts2 storage format is not yet to be considered final, and it is very likely to change multiple times in the near future. Unfortunately, at this time there is no way to load old and new versions of fts2 at the same time, so if the format does change, the only option is to do a .dump.

Performance

fts1 uses a scheme of segmenting tokenized data into per-term doclists which were stored in a %_terms table. The segmentation allowed newer data to be stored in smaller doclists in order to control update costs. Unfortunately, it means that document inserts at minimum require read-modify-write against every term in the document, plus operations to merge term segments that grow too large. Importantly, this meant that the cost to insert a new document was proportional to the number of tokens in the new document multipled by the number of tokens already in the database.

fts2 changes to a scheme where data is segmented into document groups, and updates occur by merging document groups together. New documents form singleton segment, stored as a series of blobs, which is very cheap and does not degrade (much) over time. Segments are periodically merged together to moderate the query cost (which rises proportional to the number of segments). Term data within a segment is grouped together and written in sorted fashion, making segment merges reasonably cheap.

In a test which loads the Enron email corpus (1.4G of data across 517,431 documents), fts1 required 13.5 hours, while fts2 required 35 minutes. This was with pagesize=4096, synchronous=off, 100 inserts per transaction.

See the top-of-file comment in fts2.c for an in-depth description of how things operate at this time.

Goals

The primary goal of fts2 development is to make it feasible to store hundreds of thousands of documents, up to perhaps a million documents. There's no reason that this couldn't go higher, fts2 simply doesn't have the tools in place to manage issues that arise with such large datasets. For instance, if there were a million documents in the Enron dataset, the million document merge would read and write back about 1.6G of data during the course of a single update. (Yes, there are many ideas in the air for how to handle this!)

FtsTwoNotes - notes relating to fts2 development.

sqlite - Fts Two