Page History

Full-text indexing for SQLite
Status: Draft (as of 25 Apr 2006)

Objective

Design and implement a full-text indexing and query system to be incorporated into SQLite. It should provide a number of useful features, but it should strive for efficiency and simplicity rather than the all-encompassing generality offered by implementations such as CLucene.

Background

Full-text indexing has also been incorporated into MySQL (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html), PostgreSQL (http://techdocs.postgresql.org/techdocs/fulltextindexing.php), and Microsoft SQL Server (http://support.microsoft.com/default.aspx?scid=kb;%5BLN%5D;323739). Their interfaces and designs, where available, will likely be useful in planning the corresponding SQLite module.
Interesting sets of test data include the Enron email dataset (http://www.cs.cmu.edu/~enron/), various TREC (Text Retrieval Conference) data sets (http://trec.nist.gov/data.html), the Calgary text-compression corpus (http://links.uwaterloo.ca/calgary.corpus.html), and assorted International English corpora from Cambridge (http://www.cambridge.org/elt/corpus/international_corpus.htm).

Overview

This project comprises two primary services: indexing and querying. Broad goals include

Indexing

Fast indexing in (soft) bounded time and memory
Small index size, primarily to improve speed by reducing disk I/O
Integration with SQL and especially with SQL transactions

Querying

Google-like query language
AND (perhaps implicitly), OR, NOT
Phrase queries
Field qualifiers (to:, from:, etc.)
Support for contextual snippets, although the snippets themsleves may not be provided
Sufficient integration with SQLite to allow complicated SELECT statements and table joins
Adequate speed; extreme query speed is not a goal of the project, since indexing speed is a much higher priority in the expected uses.
Relevance metrics are not a goal of the project, since relevance is highly application-dependent.

Detailed Design

Many more issues must be considered before a more detailed design will be possible. Among them are the following:

Interface and hardware requirements

Text encoding and internationalization
RAM, CPU, and disk usage
OS support: Windows 9x/ME?

Core data structures and algorithms

API provided
Trie (prefix tree: http://en.wikipedia.org/wiki/Trie) vs. B-tree (http://en.wikipedia.org/wiki/B-tree) vs. something else
Unified or layered data
Update scheme: based on expected usage patterns, the module should be designed for frequent incremental updates of the index, rather than occasional, monolithic index revisions.
Compatibility with Lucene (http://lucene.apache.org/) and CLucene (http://clucene.sourceforge.net/)
Indexing data stored in the SQLite table or index; in the SQLite database file but separately; or in an entirely separate file
Text to be indexed required to be stored in a SQLite table, or allowed to be provided from some external source
Synchronous or asynchronous operation

Features

Built in, available via a plug-in component, or ignored?

Word segmentation rules
Prefix matching
Substring matching
Case folding
Combined case-sensitive and -insensitive matching
Punctuation matching
Stemming
Stop words

Proposed Interface (2006-Apr-28)

A full text index appears to the programmer to be a single table. This table is not directly realized on disk, however. It is a virtual table - similar in concept to a VIEW except that it is implemented using C code rather than as a SELECT statement. See VirtualTables for more information. The full text index table is an interface to C routines that do the appropriate searching, not a materialized table. Obviously, there must be some kind of real backing store - the backing store just is not a direct realization of the index table.

Assuming the full text index is named EX1 then the table appears to have the following definition:

   CREATE TABLE ex1(
     docid INTEGER PRIMARY KEY,
     doctext TEXT,
     score REAL,
     snippet TEXT,
     pattern TEXT
   );

There is a special API that creates this table. It is not created by executing a CREATE TABLE statement. The CREATE TABLE statement above is there to demonstrate the structure of the table, not to show how the table is created.

The actual storage for the full text index might be in the original database file or it might be in auxiliary files. That decision is made in the code that implements the virtual table and is transparent to the user.

Add documents to the index by issuing SQL statements like this:

   INSERT INTO ex1(doctxt) VALUES(:docbody);

Because EX1.DOCID is an integer primary key, a new value for the document id is assigned automatically and is accessible using sqlite3_last_insert_rowid() in the usual way.

The EX1.SCORE, EX1.SNIPPET, and EX1.PATTERN columns are intended for reading, not writting. Attempts to write anything into those columns (other than NULL) will result in a constraint error.

The full text index may or may not store the original document text. If the original document is not stored, then attempts to read the EX1.DOCTEXT column will return a NULL.

The text segmenter, stemmer, and the stop-word list are all built into the indexer. To use a different segmenter, stemmer, or stop-word list, register a different indexer object. If you have a situation where you what multiple segmenters to be used in a single index, then you can define an additional column in the virtual table that specifies which segmenter to use. For example:

   CREATE TABLE ex2(
     docid INTEGER PRIMARY KEY,
     doctext TEXT,
     doctype TEXT,  -- Specifies which segmenter to use
     score REAL,
     snippet TEXT,
     pattern TEXT
   );

Then

   INSERT INTO ex2(doctype,doctext) VALUES('html', :docbody);

The knowledge of what segmenter names are valid will still have to be built into the virtual table object. If an unknown document type is specified, the virtual table object will raise an exception on the insert.

This concept of using a table as an interface instead of as bulk storage is unsettling to some people at first. But the more you think about it, the more the concept will grow on you.

Documents can be removed from the index using

   DELETE FROM ex1 WHERE docid=:id

To search for documents, do a select with a WHERE clause constraint on the EX1.PATTERN column, like this:

   SELECT docid FROM ex1
    WHERE pattern='"George Washington" slept here';

The EX1.PATTERN column does not represent anything that is actually stored. This is just a way to specify the search pattern. If you query for the EX1.PATTERN column you will always get back a NULL. The EX1.PATTERN column is really only useful in WHERE clause constraints. In any other context it always appears to be NULL.

The string that EX1.PATTERN is compared against will be parsed by the indexer. In this case, we have the two keywords George and Washington in double-quotes, indicating that this is a phrase search. Only documents where these words occur together will match. Other kinds of search syntaxes can be used. It is up to the indexer implementation to parser and interpret the search pattern. To SQLite, the pattern is just a string.

The EX1 table has some unusual properties (such as the strange behavior of the PATTERN column) but in other respects it can be used like any other table. The EX1 table can appear in a JOIN or a subquery just like any other table. And you can create VIEWs on the EX1 table. But you cannot DROP the EX1 table or create additional indices on it.

Additional information about the search result is available from the SCORE and SNIPPET columns. The SCORE is some kind of score attached to the pattern match and SNIPPET is a context snippet for the match. These are not real columns. Both are computed only if requested. But they can be used like any other column. Consider this example:

   SELECT docinfo.title, ex1.snippet, docinfo.link
     FROM ex1 JOIN docinfo USING(docid)
    WHERE ex1.pattern='"George Washington" slept here'
      AND ex1.score>0.5
    ORDER BY docinfo.date DESC, ex1.score DESC
    LIMIT 20;

All of this is implemented using a the VirtualTables interface to SQLite.

Advantages and Disadvantages Of The Interface Proposal Above

Advantages:

The VirtualTable mechanism has been suggested before and has other uses outside of full-text search.
Because the full-text search is a generic add-on, we can experiment with different full-text search algorithms. Users can add their own algorithms simply by adding a new or modified virtual table. There is no need to hack on the SQLite core in order to substitute an alternative full-text search algorithm.

Disadvantages:

No other database engine that we know of does it this way.

sqlite - Full Text Index

Page History

Full-text indexing for SQLite Status: Draft (as of 25 Apr 2006)

Full-text indexing for SQLite
Status: Draft (as of 25 Apr 2006)