Small. Fast. Reliable.
Choose any three.

Full-text indexing for SQLite
Status: Draft (as of 25 Apr 2006)

Objective

Design and implement a full-text indexing and query system to be incorporated into SQLite. It should provide a number of useful features, but it should strive for efficiency and simplicity rather than the all-encompassing generality offered by implementations such as CLucene.


Background


Overview

This project comprises two primary services: indexing and querying. Broad goals include

Indexing

Querying


Detailed Design

Many more issues must be considered before a more detailed design will be possible. Among them are the following:

Interface and hardware requirements

Core data structures and algorithms

Features

Built in, available via a plug-in component, or ignored?


Proposed Interface (2006-Apr-28)

A full text index appears to the programmer to be a single table. This table is not directly realized on disk, however. It is a virtual table - similar in concept to a VIEW except that it is implemented using C code rather than as a SELECT statement. See VirtualTables for more information. The full text index table is an interface to C routines that do the appropriate searching, not a materialized table. Obviously, there must be some kind of real backing store - the backing store just is not a direct realization of the index table.

Assuming the full text index is named EX1 then the table appears to have the following definition:

   CREATE TABLE ex1(
     docid INTEGER PRIMARY KEY,
     doctext TEXT,
     score REAL,
     snippet TEXT,
     pattern TEXT
   );

There is a special API that creates this table. It is not created by executing a CREATE TABLE statement. The CREATE TABLE statement above is there to demonstrate the structure of the table, not to show how the table is created.

The actual storage for the full text index might be in the original database file or it might be in auxiliary files. That decision is made in the code that implements the virtual table and is transparent to the user.

Add documents to the index by issuing SQL statements like this:

   INSERT INTO ex1(doctxt) VALUES(:docbody);

Because EX1.DOCID is an integer primary key, a new value for the document id is assigned automatically and is accessible using sqlite3_last_insert_rowid() in the usual way.

The EX1.SCORE, EX1.SNIPPET, and EX1.PATTERN columns are intended for reading, not writting. Attempts to write anything into those columns (other than NULL) will result in a constraint error.

The full text index may or may not store the original document text. If the original document is not stored, then attempts to read the EX1.DOCTEXT column will return a NULL.

The text segmenter, stemmer, and the stop-word list are all built into the indexer. To use a different segmenter, stemmer, or stop-word list, register a different indexer object. If you have a situation where you what multiple segmenters to be used in a single index, then you can define an additional column in the virtual table that specifies which segmenter to use. For example:

   CREATE TABLE ex2(
     docid INTEGER PRIMARY KEY,
     doctext TEXT,
     doctype TEXT,  -- Specifies which segmenter to use
     score REAL,
     snippet TEXT,
     pattern TEXT
   );

Then

   INSERT INTO ex2(doctype,doctext) VALUES('html', :docbody);

The knowledge of what segmenter names are valid will still have to be built into the virtual table object. If an unknown document type is specified, the virtual table object will raise an exception on the insert.

This concept of using a table as an interface instead of as bulk storage is unsettling to some people at first. But the more you think about it, the more the concept will grow on you.

Documents can be removed from the index using

   DELETE FROM ex1 WHERE docid=:id

To search for documents, do a select with a WHERE clause constraint on the EX1.PATTERN column, like this:

   SELECT docid FROM ex1
    WHERE pattern='"George Washington" slept here';

The EX1.PATTERN column does not represent anything that is actually stored. This is just a way to specify the search pattern. If you query for the EX1.PATTERN column you will always get back a NULL. The EX1.PATTERN column is really only useful in WHERE clause constraints. In any other context it always appears to be NULL.

The string that EX1.PATTERN is compared against will be parsed by the indexer. In this case, we have the two keywords George and Washington in double-quotes, indicating that this is a phrase search. Only documents where these words occur together will match. Other kinds of search syntaxes can be used. It is up to the indexer implementation to parser and interpret the search pattern. To SQLite, the pattern is just a string.

The EX1 table has some unusual properties (such as the strange behavior of the PATTERN column) but in other respects it can be used like any other table. The EX1 table can appear in a JOIN or a subquery just like any other table. And you can create VIEWs on the EX1 table. But you cannot DROP the EX1 table or create additional indices on it.

Additional information about the search result is available from the SCORE and SNIPPET columns. The SCORE is some kind of score attached to the pattern match and SNIPPET is a context snippet for the match. These are not real columns. Both are computed only if requested. But they can be used like any other column. Consider this example:

   SELECT docinfo.title, ex1.snippet, docinfo.link
     FROM ex1 JOIN docinfo USING(docid)
    WHERE ex1.pattern='"George Washington" slept here'
      AND ex1.score>0.5
    ORDER BY docinfo.date DESC, ex1.score DESC
    LIMIT 20;

All of this is implemented using a the VirtualTables interface to SQLite.

Advantages and Disadvantages Of The Interface Proposal Above

Advantages:

Disadvantages: