Small. Fast. Reliable.
Choose any three.

Full-text Search for SQLite
(as of 2006-08-23)

This is preliminary documentation for a full-text indexing module for SQLite

Getting the full-text module

The full-text module is in the SQLite CVS tree in the ext/fts1 directory. Building

The module can be built either as a standalone shared library, or statically linked into the SQLite library.

As a shared library

To build the module as a shared library, compile all source files in the fts1 directory into a shared library (.so or .dll) on your platform. (Sorry - there's no makefile checked in yet. Coming soon.)

Statically linked

To statically link the module into SQLite, add all .c files from the fts1 directory to the Makefile you use to compile SQLite so that they will be linked into the SQLite image. You must define the preprocessor symbol SQLITE_CORE when compiling these files.

Initializing

As a shared library

When the module is built as a shared library, you can load it into SQLite using the ".load" shell command.

  sqlite> .load fulltext

Or you can load it using a SELECT statement:

  SELECT load_extension('fulltext');

Note that you may need to call sqlite3_enable_load_extension before loading the extension; see the SQLite LoadableExtensions documentation.

Statically linked

In a static build, you must call the fulltext_init function to be able to use full-text indexing on any given database connection. This function is defined in fulltext.h and has the following prototype:

  int fulltext_init(sqlite3 *db);

fulltext_init returns SQLITE_OK on success or an SQLite error code otherwise.

Using

Full-text tables store fully indexed text. You can create a full-text table using the CREATE VIRTUAL TABLE statement:

  sqlite>create virtual table foo using fulltext;

A full-text table has a single column content of type text:

   sqlite>insert into foo (content) values ('to sleep perchance to dream');
   sqlite>insert into foo (content) values ('to be or not to be');

Every row has a unique rowid, just as in any other SQLite table:

   sqlite>select rowid, * from foo;
   1|to sleep perchance to dream
   2|to be or not to be
   sqlite>

The MATCH operator performs a full-text match on the content column in a full-text table:

   sqlite>select * from foo where content match 'dream';
   to sleep perchance to dream
   sqlite>

Query language

A query may contain multiple terms, in which case it will return only documents containing all of the terms:

   sqlite>select * from foo where content match 'dream sleep';
   to sleep perchance to dream
   sqlite>

Phrases may be enclosed in double quotes:

   sqlite>select * from foo where content match '"to dream" "to sleep"';
   to sleep perchance to dream
   sqlite>

Joining full-text fields

A full-text table stores only one text column. To store records which conceptually have several full-text fields, create several full-text tables and join them together using record IDs. For example, suppose that we'd like to store a set of email messages; each message has a sender, recipient, subject and body. We'd like the subject and body to be full-text indexed. We can use the following schema:

   create table email(sender text, recipient text, subject_id int, body_id int);
   create virtual table email_subject using fulltext;
   create virtual table email_body using fulltext;

To find the IDs of all messages containing the word "jam" we can issue a query joining the email and email_body tables:

   select email.rowid
     from email join email_body on email.body_id = email_body.rowid
    where content match 'jam';

Tokenization

As the module indexes a piece of text, it converts the text to a sequence of tokens. Each token becomes a term in the index and can be matched using a full-text query.

The module currently uses the following generic tokenization mechanism. A token is a contiguous sequence of alphanumeric ASCII characters (A-Z, a-z and 0-9). All non-ASCII characters are ignored. Each token is converted to lowercase before it is stored in the index, so all full-text searches are case-insensitive. The module does not perform stemming of any sort.

Soon, we hope to allow applications to define their own tokenizers (we in fact already have a generic tokenizer mechanism in our code; we just have yet to expose it to the outside world).

Performance

There are two steps you can take to greatly improve performance when inserting documents into a full-text index. First, you can set the synchronous pragma to OFF:

   sqlite>pragma synchronous = off;

In our testing we've found that this dramatically increases indexing speed. Of course, you should study the SQLite documentation (see http://www.sqlite.org/pragma.html) to understand the safety/speed tradeoff which this pragma controls and to determine the setting which is right for your application.

Secondly, you can index more than one document per transaction. In our testing, we've found throughput to be best when we index at least 50 or so documents in each transaction; this dramatically improves performance over the one-document-per-transaction case.

We're still in the process of assessing and improving performance.

Caveats

Please note that the full-text database format is subject to change at any time. We are not planning to implement backward compatibility in updates in the near future, so new code releases may fail spectacularly with old databases. Of course this will change at some future point once our data structures become more stable. If backward compatibility is important for your application, let us know and we'll see what we can work out.

Missing features

The full-text module is still in an early development phase. The following features are missing but hopefully coming soon: