Page History

Full-text indexing for SQLite
Status: Draft (as of 25 Apr 2006)

Objective

Design and implement a full-text indexing and query system to be incorporated into SQLite. It should provide a number of useful features, but it should strive for efficiency and simplicity rather than the all-encompassing generality offered by implementations such as CLucene.

Background

Full-text indexing has also been incorporated into MySQL (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html), PostgreSQL (http://techdocs.postgresql.org/techdocs/fulltextindexing.php), and Microsoft SQL Server (http://support.microsoft.com/default.aspx?scid=kb;%5BLN%5D;323739). Their interfaces and designs, where available, will likely be useful in planning the corresponding SQLite module.
Interesting sets of test data include the Enron email dataset (http://www.cs.cmu.edu/~enron/), various TREC (Text Retrieval Conference) data sets (http://trec.nist.gov/data.html), the Calgary text-compression corpus (http://links.uwaterloo.ca/calgary.corpus.html), and assorted International English corpora from Cambridge (http://www.cambridge.org/elt/corpus/international_corpus.htm).

Overview

This project comprises two primary services: indexing and querying. Broad goals include

Indexing

Fast indexing in (soft) bounded time and memory
Small index size, primarily to improve speed by reducing disk I/O
Integration with SQL and especially with SQL transactions

Querying

Google-like query language
AND (perhaps implicitly), OR, NOT
Phrase queries
Field qualifiers (to:, from:, etc.)
Support for contextual snippets, although the snippets themsleves may not be provided
Sufficient integration with SQLite to allow complicated SELECT statements and table joins
Adequate speed; extreme query speed is not a goal of the project, since indexing speed is a much higher priority in the expected uses.
Relevance metrics are not a goal of the project, since relevance is highly application-dependent.
Soundex style queries to allow fuzzy approximate matching

Detailed Design

Many more issues must be considered before a more detailed design will be possible. Among them are the following:

Interface and hardware requirements

Text encoding and internationalization
RAM, CPU, and disk usage
OS support: Windows 9x/ME?

Core data structures and algorithms

API provided
Trie (prefix tree: http://en.wikipedia.org/wiki/Trie) vs. B-tree (http://en.wikipedia.org/wiki/B-tree) vs. something else
Unified or layered data
Update scheme: based on expected usage patterns, the module should be designed for frequent incremental updates of the index, rather than occasional, monolithic index revisions.
Compatibility with Lucene (http://lucene.apache.org/) and CLucene (http://clucene.sourceforge.net/)
Indexing data stored in the SQLite table or index; in the SQLite database file but separately; or in an entirely separate file
Text to be indexed required to be stored in a SQLite table, or allowed to be provided from some external source
Synchronous or asynchronous operation

Features

Built in, available via a plug-in component, or ignored?

Word segmentation rules
Prefix matching
Substring matching
Case folding
Combined case-sensitive and -insensitive matching
Punctuation matching
Stemming
Stop words

Proposed Programming Interface For Full-Text Search In SQLite

A full text index appears to the programmer to be a single table. This table is not directly realized on disk, however. It is a virtual table - similar in concept to a VIEW except that it is implemented using C code rather than as a SELECT statement. See VirtualTables for more information. The full text index table is an interface to C routines that do the appropriate searching, not a materialized table. Obviously, there must be some kind of real backing store - the backing store just is not a direct realization of the index table.

A full text index is created by calling the following C API on an open database connection:

   int sqlite3fts1_create(
     sqlite3 *db,
     const char *zTableName,
     sqlite3fts1_stemmer *xStemmerFunc,
     const char **azSubfields,
     int flags
   );

The db parameter is the pointer to an open SQLite database in which the full text index already exists or is to be created. zTableName is the name of the virtual table that will represent the full text index. xStemmerFunc is a user-supplied function for segmenting and stemming the input text. The details of the interface to this stemmer are to be determined. If the xStemmerFunc parameter is NULL then a default stemmer is used. The azSubfields parameter, if not NULL, is a list of auxiliary fields associated with each document. The list is terminated by a NULL pointer. More on this subject in the sequel. The flags parameter is an ORed set of flags that control optional features of the full text index.</p>

The sqlite3fts1_create() function creates a new virtual table named according to zTableName. It also creates, if they do not exist, several real tables and indices that serve as the backing store for the index. The real tables and indices are given names that are prefixed with zTableName. The real tables are intended for use by the virtual table only. Programmers should not attempt to make direct use of the real tables. The content of these backing-store tables is subject to change without notice and is not especially useful to the programmer. Direct modification of the backing-store tables will likely cause the full text index to malfunction.

When sqlite3fts1_create() is called, a new full text index is created if it does not already exist. If the full text index does already exist then only the virtual table interface is created. When calling sqlite3fts1_create() for an existing full text index, one must use the exact same set of parameters that originally created the full text index.

The virtual table that represents the full text index contains the columns listed below at a minimum. Additional columns might also be available depending on options.

docid	A document id is assigned to each document as it is added to the index. Document ids are monotonically increasing and are limited to 32 bits. The docid column serves as the INTEGER PRIMARY KEY for the virtual table. After inserting a document, the id assigned to that document is available through the sqlite3_last_insert_rowid() API. The docid column is read-only. The docid is created automatically by the virtual table and cannot be modified by the application.
doctext	The complete text of a document is inserted into this column in order to add a document to the full text index. This is an insert-only column. Attempts to read this column will return NULL. Attempts to update this column will raise an exception.
pattern	This column is used as a constraint in the WHERE clause of a SELECT statement in order to invoke a full text search. Attempts to read this column in any context other than a WHERE clause will return NULL. Attempts to write this column will raise an exception.
score	When performing a full text search, this column returns a integer which reflects the quality of the current match. Higher numbers are better.

Suppose a full text index has been created and named "ex1". To insert a new document into the index, invoke the following SQL:

  INSERT INTO ex1(doctext) VALUES(:txt);
  SELECT last_insert_rowid();

Bind the ":txt" parameter to the complete text of the document to be inserted, of course. The INSERT statement will cause the full text indexer to read and segment the document and make appropriate entries in its backing-store tables to enable it to find this document again based on a keyword search of its content. A new document ID will be computed automatically and made available via the sqlite3_last_insert_rowid() interface.

To remove a document from the full text index, use a DELETE statement as follows:

   DELETE FROM ex1 WHERE docid=:docid

Bind the document ID to the :docid parameter before executing the above statement, of course.

To find all documents that match a particular search pattern, use a SELECT statement as follows:

   SELECT docid, score FROM ex1 WHERE pattern=:pattern

In this case, bind the :pattern parameter to the pattern of keywords you are searching for. The SELECT statement will return the document ID and a score for each matching document. The score will be an integer which is larger for better matches.

Queries against the virtual ex1 table can appear in joins and in subqueries. For example, to sort the results by date, one might do a join as follows:

   SELECT docinfo.docname
     FROM ex1 JOIN docinfo USING(docid)
    WHERE pattern=:pattern
    ORDER BY docinfo.date DESC
    LIMIT 20

Search Patterns

A search pattern is a white-space separated list of keywords. Each keyword is passed through the stemmer before being submitted to the search engine. Hence, if the stemmer does case folding then the search will be case insensitive and if the stemmer does not do case folding then the search will be case sensitive. (If you use a different stemmer to search than was used to build the index then the results will be unpredictable and probably incorrect.)

Keywords separated by whitespace are assumed to be connected by a logical AND operator. You can also explicitly specify an AND if desired. The AND operator must be in all-caps. Keywords may also be separated by "OR" which means that documents will match that contain either the word to the left or the right of the "OR" (or both). The OR operator binds more tightly than AND. Use the "NOT" or a minus sign prefix to specify words which disqualify a document from consideration. Parentheses may be used to group operators.

A sequence of keywords in double-quotes is a phrase. A phrase will match only if all keywords appear in the document in consecutive order.

Glob characters "*", "?", and "[...]" can be used within keywords. The result is as if the search pattern had been expanded to include all words that match the glob pattern separated by OR. Be warned that wildcards at or near the beginning of a keyword or which match many words in the document lexicon can lead to long search times. Also note that wildcards might confuse the stemmer.

Any non-alphanumeric characters other than those mentioned above are passed to the stemmer. If the stemmer chooses to ignore them then they do not enter into the search. But if the stemmer does handle or accept the characters, then they become part of the search keyword.

Mismatched parentheses and other syntax errors in a search pattern are ignore. Search patterns never raise an exception no matter how badly formed. The full text search engine muddles through the best it can.

Options

The flags parameter to the sqlite3fts1_create() function is an ORed combination of the following flags:

SQLITE3FTS1_OFFSETS	If present, the virtual table will contain a read-only column named "offsets" that will supply the byte-offsets in the original document text to the beginning of each matching word in a match result.
SQLITE3FTS1_SNIPPET	If present, the virtual table will contain a read-only column named "snippet" which will return a string showing matching words in context for each match.

Subfields

The azSubfields parameter to sqlite3fts1_create() is a NULL-terminated list of pointers to strings where each string is the name of a subfield within the document text. If the azSubfields parameter is itself a NULL pointer or if it points to an empty list then subfields will not be supported by the full text index.

If subfields are supported, then the virtual table contains additional columns for each subfield. If the name of a subfield is "xyz" then the new columns are named "xyz" and "xyz_boost".

The "xyz" column is write-only and is used when a new document is added to the full text index. The "xyz" column is a text field contain a list of pairs of integer offsets into the body of the document. Each pair of offsets specifies the beginning and end of a substring in the document which are the contents of the specified field. The "xyz_boost" column is used in the WHERE clause of queries to specify multiplier for the score of keywords that are contained within the subfield.

When performing a search on an index with subfields, one can restrict keywords to appear only in that subfield by prepending the name of the subfield and a colon to the keyword.

For example, suppose the full text index is being used for an email application. The subfields might be "subject", "to", and "from". As each new email messages is being added to the index, strings containing integer pairs are inserted into the "subject", "to", and "from" columns that delimit the parts of the email message that contain the text of the subject, the recipients names and address, and the senders name and address.

    INSERT INTO emailidx(doctext,subject,to,from)
     VALUES(:doctext, :subject, :to, :from)

In this example, the :doctext parameter is bound to the complete text of the email message as before. The :subject parameter might be bound to a string like "134 168" to indicate that the text of the subject line occurs between characters 134 and 168 of :doctext. Similar bindings on :to and :from specify the locations in :doctext where the recipients and sender can be found. If a subfield occurs in more than one place in the body of the document, then additional offsets can be added to the list.

When querying, additional weight can be given to words that match the subject line, for example, by specifying a subject_boost in the WHERE clause:

   SELECT docid, score FROM emailidx
    WHERE pattern=:pattern AND subject_boost=2.5

The search string can contains keywords prefixed by "subject:" or "to:" or "from:" meaning that those keywords must occur within the specified subfields in order to match. For example, to find discussion of full text indexing on the SQLite mailing list, one might search for:

subject:sqlite "full text index"

sqlite - Full Text Index

Page History

Full-text indexing for SQLite Status: Draft (as of 25 Apr 2006)

Search Patterns

Options

Subfields

Full-text indexing for SQLite
Status: Draft (as of 25 Apr 2006)