Page History

Full-text indexing for SQLite
Status: Draft (as of 25 Apr 2006)

Objective

Design and implement a full-text indexing and query system to be incorporated into SQLite. It should provide a number of useful features, but it should strive for efficiency and simplicity rather than the all-encompassing generality offered by implementations such as CLucene.

Background

Full-text indexing has also been incorporated into MySQL (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html), PostgreSQL (http://techdocs.postgresql.org/techdocs/fulltextindexing.php), and Microsoft SQL Server (http://support.microsoft.com/default.aspx?scid=kb;%5BLN%5D;323739). Their interfaces and designs, where available, will likely be useful in planning the corresponding SQLite module.
Interesting sets of test data include the Enron email dataset (http://www.cs.cmu.edu/~enron/), various TREC (Text Retrieval Conference) data sets (http://trec.nist.gov/data.html), the Calgary text-compression corpus (http://links.uwaterloo.ca/calgary.corpus.html), and assorted International English corpora from Cambridge (http://www.cambridge.org/elt/corpus/international_corpus.htm).

Overview

This project comprises two primary services: indexing and querying. Broad goals include

Indexing

Fast indexing in (soft) bounded time and memory
Small index size, primarily to improve speed by reducing disk I/O
Integration with SQL and especially with SQL transactions

Querying

Google-like query language
AND (perhaps implicitly), OR, NOT
Phrase queries
Field qualifiers (to:, from:, etc.)
Support for contextual snippets, although the snippets themsleves may not be provided
Sufficient integration with SQLite to allow complicated SELECT statements and table joins
Adequate speed; extreme query speed is not a goal of the project, since indexing speed is a much higher priority in the expected uses.
Relevance metrics are not a goal of the project, since relevance is highly application-dependent.

Detailed Design

Many more issues must be considered before a more detailed design will be possible. Among them are the following:

Interface and hardware requirements

Text encoding and internationalization
RAM, CPU, and disk usage
OS support: Windows 9x/ME?

Core data structures and algorithms

API provided
Trie (prefix tree: http://en.wikipedia.org/wiki/Trie) vs. B-tree (http://en.wikipedia.org/wiki/B-tree) vs. something else
Unified or layered data
Update scheme: based on expected usage patterns, the module should be designed for frequent incremental updates of the index, rather than occasional, monolithic index revisions.
Compatibility with Lucene (http://lucene.apache.org/) and CLucene (http://clucene.sourceforge.net/)
Indexing data stored in the SQLite table or index; in the SQLite database file but separately; or in an entirely separate file
Text to be indexed required to be stored in a SQLite table, or allowed to be provided from some external source
Synchronous or asynchronous operation

Features

Built in, available via a plug-in component, or ignored?

Word segmentation rules
Prefix matching
Substring matching
Case folding
Combined case-sensitive and -insensitive matching
Punctuation matching
Stemming
Stop words

sqlite - Full Text Index

Page History

Full-text indexing for SQLite Status: Draft (as of 25 Apr 2006)

Full-text indexing for SQLite
Status: Draft (as of 25 Apr 2006)