Full-text indexing for SQLite
Status: Draft (as of 25 Apr 2006)

*Objective* Design and implement a full-text indexing and query system to be incorporated into SQLite. It should provide a number of useful features, but it should strive for efficiency and simplicity rather than the all-encompassing generality offered by implementations such as CLucene. ---- *Background* *: Full-text indexing has also been incorporated into MySQL (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html), PostgreSQL (http://techdocs.postgresql.org/techdocs/fulltextindexing.php), and Microsoft SQL Server (http://support.microsoft.com/default.aspx?scid=kb;%5BLN%5D;323739).*:Their interfaces and designs, where available, will likely be useful in planning Cinnamon. *: Interesting sets of test data include the Enron email dataset (http://www.cs.cmu.edu/~enron/), various TREC (Text Retrieval Conference) data sets (http://trec.nist.gov/data.html), the Calgary text-compression corpus (http://links.uwaterloo.ca/calgary.corpus.html), and assorted International English corpora from Cambridge (http://www.cambridge.org/elt/corpus/international_corpus.htm) ---- *Overview* This project comprises two primary services: indexing and querying. Broad goals include *Indexing* *: Fast indexing in (soft) bounded time and memory *: Small index size, primarily to improve speed by reducing disk I/O *: Integration with SQL and especially with SQL transactions *Querying* *: Google-like query language *: AND (perhaps implicitly), OR, NOT *: Phrase queries *: Field qualifiers (to:, from:, etc.) *: Support for contextual snippets, although the snippets themsleves may not be provided *: Sufficient integration with SQLite to allow complicated SELECT statements and table joins *: Adequate speed; extreme query speed is not a goal of the project, since indexing speed is a much higher priority in the expected uses. *: Relevance metrics are not a goal of the project, since relevance is highly application-dependent. ---- *Detailed Design* Many more issues must be considered before a more detailed design will be possible. Among them are the following: *Interface and hardware requirements* *: Text encoding and internationalization *: RAM, CPU, and disk usage *: OS support: Windows 9x/ME? *Core data structures and algorithms* *: API provided *: Trie (prefix tree: http://en.wikipedia.org/wiki/Trie) vs. B-tree (http://en.wikipedia.org/wiki/B-tree) vs. something else *: Unified or layered data *: Update scheme: based on expected usage patterns, the module should be designed for frequent incremental updates of the index, rather than occasional, monolithic index revisions. *: Compatibility with Lucene (http://lucene.apache.org/)/CLucene (http://clucene.sourceforge.net/) *: Indexing data stored in the SQLite table or index; in the SQLite database file but separately; or in an entirely separate file *: Text to be indexed required to be stored in a SQLite table, or allowed to be provided from some external source *: Synchronous or asynchronous operation *Features* Built in, available via a plug-in component, or ignored? *: Word segmentation rules *: Prefix matching *: Substring matching *: Case folding *: Combined case-sensitive and -insensitive matching *: Punctuation matching *: Stemming *: Stop words