Page History
- 2007-Sep-03 18:13 anonymous
- 2006-Oct-12 23:41 shess
- 2006-Sep-29 21:16 anonymous
- 2006-Sep-24 23:34 anonymous
- 2006-Sep-20 19:41 adamd
- 2006-Sep-19 22:40 adamd
- 2006-Sep-19 22:34 adamd
- 2006-Aug-24 00:47 drh
- 2006-May-18 21:23 anonymous
- 2006-May-05 16:21 drh
- 2006-Apr-28 17:01 anonymous
- 2006-Apr-28 15:45 drh
- 2006-Apr-28 15:37 drh
- 2006-Apr-26 01:48 anonymous
- 2006-Apr-26 01:42 drh
Full-text indexing for SQLite
Status: Draft (as of 25 Apr 2006)
Design and implement a full-text indexing and query system to be incorporated into SQLite. It should provide a number of useful features, but it should strive for efficiency and simplicity rather than the all-encompassing generality offered by implementations such as CLucene.
Background
- Full-text indexing has also been incorporated into MySQL (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html), PostgreSQL (http://techdocs.postgresql.org/techdocs/fulltextindexing.php), and Microsoft SQL Server (http://support.microsoft.com/default.aspx?scid=kb;%5BLN%5D;323739). Their interfaces and designs, where available, will likely be useful in planning the corresponding SQLite module.
- Interesting sets of test data include the Enron email dataset (http://www.cs.cmu.edu/~enron/), various TREC (Text Retrieval Conference) data sets (http://trec.nist.gov/data.html), the Calgary text-compression corpus (http://links.uwaterloo.ca/calgary.corpus.html), and assorted International English corpora from Cambridge (http://www.cambridge.org/elt/corpus/international_corpus.htm).
Overview
This project comprises two primary services: indexing and querying. Broad goals include
Indexing
- Fast indexing in (soft) bounded time and memory
- Small index size, primarily to improve speed by reducing disk I/O
- Integration with SQL and especially with SQL transactions
Querying
- Google-like query language
- AND (perhaps implicitly), OR, NOT
- Phrase queries
- Field qualifiers (to:, from:, etc.)
- Support for contextual snippets, although the snippets themsleves may not be provided
- Sufficient integration with SQLite to allow complicated SELECT statements and table joins
- Adequate speed; extreme query speed is not a goal of the project, since indexing speed is a much higher priority in the expected uses.
- Relevance metrics are not a goal of the project, since relevance is highly application-dependent.
Detailed Design
Many more issues must be considered before a more detailed design will be possible. Among them are the following:
Interface and hardware requirements
- Text encoding and internationalization
- RAM, CPU, and disk usage
- OS support: Windows 9x/ME?
Core data structures and algorithms
- API provided
- Trie (prefix tree: http://en.wikipedia.org/wiki/Trie) vs. B-tree (http://en.wikipedia.org/wiki/B-tree) vs. something else
- Unified or layered data
- Update scheme: based on expected usage patterns, the module should be designed for frequent incremental updates of the index, rather than occasional, monolithic index revisions.
- Compatibility with Lucene (http://lucene.apache.org/) and CLucene (http://clucene.sourceforge.net/)
- Indexing data stored in the SQLite table or index; in the SQLite database file but separately; or in an entirely separate file
- Text to be indexed required to be stored in a SQLite table, or allowed to be provided from some external source
- Synchronous or asynchronous operation
Features
Built in, available via a plug-in component, or ignored?
- Word segmentation rules
- Prefix matching
- Substring matching
- Case folding
- Combined case-sensitive and -insensitive matching
- Punctuation matching
- Stemming
- Stop words