SQLite CVSTrac

*** 1,381 ****
  <html>
  <h1 align="center">
! Full-text indexing for SQLite<br>
! Status: Draft (as of 25 Apr 2006)<br>
  </h1>
  </html>
  
! *Objective*
  
! Design and implement a full-text indexing and query system to be incorporated into SQLite.
! It should provide a number of useful features, but it should strive for efficiency and
! simplicity rather than the all-encompassing generality offered by implementations such as
! CLucene.
! 
! ----
! 
! *Background*
! 
! *: Full-text indexing has also been incorporated into MySQL (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html), PostgreSQL (http://techdocs.postgresql.org/techdocs/fulltextindexing.php), and Microsoft SQL Server (http://support.microsoft.com/default.aspx?scid=kb;%5BLN%5D;323739). Their interfaces and designs, where available, will likely be useful in planning the corresponding SQLite module.
! *: Interesting sets of test data include
! the Enron email dataset (http://www.cs.cmu.edu/~enron/),
! various TREC (Text Retrieval Conference) data sets (http://trec.nist.gov/data.html),
! the Calgary text-compression corpus (http://links.uwaterloo.ca/calgary.corpus.html), and
! assorted International English corpora from Cambridge (http://www.cambridge.org/elt/corpus/international_corpus.htm).
! 
! ----
! 
! *Overview*
! 
! This project comprises two primary services: indexing and querying. Broad goals include
! 
! *Indexing*
! 
! *: Fast indexing in (soft) bounded time and memory
! *: Small index size, primarily to improve speed by reducing disk I/O
! *: Integration with SQL and especially with SQL transactions
! 
! *Querying*
! 
! *: Google-like query language
! *: AND (perhaps implicitly), OR, NOT
! *: Phrase queries
! *: Field qualifiers (to:, from:, etc.)
! *: Support for contextual snippets, although the snippets themsleves may not be provided
! *: Sufficient integration with SQLite to allow complicated SELECT statements and table joins
! *: Adequate speed; extreme query speed is not a goal of the project, since indexing speed is a much higher priority in the expected uses.
! *: Relevance metrics are not a goal of the project, since relevance is highly application-dependent.
! *: Soundex style queries to allow fuzzy approximate matching
! 
! ----
! 
! *Detailed Design*
! 
! Many more issues must be considered before a more detailed design will be possible. Among them are the following:
! 
! *Interface and hardware requirements*
! 
! *: Text encoding and internationalization
! *: RAM, CPU, and disk usage
! *: OS support: Windows 9x/ME?
! 
! *Core data structures and algorithms*
! 
! *: API provided
! *: Trie (prefix tree: http://en.wikipedia.org/wiki/Trie) vs. B-tree (http://en.wikipedia.org/wiki/B-tree) vs. something else
! *: Unified or layered data
! *: Update scheme: based on expected usage patterns, the module should be designed for frequent incremental updates of the index, rather than occasional, monolithic index revisions.
! *: Compatibility with Lucene (http://lucene.apache.org/) and CLucene (http://clucene.sourceforge.net/)
! *: Indexing data stored in the SQLite table or index; in the SQLite database file but separately; or in an entirely separate file
! *: Text to be indexed required to be stored in a SQLite table, or allowed to be provided from some external source
! *: Synchronous or asynchronous operation
! 
! *Features*
! 
! Built in, available via a plug-in component, or ignored?
! 
! *: Word segmentation rules
! *: Prefix matching
! *: Substring matching
! *: Case folding
! *: Combined case-sensitive and -insensitive matching
! *: Punctuation matching
! *: Stemming
! *: Stop words
! 
! ----
! **Proposed Programming Interface For Full-Text Search In SQLite**
! 
! A full text index appears to the programmer to be a single
! table.  This table is not directly realized on disk, however.
! It is a virtual table - similar in concept to a VIEW except
! that it is implemented using C code rather than as a
! SELECT statement.  See VirtualTables for more information.
! The full text index table is an interface
! to C routines that do the appropriate searching, not a
! materialized table.  Obviously, there must be some kind of
! real backing store - the backing store just is not a direct
! realization of the index table.
! 
! A full text index is created by calling the following
! C API on an open database connection:
! 
!    int sqlite3fts1_create(
!      sqlite3 *db,
!      const char *zTableName,
!      sqlite3fts1_stemmer *xStemmerFunc,
!      const char **azSubfields,
!      int flags
!    );
! 
! The db parameter is the pointer to an open SQLite database
! in which the full text index already exists or is to be
! created.  zTableName is the name of the virtual table that
! will represent the full text index.  xStemmerFunc is a
! user-supplied function for segmenting and stemming the
! input text.  The details of the interface to this stemmer
! are to be determined.  If the xStemmerFunc parameter is NULL
! then a default stemmer is used.
! The azSubfields parameter, if not NULL, is a list of auxiliary
! fields associated with each document.  The list is terminated
! by a NULL pointer.  More on this subject in the sequel.
! The flags parameter is an ORed set of flags that control
! optional features of the full text index.</p>
! 
! The sqlite3fts1_create() function creates a new virtual table
! named according to zTableName.  It also creates, if they do not
! exist, several real tables and indices that serve as the backing
! store for the index.  The real tables and indices are given names
! that are prefixed with zTableName.  The real tables are intended
! for use by the virtual table only.  Programmers should not attempt
! to make direct use of the real tables.  The content of these
! backing-store tables is subject to change without notice and
! is not especially useful to the programmer.  Direct modification
! of the backing-store tables will likely cause the full text
! index to malfunction.
! 
! When sqlite3fts1_create() is called, a new full text index
! is created if it does not already exist.  If the full text
! index does already exist then only the virtual table interface
! is created.  When calling sqlite3fts1_create() for an existing
! full text index, one must use the exact same set of parameters
! that originally created the full text index.
! 
! The virtual table that represents the full text index contains
! the columns listed below at a minimum.  Additional columns might
! also be available depending on options.
! 
! <html>
! <table border="0" cellspacing="15">
! <tr>
! <td valign="top">docid</td>
! <td>
! A document id is assigned to each document as it is added
! to the index.  Document ids are monotonically increasing
! and are limited to 32 bits.
! The docid column serves as the INTEGER PRIMARY KEY for the
! virtual table.  After inserting a document, the id assigned
! to that document is available through the
! sqlite3_last_insert_rowid() API.
! The docid column is read-only.  The docid is created
! automatically by the virtual table and cannot be modified
! by the application.
! </td>
! </tr>
! <tr>
! <td valign="top">doctext</td>
! <td>
! The complete text of a document is inserted into this
! column in order to add a document to the full text index.
! This is an insert-only column.  Attempts to read this column
! will return NULL.  Attempts to update this column will raise
! an exception.
! </td>
! </tr>
! <tr>
! <td valign="top">pattern</td>
! <td>
! This column is used as a constraint in the WHERE clause of
! a SELECT statement in order to invoke a full text search.
! Attempts to read this column in any context other than
! a WHERE clause will return NULL.  Attempts to write this
! column will raise an exception.
! </td>
! </tr>
! <tr>
! <td valign="top">score</td>
! <td>
! When performing a full text search, this column returns
! a integer which reflects the quality of the current match.
! Higher numbers are better.
! </td>
! </tr>
! </table>
! 
! <p>Suppose a full text index has been created and named "ex1".
! To insert a new document into the index, invoke the following SQL:
! </p>
! 
! <blockquote><pre>
!   INSERT INTO ex1(doctext) VALUES(:txt);
!   SELECT last_insert_rowid();
! </pre></blockquote>
! 
! <p>Bind the ":txt" parameter to the complete text of the document
! to be inserted, of course.  The INSERT statement will cause the
! full text indexer to read and segment the document and make
! appropriate entries in its backing-store tables to enable it
! to find this document again based on a keyword search of its
! content.  A new document ID will be computed automatically and
! made available via the sqlite3_last_insert_rowid() interface.</p>
! 
! <p>
! To remove a document from the full text index, use a DELETE statement
! as follows:
! </p>
! 
! <blockquote><pre>
!    DELETE FROM ex1 WHERE docid=:docid
! </pre></blockquote>
! 
! <p>Bind the document ID to the :docid parameter before executing the
! above statement, of course.</p>
! 
! <p>To find all documents that match a particular search pattern,
! use a SELECT statement as follows:</p>
! 
! <blockquote><pre>
!    SELECT docid, score FROM ex1 WHERE pattern=:pattern
! </pre></blockquote>
! 
! <p>In this case, bind the :pattern parameter to the pattern of keywords
! you are searching for.  The SELECT statement will return the document ID
! and a score for each matching document.  The score will be an integer
! which is larger for better matches.</p>
! 
! <p>Queries against the virtual ex1 table can appear in joins and
! in subqueries.  For example, to sort the results by date, one might
! do a join as follows:</p>
! 
! <blockquote><pre>
!    SELECT docinfo.docname
!      FROM ex1 JOIN docinfo USING(docid)
!     WHERE pattern=:pattern
!     ORDER BY docinfo.date DESC
!     LIMIT 20
! </pre></blockquote>
! 
! <h3>Search Patterns</h3>
! 
! <p>A search pattern is a white-space separated list of keywords.
! Each keyword is passed through the stemmer before being submitted
! to the search engine.  Hence, if the stemmer does case folding then
! the search will be case insensitive and if the stemmer does not do
! case folding then the search will be case sensitive.  (If you use
! a different stemmer to search than was used to build the index then
! the results will be unpredictable and probably incorrect.)</p>
! 
! <p>Keywords separated by whitespace are assumed to be connected
! by a logical AND operator.  You can also explicitly specify an
! AND if desired.  The AND operator must be in all-caps.   Keywords
! may also be separated by "OR" which means that documents will
! match that contain either the word to the left or the right of
! the "OR" (or both). The OR operator binds more tightly than
! AND.  Use the "NOT" or a minus sign prefix to
! specify words which disqualify a document from consideration.
! Parentheses may be used to group operators.</p>
! 
! <p>A sequence of keywords in double-quotes is a phrase.
! A phrase will match only if all keywords appear in the document
! in consecutive order.</p>
! 
! <p>Glob characters "*", "?", and "[...]" can be used within
! keywords.  The result is as if the search pattern had been
! expanded to include all words that match the glob pattern
! separated by OR.  Be warned that wildcards at or near the beginning
! of a keyword or which match many words in the document lexicon
! can lead to long search times.  Also note that wildcards might
! confuse the stemmer.</p>
! 
! <p>Any non-alphanumeric characters other than those mentioned
! above are passed to the stemmer.  If the stemmer chooses to
! ignore them then they do not enter into the search.  But if
! the stemmer does handle or accept the characters, then they
! become part of the search keyword.</p>
! 
! <p>Mismatched parentheses and other syntax errors in a search
! pattern are ignore.
! Search patterns never raise an exception no matter how badly formed.
! The full text search engine muddles through the best it can.</p>
! 
! <h3>Options</h3>
! 
! <p>The flags parameter to the sqlite3fts1_create() function
! is an ORed combination of the following flags:</p>
! 
! <table border="0" cellspacing="15">
! <tr>
! <td valign="top">SQLITE3FTS1_OFFSETS</td>
! <td>
! If present, the virtual table will contain a read-only column named
! "offsets" that will supply the byte-offsets in the original document
! text to the beginning of each matching word in a match result.
! </td>
! </tr>
! <tr>
! <td valign="top">SQLITE3FTS1_SNIPPET</td>
! <td>
! If present, the virtual table will contain a read-only column
! named "snippet" which will return a string showing matching
! words in context for each match.
! </td>
! </tr>
! </table>
! 
! <h3>Subfields</h3>
! 
! <p>The azSubfields parameter to sqlite3fts1_create() is a NULL-terminated
! list of pointers to strings where each string is the name of a
! subfield within the document text.  If the azSubfields parameter is
! itself a NULL pointer or if it points to an empty list then subfields
! will not be supported by the full text index.</p>
! 
! <p>If subfields are supported, then the virtual table contains
! additional columns for each subfield.  If the name of a subfield
! is "xyz" then the new columns are named "xyz" and "xyz_boost".</p>
! 
! <p>The "xyz" column is write-only and is used when a new document
! is added to the full text index.  The "xyz" column is a text
! field contain a list of pairs of integer offsets into the body of the
! document.  Each pair of offsets specifies the beginning and end
! of a substring in the document which are the contents of the
! specified field.  The "xyz_boost" column is used in the WHERE clause
! of queries to specify multiplier for the score of keywords that
! are contained within the subfield.</p>
! 
! <p>When performing a search on an index with subfields, one can
! restrict keywords to appear only in that subfield by prepending the
! name of the subfield and a colon to the keyword.</p>
! 
! <p>For example, suppose the full text index is being used for an
! email application.  The subfields might be "subject", "to", and "from".
! As each new email messages is being added to the index, strings containing
! integer pairs are inserted into the "subject", "to", and "from" columns
! that delimit the parts of the email message that contain the text of
! the subject, the recipients names and address, and the senders
! name and address.</p>
! 
! <blockquote><pre>
!     INSERT INTO emailidx(doctext,subject,to,from)
!      VALUES(:doctext, :subject, :to, :from)
! </pre></blockquote>
! 
! <p>In this example, the :doctext parameter is bound to the complete
! text of the email message as before.  The :subject parameter might
! be bound to a string like "134 168" to indicate that the text of
! the subject line occurs between characters 134 and 168 of :doctext.
! Similar bindings on :to and :from specify the locations in :doctext
! where the recipients and sender can be found.  If a subfield occurs
! in more than one place in the body of the document, then additional
! offsets can be added to the list.</p>
! 
! <p>When querying, additional weight can be given to words that
! match the subject line, for example, by specifying a subject_boost
! in the WHERE clause:</p>
! 
! <blockquote><pre>
!    SELECT docid, score FROM emailidx
!     WHERE pattern=:pattern AND subject_boost=2.5
! </pre></blockquote>
! 
! <p>The search string can contains keywords prefixed by "subject:"
! or "to:" or "from:" meaning that those keywords must occur within
! the specified subfields in order to match.  For example, to find
! discussion of full text indexing on the SQLite mailing list, one
! might search for:</p>
! 
! <blockquote>
!   subject:sqlite "full text index"
! </blockquote>
  
! </html>
--- 1,130 ----
  <html>
  <h1 align="center">
! Full-text Search for SQLite<br>
! (as of 2006-08-23)<br>
  </h1>
  </html>
  
! This is preliminary documentation for a full-text indexing module for SQLite
  
! **Getting the full-text module**
  
! The full-text module is in the SQLite CVS tree in the ext/fts1 directory.
! Building
! 
! The module can be built either as a standalone shared library, or statically linked into the SQLite library.
! 
! *As a shared library*
! 
! To build the module as a shared library, compile all source files in the fts1 directory into a shared library (.so or .dll) on your platform.  (Sorry - there's no makefile checked in yet.  Coming soon.)
! 
! *Statically linked*
! 
! To statically link the module into SQLite, add all .c files from the fts1   directory to the Makefile you use to compile SQLite so that they will be linked into the SQLite image.  You must define the preprocessor symbol SQLITE_CORE when compiling these files.
! 
! **Initializing**
! 
! *As a shared library*
! 
! When the module is built as a shared library, you can load it into SQLite using the ".load" shell command.
! 
!   sqlite> .load fulltext
! 
! Or you can load it using a SELECT statement:
! 
!   SELECT load_extension('fulltext');
! 
! Note that you may need to call sqlite3_enable_load_extension before loading the extension; see the SQLite LoadableExtensions documentation.
! 
! *Statically linked*
! 
! In a static build, you must call the fulltext_init function to be able to use full-text indexing on any given database connection.  This function is defined in fulltext.h and has the following prototype:
! 
!   int fulltext_init(sqlite3 *db);
! 
! fulltext_init returns SQLITE_OK on success or an SQLite error code otherwise.
! 
! **Using**
! 
! Full-text tables store fully indexed text.  You can create a full-text table using the CREATE VIRTUAL TABLE statement:
! 
!   sqlite>create virtual table foo using fulltext;
! 
! A full-text table has a single column content of type text:
! 
!    sqlite>insert into foo (content) values ('to sleep perchance to dream');
!    sqlite>insert into foo (content) values ('to be or not to be');
! 
! Every row has a unique rowid, just as in any other SQLite table:
! 
!    sqlite>select rowid, * from foo;
!    1|to sleep perchance to dream
!    2|to be or not to be
!    sqlite>
! 
! The MATCH operator performs a full-text match on the content column in a full-text table:
! 
!    sqlite>select * from foo where content match 'dream';
!    to sleep perchance to dream
!    sqlite>
! 
! *Query language*
! 
! A query may contain multiple terms, in which case it will return only documents containing all of the terms:
! 
!    sqlite>select * from foo where content match 'dream sleep';
!    to sleep perchance to dream
!    sqlite>
! 
! Phrases may be enclosed in double quotes:
! 
!    sqlite>select * from foo where content match '"to dream" "to sleep"';
!    to sleep perchance to dream
!    sqlite>
! 
! *Joining full-text fields*
! 
! A full-text table stores only one text column.  To store records which conceptually have several full-text fields, create several full-text tables and join them together using record IDs.  For example, suppose that we'd like to store a set of email messages; each message has a sender, recipient, subject and body.  We'd like the subject and body to be full-text indexed.  We can use the following schema:
! 
!    create table email(sender text, recipient text, subject_id int, body_id int);
!    create virtual table email_subject using fulltext;
!    create virtual table email_body using fulltext;
! 
! To find the IDs of all messages containing the word "jam" we can issue a query joining the email and email_body tables:
! 
!    select email.rowid
!      from email join email_body on email.body_id = email_body.rowid
!     where content match 'jam';
! 
! *Tokenization*
! 
! As the module indexes a piece of text, it converts the text to a sequence of tokens.  Each token becomes a term in the index and can be matched using a full-text query.
! 
! The module currently uses the following generic tokenization mechanism.  A token is a contiguous sequence of alphanumeric ASCII characters (A-Z, a-z and 0-9).  All non-ASCII characters are ignored.  Each token is converted to lowercase before it is stored in the index, so all full-text searches are case-insensitive.  The module does not perform stemming of any sort.
! 
! Soon, we hope to allow applications to define their own tokenizers (we in fact already have a generic tokenizer mechanism in our code; we just have yet to expose it to the outside world).
! 
! *Performance*
! 
! There are two steps you can take to greatly improve performance when inserting documents into a full-text index.  First, you can set the synchronous pragma to OFF:
! 
!    sqlite>pragma synchronous = off;
! 
! In our testing we've found that this dramatically increases indexing speed.  Of course, you should study the SQLite documentation (see http://www.sqlite.org/pragma.html) to understand the safety/speed tradeoff which this pragma controls and to determine the setting which is right for your application.
! 
! Secondly, you can index more than one document per transaction.  In our testing, we've found throughput to be best when we index at least 50 or so documents in each transaction; this dramatically improves performance over the one-document-per-transaction case.
! 
! We're still in the process of assessing and improving performance.
! 
! **Caveats**
! 
! Please note that the full-text database format is subject to change at any time.  We are not planning to implement backward compatibility in updates in the near future, so new code releases may fail spectacularly with old databases.  Of course this will change at some future point once our data structures become more stable.  If backward compatibility is important for your application, let us know and we'll see what we can work out.
! 
! **Missing features**
! 
! The full-text module is still in an early development phase.  The following features are missing but hopefully coming soon:
! 
! *: We plan to support NOT queries (using the syntax "-foo"), OR queries (e.g. "cat OR dog") and prefix queries (e.g. "foo*").
! *: Applications will be able to specify custom tokenizers.
! *: It will be possible to update text in full-text tables (at the moment only inserts and deletes work).
! *: Full-text queries will return the character offset where each match occurred; this will allow callers to generate snippets conveniently.

sqlite - Full Text Index