*** 83,90 **** *: Stop words ---- ! *Proposed Interface* ! (2006-Apr-28) A full text index appears to the programmer to be a single table. This table is not directly realized on disk, however. --- 83,89 ---- *: Stop words ---- ! **Proposed Programming Interface For Full-Text Search In SQLite** A full text index appears to the programmer to be a single table. This table is not directly realized on disk, however. *************** *** 97,235 **** real backing store - the backing store just is not a direct realization of the index table. ! Assuming the full text index is named EX1 then the table ! appears to have the following definition: ! CREATE TABLE ex1( ! docid INTEGER PRIMARY KEY, ! doctext TEXT, ! score REAL, ! snippet TEXT, ! pattern TEXT ); ! There is a special API that creates this table. It is not ! created by executing a CREATE TABLE statement. The CREATE ! TABLE statement above is there to demonstrate the structure ! of the table, not to show how the table is created. ! ! The actual storage for the full text index might be in the ! original database file or it might be in auxiliary files. ! That decision is made in the code that implements the virtual ! table and is transparent to the user. ! ! Add documents to the index by issuing SQL statements like ! this: ! ! INSERT INTO ex1(doctxt) VALUES(:docbody); ! ! Because EX1.DOCID is an integer primary key, a new value ! for the document id is assigned automatically and is accessible ! using sqlite3_last_insert_rowid() in the usual way. ! ! The EX1.SCORE, EX1.SNIPPET, and EX1.PATTERN columns are ! intended for reading, not writting. Attempts to write ! anything into those columns (other than NULL) will result ! in a constraint error. ! ! The full text index may or may not store the original document ! text. If the original document is not stored, then attempts ! to read the EX1.DOCTEXT column will return a NULL. ! ! The text segmenter, stemmer, and the stop-word list are ! all built into the indexer. To use a different segmenter, ! stemmer, or stop-word list, register a different indexer ! object. If you have a situation where you what multiple ! segmenters to be used in a single index, then you can ! define an additional column in the virtual table that ! specifies which segmenter to use. For example: ! ! CREATE TABLE ex2( ! docid INTEGER PRIMARY KEY, ! doctext TEXT, ! doctype TEXT, -- Specifies which segmenter to use ! score REAL, ! snippet TEXT, ! pattern TEXT ! ); ! ! Then ! ! INSERT INTO ex2(doctype,doctext) VALUES('html', :docbody); ! The knowledge of what segmenter names are valid will ! still have to be built into the virtual table object. ! If an unknown document type is specified, the virtual ! table object will raise an exception on the insert. ! ! This concept of using a table as an interface instead of ! as bulk storage is unsettling to some people at first. But ! the more you think about it, the more the concept will grow ! on you. ! ! Documents can be removed from the index using ! ! DELETE FROM ex1 WHERE docid=:id ! ! To search for documents, do a select with a WHERE clause ! constraint on the EX1.PATTERN column, like this: ! ! SELECT docid FROM ex1 ! WHERE pattern='"George Washington" slept here'; ! ! The EX1.PATTERN column does not represent anything that ! is actually stored. This is just a way to specify the search ! pattern. If you query for the EX1.PATTERN column you will ! always get back a NULL. The EX1.PATTERN column is really ! only useful in WHERE clause constraints. In any other context ! it always appears to be NULL. ! ! The string that EX1.PATTERN is compared against will be ! parsed by the indexer. In this case, we have the two ! keywords George and Washington in double-quotes, indicating ! that this is a phrase search. Only documents where these ! words occur together will match. Other kinds of search ! syntaxes can be used. It is up to the indexer implementation ! to parser and interpret the search pattern. To SQLite, ! the pattern is just a string. ! ! The EX1 table has some unusual properties (such as the ! strange behavior of the PATTERN column) but in other respects ! it can be used like any other table. The EX1 table can appear ! in a JOIN or a subquery just like any other table. And you ! can create VIEWs on the EX1 table. But you cannot DROP the ! EX1 table or create additional indices on it. ! ! Additional information about the search result is available ! from the SCORE and SNIPPET columns. The SCORE is some kind ! of score attached to the pattern match and SNIPPET is a ! context snippet for the match. These are not real columns. ! Both are computed only if requested. But they can be used ! like any other column. Consider this example: ! ! SELECT docinfo.title, ex1.snippet, docinfo.link FROM ex1 JOIN docinfo USING(docid) ! WHERE ex1.pattern='"George Washington" slept here' ! AND ex1.score>0.5 ! ORDER BY docinfo.date DESC, ex1.score DESC ! LIMIT 20; ! ! All of this is implemented using a the VirtualTables interface ! to SQLite. ! ! *Advantages and Disadvantages Of The Interface Proposal Above* ! ! Advantages: ! ! *: The VirtualTables mechanism has been suggested before and has ! other uses outside of full-text search. ! ! *: Because the full-text search is a generic add-on, we can ! experiment with different full-text search algorithms. Users ! can add their own algorithms simply by adding a new or modified ! virtual table. There is no need to hack on the SQLite core in order ! to substitute an alternative full-text search algorithm. ! ! Disadvantages: ! *: No other database engine that we know of does it this way. --- 96,380 ---- real backing store - the backing store just is not a direct realization of the index table. ! A full text index is created by calling the following ! C API on an open database connection: ! int sqlite3fts1_create( ! sqlite3 *db, ! const char *zTableName, ! sqlite3fts1_stemmer *xStemmerFunc, ! const char **azSubfields, ! int flags ); ! The db parameter is the pointer to an open SQLite database ! in which the full text index already exists or is to be ! created. zTableName is the name of the virtual table that ! will represent the full text index. xStemmerFunc is a ! user-supplied function for segmenting and stemming the ! input text. The details of the interface to this stemmer ! are to be determined. If the xStemmerFunc parameter is NULL ! then a default stemmer is used. ! The azSubfields parameter, if not NULL, is a list of auxiliary ! fields associated with each document. The list is terminated ! by a NULL pointer. More on this subject in the sequel. ! The flags parameter is an ORed set of flags that control ! optional features of the full text index.</p> ! ! The sqlite3fts1_create() function creates a new virtual table ! named according to zTableName. It also creates, if they do not ! exist, several real tables and indices that serve as the backing ! store for the index. The real tables and indices are given names ! that are prefixed with zTableName. The real tables are intended ! for use by the virtual table only. Programmers should not attempt ! to make direct use of the real tables. The content of these ! backing-store tables is subject to change without notice and ! is not especially useful to the programmer. Direct modification ! of the backing-store tables will likely cause the full text ! index to malfunction. ! ! When sqlite3fts1_create() is called, a new full text index ! is created if it does not already exist. If the full text ! index does already exist then only the virtual table interface ! is created. When calling sqlite3fts1_create() for an existing ! full text index, one must use the exact same set of parameters ! that originally created the full text index. ! ! The virtual table that represents the full text index contains ! the columns listed below at a minimum. Additional columns might ! also be available depending on options. ! ! <html> ! <table border="0" cellspacing="15"> ! <tr> ! <td valign="top">docid</td> ! <td> ! A document id is assigned to each document as it is added ! to the index. Document ids are monotonically increasing ! and are limited to 32 bits. ! The docid column serves as the INTEGER PRIMARY KEY for the ! virtual table. After inserting a document, the id assigned ! to that document is available through the ! sqlite3_last_insert_rowid() API. ! The docid column is read-only. The docid is created ! automatically by the virtual table and cannot be modified ! by the application. ! </td> ! </tr> ! <tr> ! <td valign="top">doctext</td> ! <td> ! The complete text of a document is inserted into this ! column in order to add a document to the full text index. ! This is an insert-only column. Attempts to read this column ! will return NULL. Attempts to update this column will raise ! an exception. ! </td> ! </tr> ! <tr> ! <td valign="top">pattern</td> ! <td> ! This column is used as a constraint in the WHERE clause of ! a SELECT statement in order to invoke a full text search. ! Attempts to read this column in any context other than ! a WHERE clause will return NULL. Attempts to write this ! column will raise an exception. ! </td> ! </tr> ! <tr> ! <td valign="top">score</td> ! <td> ! When performing a full text search, this column returns ! a integer which reflects the quality of the current match. ! Higher numbers are better. ! </td> ! </tr> ! </table> ! ! <p>Suppose a full text index has been created and named "ex1". ! To insert a new document into the index, invoke the following SQL: ! </p> ! ! <blockquote><pre> ! INSERT INTO ex1(doctext) VALUES(:txt); ! SELECT last_insert_rowid(); ! </pre></blockquote> ! ! <p>Bind the ":txt" parameter to the complete text of the document ! to be inserted, of course. The INSERT statement will cause the ! full text indexer to read and segment the document and make ! appropriate entries in its backing-store tables to enable it ! to find this document again based on a keyword search of its ! content. A new document ID will be computed automatically and ! made available via the sqlite3_last_insert_rowid() interface.</p> ! ! <p> ! To remove a document from the full text index, use a DELETE statement ! as follows: ! </p> ! ! <blockquote><pre> ! DELETE FROM ex1 WHERE docid=:docid ! </pre></blockquote> ! ! <p>Bind the document ID to the :docid parameter before executing the ! above statement, of course.</p> ! ! <p>To find all documents that match a particular search pattern, ! use a SELECT statement as follows:</p> ! ! <blockquote><pre> ! SELECT docid, score FROM ex1 WHERE pattern=:pattern ! </pre></blockquote> ! ! <p>In this case, bind the :pattern parameter to the pattern of keywords ! you are searching for. The SELECT statement will return the document ID ! and a score for each matching document. The score will be an integer ! which is larger for better matches.</p> ! ! <p>Queries against the virtual ex1 table can appear in joins and ! in subqueries. For example, to sort the results by date, one might ! do a join as follows:</p> ! <blockquote><pre> ! SELECT docinfo.docname FROM ex1 JOIN docinfo USING(docid) ! WHERE pattern=:pattern ! ORDER BY docinfo.date DESC ! LIMIT 20 ! </pre></blockquote> ! ! <h3>Search Patterns</h3> ! ! <p>A search pattern is a white-space separated list of keywords. ! Each keyword is passed through the stemmer before being submitted ! to the search engine. Hence, if the stemmer does case folding then ! the search will be case insensitive and if the stemmer does not do ! case folding then the search will be case sensitive. (If you use ! a different stemmer to search than was used to build the index then ! the results will be unpredictable and probably incorrect.)</p> ! ! <p>Keywords separated by whitespace are assumed to be connected ! by a logical AND operator. You can also explicitly specify an ! AND if desired. The AND operator must be in all-caps. Keywords ! may also be separated by "OR" which means that documents will ! match that contain either the word to the left or the right of ! the "OR" (or both). The OR operator binds more tightly than ! AND. Use the "NOT" or a minus sign prefix to ! specify words which disqualify a document from consideration. ! Parentheses may be used to group operators.</p> ! ! <p>A sequence of keywords in double-quotes is a phrase. ! A phrase will match only if all keywords appear in the document ! in consecutive order.</p> ! ! <p>Glob characters "*", "?", and "[...]" can be used within ! keywords. The result is as if the search pattern had been ! expanded to include all words that match the glob pattern ! separated by OR. Be warned that wildcards at or near the beginning ! of a keyword or which match many words in the document lexicon ! can lead to long search times. Also note that wildcards might ! confuse the stemmer.</p> ! ! <p>Any non-alphanumeric characters other than those mentioned ! above are passed to the stemmer. If the stemmer chooses to ! ignore them then they do not enter into the search. But if ! the stemmer does handle or accept the characters, then they ! become part of the search keyword.</p> ! ! <p>Mismatched parentheses and other syntax errors in a search ! pattern are ignore. ! Search patterns never raise an exception no matter how badly formed. ! The full text search engine muddles through the best it can.</p> ! ! <h3>Options</h3> ! ! <p>The flags parameter to the sqlite3fts1_create() function ! is an ORed combination of the following flags:</p> ! ! <table border="0" cellspacing="15"> ! <tr> ! <td valign="top">SQLITE3FTS1_OFFSETS</td> ! <td> ! If present, the virtual table will contain a read-only column named ! "offsets" that will supply the byte-offsets in the original document ! text to the beginning of each matching word in a match result. ! </td> ! </tr> ! <tr> ! <td valign="top">SQLITE3FTS1_SNIPPET</td> ! <td> ! If present, the virtual table will contain a read-only column ! named "snippet" which will return a string showing matching ! words in context for each match. ! </td> ! </tr> ! </table> ! ! <h3>Subfields</h3> ! ! <p>The azSubfields parameter to sqlite3fts1_create() is a NULL-terminated ! list of pointers to strings where each string is the name of a ! subfield within the document text. If the azSubfields parameter is ! itself a NULL pointer or if it points to an empty list then subfields ! will not be supported by the full text index.</p> ! ! <p>If subfields are supported, then the virtual table contains ! additional columns for each subfield. If the name of a subfield ! is "xyz" then the new columns are named "xyz" and "xyz_boost".</p> ! ! <p>The "xyz" column is write-only and is used when a new document ! is added to the full text index. The "xyz" column is a text ! field contain a list of pairs of integer offsets into the body of the ! document. Each pair of offsets specifies the beginning and end ! of a substring in the document which are the contents of the ! specified field. The "xyz_boost" column is used in the WHERE clause ! of queries to specify multiplier for the score of keywords that ! are contained within the subfield.</p> ! ! <p>When performing a search on an index with subfields, one can ! restrict keywords to appear only in that subfield by prepending the ! name of the subfield and a colon to the keyword.</p> ! ! <p>For example, suppose the full text index is being used for an ! email application. The subfields might be "subject", "to", and "from". ! As each new email messages is being added to the index, strings containing ! integer pairs are inserted into the "subject", "to", and "from" columns ! that delimit the parts of the email message that contain the text of ! the subject, the recipients names and address, and the senders ! name and address.</p> ! ! <blockquote><pre> ! INSERT INTO emailidx(doctext,subject,to,from) ! VALUES(:doctext, :subject, :to, :from) ! </pre></blockquote> ! ! <p>In this example, the :doctext parameter is bound to the complete ! text of the email message as before. The :subject parameter might ! be bound to a string like "134 168" to indicate that the text of ! the subject line occurs between characters 134 and 168 of :doctext. ! Similar bindings on :to and :from specify the locations in :doctext ! where the recipients and sender can be found. If a subfield occurs ! in more than one place in the body of the document, then additional ! offsets can be added to the list.</p> ! ! <p>When querying, additional weight can be given to words that ! match the subject line, for example, by specifying a subject_boost ! in the WHERE clause:</p> ! ! <blockquote><pre> ! SELECT docid, score FROM emailidx ! WHERE pattern=:pattern AND subject_boost=2.5 ! </pre></blockquote> ! ! <p>The search string can contains keywords prefixed by "subject:" ! or "to:" or "from:" meaning that those keywords must occur within ! the specified subfields in order to match. For example, to find ! discussion of full text indexing on the SQLite mailing list, one ! might search for:</p> ! ! <blockquote> ! subject:sqlite "full text index" ! </blockquote> ! </html>