*** 81,83 **** --- 81,218 ---- *: Punctuation matching *: Stemming *: Stop words + + ---- + *Proposed Interface* + (2006-Apr-28) + + A full text index appears to the programmer to be a single + table. This table is not directly realized on disk, however. + It is a virtual table - similar in concept to a VIEW except + that it is implemented using C code rather than as a + SELECT statement. See VirtualTables for more information. + The full text index table is an interface + to C routines that do the appropriate searching, not a + materialized table. Obviously, there must be some kind of + real backing store - the backing store just is not a direct + realization of the index table. + + Assuming the full text index is named EX1 then the table + appears to have the following definition: + + CREATE TABLE ex1( + docid INTEGER PRIMARY KEY, + doctext TEXT, + score REAL, + snippet TEXT, + pattern TEXT + ); + + There is a special API that creates this table. It is not + created by executing a CREATE TABLE statement. The CREATE + TABLE statement above is there to demonstrate the structure + of the table, not to show how the table is created. + + The actual storage for the full text index might be in the + original database file or it might be in auxiliary files. + That decision is made in the code that implements the virtual + table and is transparent to the user. + + Add documents to the index by issuing SQL statements like + this: + + INSERT INTO ex1(doctxt) VALUES(:docbody); + + Because EX1.DOCID is an integer primary key, a new value + for the document id is assigned automatically and is accessible + using sqlite3_last_insert_rowid() in the usual way. + + The EX1.SCORE, EX1.SNIPPET, and EX1.PATTERN columns are + intended for reading, not writting. Attempts to write + anything into those columns (other than NULL) will result + in a constraint error. + + The full text index may or may not store the original document + text. If the original document is not stored, then attempts + to read the EX1.DOCTEXT column will return a NULL. + + The text segmenter, stemmer, and the stop-word list are + all built into the indexer. To use a different segmenter, + stemmer, or stop-word list, register a different indexer + object. If you have a situation where you what multiple + segmenters to be used in a single index, then you can + define an additional column in the virtual table that + specifies which segmenter to use. For example: + + CREATE TABLE ex2( + docid INTEGER PRIMARY KEY, + doctext TEXT, + doctype TEXT, -- Specifies which segmenter to use + score REAL, + snippet TEXT, + pattern TEXT + ); + + Then + + INSERT INTO ex2(doctype,doctext) VALUES('html', :docbody); + + The knowledge of what segmenter names are valid will + still have to be built into the virtual table object. + If an unknown document type is specified, the virtual + table object will raise an exception on the insert. + + This concept of using a table as an interface instead of + as bulk storage is unsettling to some people at first. But + the more you think about it, the more the concept will grow + on you. + + Documents can be removed from the index using + + DELETE FROM ex1 WHERE docid=:id + + To search for documents, do a select with a WHERE clause + constraint on the EX1.PATTERN column, like this: + + SELECT docid FROM ex1 + WHERE pattern='"George Washington" slept here'; + + The EX1.PATTERN column does not represent anything that + is actually stored. This is just a way to specify the search + pattern. If you query for the EX1.PATTERN column you will + always get back a NULL. The EX1.PATTERN column is really + only useful in WHERE clause constraints. In any other context + it always appears to be NULL. + + The string that EX1.PATTERN is compared against will be + parsed by the indexer. In this case, we have the two + keywords George and Washington in double-quotes, indicating + that this is a phrase search. Only documents where these + words occur together will match. Other kinds of search + syntaxes can be used. It is up to the indexer implementation + to parser and interpret the search pattern. To SQLite, + the pattern is just a string. + + The EX1 table has some unusual properties (such as the + strange behavior of the PATTERN column) but in other respects + it can be used like any other table. The EX1 table can appear + in a JOIN or a subquery just like any other table. And you + can create VIEWs on the EX1 table. But you cannot DROP the + EX1 table or create additional indices on it. + + Additional information about the search result is available + from the SCORE and SNIPPET columns. The SCORE is some kind + of score attached to the pattern match and SNIPPET is a + context snippet for the match. These are not real columns. + Both are computed only if requested. But they can be used + like any other column. Consider this example: + + SELECT docinfo.title, ex1.snippet, docinfo.link + FROM ex1 JOIN docinfo USING(docid) + WHERE ex1.pattern='"George Washington" slept here' + AND ex1.score>0.5 + ORDER BY docinfo.date DESC, ex1.score DESC + LIMIT 20; + + All of this is implemented using a the VirtualTables interface + to SQLite.