*** 1,6 ****
  **Introduction**
  
! The module *fts2* is a new version of the SQLite FullTextIndex module.  At this time, usage is identical to *fts1*.  The data storage has been radically changed in a way which greatly improves insertion speed, at minor cost to query performance.
  
  *fts2* is in the SQLite CVS tree at =ext/fts2= (/ext/fts2/fts2.c).  In general, building, linking, and otherwise using *fts2* is identical to *fts1*, and will not be covered here.
--- 1,6 ----
  **Introduction**
  
! The module *fts2* is a new version of the SQLite FullTextIndex module.  At this time, usage is identical to *fts1*.  The data storage has been radically changed in a way which greatly improves insertion speed, at minor cost to query performance (which may yet become "at no cost"!).
  
  *fts2* is in the SQLite CVS tree at =ext/fts2= (/ext/fts2/fts2.c).  In general, building, linking, and otherwise using *fts2* is identical to *fts1*, and will not be covered here.
***************
*** 12,21 ****
  *Performance*
  
! *fts1* used a scheme of segmenting tokenized data into per-term doclists which were stored in a =%_terms= table.  The segmentation allowed newer data to be stored in smaller doclists in order to control update costs.
  
! *fts2* changes to a scheme where data is segmented into document groups, and updates occur by merging document groups together.  New documents form a singleton segment, which is stored as a series of blobs, which is very cheap.  *fts1* required a read-modify-update pass on hundreds (or thousands) of term doclists, which became very expensive as the number of documents grew.  Segments are periodically merged together to moderate the query cost (which rises proportional to the number of segments).  Term data within a segment is grouped together and written in sorted fashion, making segment merges reasonably cheap.  In a test which loads the Enron email corpus (1.4G of data across 517,431 documents), *fts1* required 13.5 hours, while *fts2* required 35 minutes.
  This was with pagesize=4096, synchronous=off, 100 inserts per transaction.
  
! See the top-of-file comment in =fts2.c= for an in-depth description of how things work.
--- 12,21 ----
  *Performance*
  
! *fts1* uses a scheme of segmenting tokenized data into per-term doclists which were stored in a =%_terms= table.  The segmentation allowed newer data to be stored in smaller doclists in order to control update costs.  Unfortunately, it means that document inserts at minimum require a read-modify-write against every term in the document, plus operations to merge term segments that grow too large.  Importantly, this meant that the cost to insert a new document was proportional to the number of tokens in the new document multiplied by the number of tokens already in the database.
  
! *fts2* changes to a scheme where data is segmented into document groups, and updates occur by merging document groups together.  New documents form a singleton segment, stored as a series of blobs, which is very cheap and does not degrade (much) over time.  Segments are periodically merged together to moderate the query cost (which rises proportional to the number of segments).  Term data within a segment is grouped together and written in sorted fashion, making segment merges reasonably cheap.  In a test which loads the Enron email corpus (1.4G of data across 517,431 documents), *fts1* required 13.5 hours, while *fts2* required 35 minutes.  This was with pagesize=4096, synchronous=off, 100 inserts per transaction.
  
! See the top-of-file comment in =fts2.c= for an in-depth description of how things operate at this time.
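The segmented-update scheme described in the new *Performance* text can be sketched as a toy in-memory model. This is an illustration only, not the actual *fts2* implementation (which stores segments as disk-resident blobs and is written in C); the names =ToyIndex= and =MERGE_THRESHOLD= are invented for the sketch. Each new document becomes its own tiny segment, so inserts never do read-modify-write against existing per-term doclists, and segments are periodically merged so query cost (one probe per segment) stays bounded:

```python
# Toy model of the fts2 segmented-index idea: each new document becomes
# its own small "segment" (a term -> doclist map), and segments are
# merged once enough of them accumulate.  Invented names; not fts2 code.

MERGE_THRESHOLD = 4  # merge whenever this many segments accumulate

class ToyIndex:
    def __init__(self):
        self.segments = []  # list of {term: [docids]} dicts

    def insert(self, docid, text):
        # A new document forms a singleton segment -- cheap, because no
        # existing per-term doclist needs a read-modify-write pass.
        seg = {}
        for term in text.lower().split():
            posting = seg.setdefault(term, [])
            if not posting or posting[-1] != docid:
                posting.append(docid)
        self.segments.append(seg)
        if len(self.segments) >= MERGE_THRESHOLD:
            self._merge_all()

    def _merge_all(self):
        # Term data within each segment is already grouped, so merging
        # is a single pass that concatenates doclists term by term.
        merged = {}
        for seg in self.segments:
            for term in sorted(seg):
                merged.setdefault(term, []).extend(seg[term])
        self.segments = [merged]

    def query(self, term):
        # Query cost rises with the segment count: every segment is probed.
        hits = []
        for seg in self.segments:
            hits.extend(seg.get(term, []))
        return sorted(hits)

idx = ToyIndex()
idx.insert(1, "the quick brown fox")
idx.insert(2, "the lazy dog")
idx.insert(3, "quick dogs and foxes")
idx.insert(4, "a dog barked")        # fourth segment triggers a merge
print(idx.query("quick"))            # -> [1, 3]
print(idx.query("dog"))              # -> [2, 4]
print(len(idx.segments))             # -> 1 (all segments merged)
```

The key point the sketch captures is the cost shift: insertion touches only the new document's own terms, while the merge work is amortized across many inserts, which is why bulk loading is so much faster than under the *fts1* per-term read-modify-write scheme.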