*** 6,12 ****
  Average term length is pretty small (call it five), and the average prefix and suffix is obviously smaller than that. In the first 64kdoc segment from loading Enron, almost all prefixes were 15 and under. Since leaf nodes contain sorted data, the suffixes also tended to be small.

! So, we could encode nPrefix<16 and nSuffix<8 in a single byte. If either is exceeded, we could fall back to no prefix and encode 128+nTerm (or, we could encode 0, then nPrefix and nSuffix).

  This is only of moderate value, because we'll only save a maximum of a byte per term. But it's a free byte, and not too complicated at all to implement.

  Additionally, a while back I tested delta-encoding for interior nodes, and for most nodes it *increased* the space needed, because adjacent terms tended to have no shared prefix. It did save about 2% for the 64kdoc segment's height-1 interior nodes. This encoding might counter that.
--- 6,12 ----
  Average term length is pretty small (call it five), and the average prefix and suffix is obviously smaller than that. In the first 64kdoc segment from loading Enron, almost all prefixes were 15 and under. Since leaf nodes contain sorted data, the suffixes also tended to be small.

! So, we could encode nPrefix<16 and nSuffix<8 in a single byte. If either is exceeded, we could fall back to no prefix and encode 128+nTerm (or, we could encode 0, then nPrefix and nSuffix). (No, don't do the latter; the 0 wastes space. If nPrefix is the more likely of the two to exceed 128, encode nPrefix+128 then nSuffix, or vice versa. That way we're more likely to need only 3 bytes.)

  This is only of moderate value, because we'll only save a maximum of a byte per term. But it's a free byte, and not too complicated at all to implement.

  Additionally, a while back I tested delta-encoding for interior nodes, and for most nodes it *increased* the space needed, because adjacent terms tended to have no shared prefix. It did save about 2% for the 64kdoc segment's height-1 interior nodes. This encoding might counter that.
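
A minimal C sketch of the encoding the new text describes, assuming the obvious packing (nPrefix in the upper four bits of the low seven, nSuffix in the low three) and the "+128 on the first field" escape from the parenthetical. The varint helpers and the encodePrefixSuffix/decodePrefixSuffix names are generic stand-ins for illustration, not the actual FTS segment-writer routines.

  #include <assert.h>

  /* Generic little-endian base-128 varint: low 7 bits per byte, 0x80 as the
  ** continuation bit.  Stand-in for whatever varint routine the segment
  ** writer already uses. */
  static int putVarint(unsigned char *p, unsigned int v){
    int n = 0;
    while( v>=0x80 ){
      p[n++] = (unsigned char)((v & 0x7f) | 0x80);
      v >>= 7;
    }
    p[n++] = (unsigned char)v;
    return n;
  }

  static int getVarint(const unsigned char *p, unsigned int *pv){
    unsigned int v = 0;
    int n = 0, shift = 0;
    while( p[n] & 0x80 ){
      v |= (unsigned int)(p[n++] & 0x7f) << shift;
      shift += 7;
    }
    v |= (unsigned int)p[n++] << shift;
    *pv = v;
    return n;
  }

  /* Common case: nPrefix<16 and nSuffix<8 pack into one byte, which is
  ** always under 128.  Escape case: nPrefix+128 as a varint (always at
  ** least two bytes, so its first byte has the high bit set), then nSuffix
  ** as a plain varint. */
  static int encodePrefixSuffix(unsigned char *p, unsigned int nPrefix,
                                unsigned int nSuffix){
    if( nPrefix<16 && nSuffix<8 ){
      p[0] = (unsigned char)((nPrefix<<3) | nSuffix);
      return 1;
    }
    int n = putVarint(p, nPrefix+128);
    n += putVarint(p+n, nSuffix);
    return n;
  }

  static int decodePrefixSuffix(const unsigned char *p, unsigned int *pnPrefix,
                                unsigned int *pnSuffix){
    if( p[0]<0x80 ){
      *pnPrefix = p[0]>>3;
      *pnSuffix = p[0]&0x07;
      return 1;
    }
    int n = getVarint(p, pnPrefix);
    *pnPrefix -= 128;
    n += getVarint(p+n, pnSuffix);
    return n;
  }

  int main(void){
    unsigned char buf[10];
    unsigned int nPrefix, nSuffix;

    /* Common case: one byte. */
    assert( encodePrefixSuffix(buf, 12, 5)==1 );
    assert( decodePrefixSuffix(buf, &nPrefix, &nSuffix)==1 );
    assert( nPrefix==12 && nSuffix==5 );

    /* Escape case: two-byte varint for nPrefix+128, one byte for nSuffix. */
    assert( encodePrefixSuffix(buf, 300, 9)==3 );
    assert( decodePrefixSuffix(buf, &nPrefix, &nSuffix)==3 );
    assert( nPrefix==300 && nSuffix==9 );

    return 0;
  }

Since the packed byte is always under 128 and the escape's first varint byte always has the high bit set, the reader can tell the two cases apart from the first byte alone; putting the +128 on the field more likely to be large is what keeps the escape case at three bytes most of the time.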