SQLite CVSTrac

*** 2,7 ****
--- 2,17 ----
  
  ----
  
+ *encode only the data needed to distinguish children in interior nodes*
+ 
+ If we flush a leaf after "week", and the first term in the next leaf is "weekend", then the interior node would currently store "weekend".  The trailing "nd" does not help distinguish things, though, so we could store just "weeke".
+ 
+ An additional trick would be to force a break when a leaf node is larger than a threshold and the next term shares no prefix with the previous one (a shared prefix of 0).  Then we only need to encode the first letter of the next term in the interior node.  If that tightened up the interior nodes enough, it could drop an entire layer from the tree, which would be a big win so long as it didn't create so many new leaf nodes that the sqlite btree added a layer.
+ 
+ This might also apply if there's only a single shared prefix byte.  Or maybe there could be two thresholds.
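A minimal sketch of the separator computation described above, assuming terms are compared bytewise and that the last term of the flushed leaf sorts strictly before the first term of the next leaf. The function name `separatorLength` and its signature are hypothetical, not taken from the fts2 source.

```c
#include <assert.h>

/* Hypothetical helper: return how many leading bytes of zNext an interior
** node must store to tell it apart from zPrev, the last term of the
** previous leaf.  Assumes zPrev sorts strictly before zNext. */
int separatorLength(const char *zPrev, int nPrev,
                    const char *zNext, int nNext){
  int n = nPrev<nNext ? nPrev : nNext;
  int i = 0;

  /* Count the bytes the two terms have in common. */
  while( i<n && zPrev[i]==zNext[i] ) i++;

  /* Because zPrev < zNext, zNext must have a byte at offset i. */
  assert( i<nNext );

  /* The shared prefix plus one distinguishing byte is enough. */
  return i+1;
}
```

For "week" followed by "weekend" this gives 5, so the interior node would hold "weeke". When the shared prefix is 0 the result is 1, which is the forced-break case where only the first letter needs to be stored.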
+ 
+ ----
+ 
  *encode nPrefix and nSuffix in a single byte*
  
  Average term length is pretty small (call it five), and the average prefix and suffix are obviously smaller than that.  In the first 64k-document segment from loading Enron, almost all prefixes were 15 or under.  Since leaf nodes contain sorted data, the suffixes also tended to be small.
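One way the single-byte idea could look, as a sketch only: it assumes four bits for each length, chosen because almost all observed prefixes were 15 or under. The names `packLengths` and `unpackLengths` are hypothetical, and how a reader would tell a packed byte from the fallback encoding is left open here.

```c
#include <assert.h>

/* Hypothetical: try to pack nPrefix and nSuffix into one byte, high nibble
** for the prefix and low nibble for the suffix.  Returns 1 on success, or
** 0 if either value exceeds 15 and the caller must fall back to whatever
** longer encoding is in use. */
int packLengths(int nPrefix, int nSuffix, unsigned char *pOut){
  assert( nPrefix>=0 && nSuffix>=0 );
  if( nPrefix>15 || nSuffix>15 ) return 0;
  *pOut = (unsigned char)((nPrefix<<4) | nSuffix);
  return 1;
}

/* Inverse of packLengths() for a byte known to be in packed form. */
void unpackLengths(unsigned char b, int *pnPrefix, int *pnSuffix){
  *pnPrefix = b>>4;
  *pnSuffix = b & 0x0f;
}
```

With a prefix of 3 and a suffix of 4 the packed byte is 0x34, replacing two separate length bytes with one.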

sqlite - Fts Two Notes