*** 2,7 ****
--- 2,21 ----
  
  ----
  
+ *Break doclists for a term across multiple leaf nodes*
+ 
+ After loading the first 64k docs of the Enron dataset, with positions and offsets, the "enron" doclist is 3MB.  Extrapolating linearly to 1M docs, that could be expected to run to roughly 50MB.  That's not acceptable.
+ 
+ Large doclists could be written across multiple leaf nodes by introducing a new type of height-1 interior node.  After encoding the first term, such a node would switch to encoding docid deltas.  These deltas would let you find the subtree which contains the hits for a given docid for this term.
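+ 
+ As a rough sketch (this is not the actual on-disk node format; the struct names and the linear scan are purely illustrative), such a height-1 interior node might decode to something like:
+ 
+ <html><pre>
+ /* Hypothetical in-memory form of a height-1 interior node that splits
+ ** one term's doclist across several leaf blocks.  On disk the start
+ ** docids would be stored as varint deltas following the term. */
+ #include "sqlite3.h"
+ 
+ typedef struct DoclistChild DoclistChild;
+ struct DoclistChild {
+   sqlite3_int64 iStartDocid;   /* first docid covered by this leaf */
+   sqlite3_int64 iBlockid;      /* block holding this slice of the doclist */
+ };
+ 
+ typedef struct DoclistInterior DoclistInterior;
+ struct DoclistInterior {
+   const char *zTerm;           /* the term, encoded once */
+   int nChild;                  /* number of leaf blocks */
+   DoclistChild *aChild;        /* children, ordered by iStartDocid */
+ };
+ 
+ /* Return the blockid whose leaf could contain hits for iDocid, or -1. */
+ static sqlite3_int64 childForDocid(DoclistInterior *p, sqlite3_int64 iDocid){
+   int i;
+   for(i=p->nChild-1; i>=0; i--){
+     if( p->aChild[i].iStartDocid<=iDocid ) return p->aChild[i].iBlockid;
+   }
+   return -1;
+ }
+ </pre></html>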
+ 
+ This complicates the segment-merging code quite a bit, but it would allow us to control the maximum amount of data we're willing to consider at once.  If we put, say, 50KB of doclist data per leaf, a single interior node might be able to span 10M docs or so.
+ 
+ Also, this would allow queries to target specific docid ranges, which is useful if the optimizer is intersecting an infrequent term with a very frequent term.  (The current code can't ask "What are the hits for 'the' in docid X?"; it has to read the entire 'the' doclist.)
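+ 
+ Continuing the hypothetical sketch above (all names here are made up; a leaf is modeled as a plain sorted array of docids, where in reality it would be up to ~50KB of delta-encoded doclist that still has to be scanned), the optimizer could probe the frequent term for one docid like this:
+ 
+ <html><pre>
+ typedef struct LeafDocids LeafDocids;
+ struct LeafDocids {
+   int nDocid;                  /* number of docids decoded from one leaf */
+   sqlite3_int64 *aDocid;       /* the docids, in ascending order */
+ };
+ 
+ /* Does the (frequent) term have a hit for iDocid?  Only the one leaf
+ ** named by the interior node is loaded, via the caller-supplied
+ ** xLoadLeaf, rather than the term's entire doclist.  The loader owns
+ ** (or caches) the decoded leaf. */
+ static int termHasDocid(
+   DoclistInterior *pFrequent,
+   sqlite3_int64 iDocid,
+   LeafDocids *(*xLoadLeaf)(sqlite3_int64 iBlockid)
+ ){
+   int i;
+   LeafDocids *pLeaf;
+   sqlite3_int64 iBlockid = childForDocid(pFrequent, iDocid);
+   if( iBlockid<0 ) return 0;
+   pLeaf = xLoadLeaf(iBlockid);
+   for(i=0; i<pLeaf->nDocid; i++){
+     if( pLeaf->aDocid[i]==iDocid ) return 1;
+   }
+   return 0;
+ }
+ </pre></html>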
+ 
+ Note that querying this ends up looking a lot like doing a prefix query.
+ 
+ ----
+ 
  Overnight I ran a callgrind profile of loading 100k docs from Enron.  Two interesting numbers:
  
  <html><pre>