As the fts2 code has developed, I've been keeping notes about additional modifications to improve performance, allow further scaling, or make things smaller. Some of the ideas are sort of crazy. What follows is mostly a brain-dump, and should not be taken as a set of specific development plans.


Break doclists for a term across multiple leaf nodes

After loading the first 64k docs of the Enron dataset, with positions and offsets, the "enron" doclist is 3M. If this were extended to 1M docs, it could be expected to run to 50M. That's not acceptable.

Large doclists could be broken across multiple leaf nodes by introducing a new type of height-1 interior node. After encoding the first term, it would switch to encoding docid deltas. These deltas would let you find the subtree which contains the hits for a given docid for this term.

This complicates the segment merging code quite a bit, but would allow us to control the maximum amount of data we're willing to consider at once. If we put, say, 50k of doclist data per leaf, a single interior node might be able to span 10M docs or so.

Also, this would allow queries to target specific docid ranges, which is useful if the optimizer has an infrequent term and a very frequent term. (The current code can't ask "What are the hits for 'the' in docid X?"; it has to read the entire 'the' doclist.)

Note that querying this ends up looking a lot like doing a prefix query.
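
A rough sketch of how a reader might pick the child leaf for a given docid under this scheme. The layout details and the helper are hypothetical; getVarint() is assumed to behave like the existing fts2.c routine (returning the number of bytes it consumed):

  /* Hypothetical sketch: after the single term, the interior node holds one
  ** docid delta per child boundary, giving the last docid covered by each
  ** child leaf.  Walk the deltas until the running docid reaches iTarget;
  ** the number of deltas consumed is the child's offset from iFirstChild.
  */
  static sqlite_int64 leafForDocid(
    const char *pDeltas,          /* encoded docid deltas */
    int nChild,                   /* number of child leaves (>=1) */
    sqlite_int64 iFirstChild,     /* blockid of the first child leaf */
    sqlite_int64 iTarget          /* docid we want hits for */
  ){
    sqlite_int64 iLast = 0, iDelta;
    int i;
    for(i=0; i<nChild-1; i++){
      pDeltas += getVarint(pDeltas, &iDelta);
      iLast += iDelta;            /* last docid covered by child i */
      if( iTarget<=iLast ) break; /* child i's subtree may hold the hit */
    }
    return iFirstChild + i;       /* blockid of the child leaf to read */
  }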


Overnight I ran a callgrind profile of loading 100k docs from Enron. Two interesting numbers:

  472,379,370,844  PROGRAM TOTALS
   91,726,422,328  fts2.c:getVarint
Ick. Adding up all of the related stuff, I estimate that about 30% of the total is in doclist decoding. That's actually good; it means there's room for improvement :-).

First, I have an improved merge strategy that converts an O(N^2) process into O(N log N) (by doing pairwise merges rather than merging everything into an accumulator). Both versions have a small constant, but it is an improvement. This drops about 1/4 of the getVarint calls (I believe the numbers above already include this change).
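
For illustration, the pairwise strategy is just a tournament over the pending doclists. This is only a sketch; DocList and mergePair() stand in for the real fts2 types and two-doclist merge:

  /* Merge an array of doclists by repeated pairing.  Each doclist's data is
  ** copied O(log N) times instead of the accumulator being rewritten for
  ** every one of the N inputs, which is where the N^2 -> N log N win comes
  ** from.
  */
  static DocList *mergeAll(DocList **apIn, int nIn){
    while( nIn>1 ){
      int i, nOut = 0;
      for(i=0; i+1<nIn; i+=2){
        apIn[nOut++] = mergePair(apIn[i], apIn[i+1]);
      }
      if( i<nIn ) apIn[nOut++] = apIn[i];  /* odd doclist carries forward */
      nIn = nOut;
    }
    return nIn ? apIn[0] : 0;
  }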

Doing a full N-way merge would additionally drop about half of the remaining overhead. That implies a gain of about 10%-15% overall.

Really interesting, though, would be to remove the need to actually decode. For instance, the current doclist elements are encoded like:

  varint(iDocid)
  position and offset data
  varint(POS_END)
This could instead be encoded as:

  varint(iDocid)
  varint(nPosData)
  position and offset data
For short position lists, this would result in identical space usage and always save decoding overhead. For long position lists, this would increase space usage a bit, but would have much greater savings in overhead. I estimate that this would cut another order of magnitude out of the decoding overhead.
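
A sketch of what the length prefix buys: a docid-only scan can hop over the position data instead of decoding varints until POS_END (again assuming getVarint() returns the byte count, as in fts2.c):

  /* With varint(nPosData) up front, skipping a document's position and
  ** offset data is pointer arithmetic rather than a varint-by-varint walk.
  */
  static const char *skipPositionData(const char *p){
    sqlite_int64 nPosData;
    p += getVarint(p, &nPosData);  /* varint(nPosData) */
    return p + nPosData;           /* position/offset bytes stay encoded */
  }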

Beyond that, we could internally block doclists. When they exceed a threshold size, we could code them as:

  varint(nData)
  varint(iEndDocidDelta)
  regular doclist encoding from here
iEndDocidDelta would be added to the first docid to get the last docid from the next nData bytes. In many cases, this information is enough to let us skip the block. With 1k blocks, this would cut another 2 orders of magnitude from the decoding overhead.
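
As a sketch, a reader interested only in docids at or above some target could skip whole blocks like this. It assumes nData counts just the doclist bytes that follow, that docid deltas keep chaining across blocks, and the usual fts2.c getVarint() convention; none of that is settled:

  /* Advance past blocks whose last docid is below iTarget, leaving their
  ** contents undecoded.  Returns a pointer to the first block that might
  ** contain iTarget (or pEnd if none can).
  */
  static const char *skipToBlock(
    const char *p, const char *pEnd,     /* blocked doclist data */
    sqlite_int64 iPrevDocid,             /* docid preceding the first block */
    sqlite_int64 iTarget                 /* smallest docid of interest */
  ){
    while( p<pEnd ){
      sqlite_int64 nData, iEndDelta, iFirstDelta;
      const char *pBlock = p;
      p += getVarint(p, &nData);         /* varint(nData) */
      p += getVarint(p, &iEndDelta);     /* varint(iEndDocidDelta) */
      getVarint(p, &iFirstDelta);        /* peek the block's first docid delta */
      if( iPrevDocid+iFirstDelta+iEndDelta>=iTarget ){
        return pBlock;                   /* iTarget may be in this block */
      }
      iPrevDocid += iFirstDelta+iEndDelta;  /* last docid of the skipped block */
      p += nData;                        /* hop over the block body */
    }
    return pEnd;
  }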


encode only the data needed to distinguish children in interior nodes

If we flush a leaf after "week", and the first term in the next leaf is "weekend", then the interior node would currently store "weekend". The trailing "nd" does not help distinguish things, though, so we could store just "weeke".
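
A sketch of picking the shortest separator, keeping only enough of the new leaf's first term to make it sort strictly after the old leaf's last term (a hypothetical helper, not existing fts2 code):

  /* Return how many bytes of zNext (first term of the new leaf) are needed
  ** to distinguish it from zPrev (last term of the flushed leaf).  For
  ** zPrev="week", zNext="weekend" this returns 5, i.e. "weeke".
  */
  static int separatorLength(
    const char *zPrev, int nPrev,
    const char *zNext, int nNext
  ){
    int n = 0;
    while( n<nPrev && n<nNext && zPrev[n]==zNext[n] ) n++;
    /* n is the shared prefix length; one more byte of zNext breaks the tie
    ** (zNext sorts after zPrev, so either zNext[n]>zPrev[n] or zPrev is a
    ** strict prefix of zNext). */
    return n<nNext ? n+1 : nNext;
  }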

An additional trick would be to force a break when a leaf node is larger than a threshold, and we see a shared prefix of 0. Then we only need to encode the first letter of the next term in the interior node. If that tightened up the interior nodes enough, it would have potential to drop an entire layer from the tree, which would be a big win so long as it didn't create so many new leaf nodes that the sqlite btree added a layer.

This might also apply if there's only a single shared prefix byte. Or maybe there could be two thresholds.


encode nPrefix and nSuffix in a single byte

Average term length is pretty small (call it five), and the average prefix and suffix are obviously smaller than that. In the first 64kdoc segment from loading Enron, almost all prefixes were 15 and under. Since leaf nodes contain sorted data, the suffixes also tended to be small.

So, we could encode nPrefix<16 and nSuffix<8 in a single byte. If either limit is exceeded, we could fall back to no prefix and encode 128+nTerm (or we could encode 0, then nPrefix and nSuffix). (No, don't do the latter; the 0 wastes space. Instead, whichever of the two is more likely to exceed 128 gets encoded with +128 as the flag, followed by the other encoded plainly. That way we're more likely to need only 3 bytes.)
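
A sketch of the packed byte, including the +128 escape described above (for simplicity it always puts the flag on nPrefix; putVarint() is assumed to write its value and return the byte count, like the fts2.c routine):

  /* Pack nPrefix<16 and nSuffix<8 into one byte as (nPrefix<<3)|nSuffix,
  ** which is always under 128.  Otherwise write nPrefix+128 as a varint
  ** followed by nSuffix; since the escape value is >=128, its first byte
  ** has the high bit set, which is how a reader tells the two forms apart.
  */
  static int writePrefixSuffix(char *p, int nPrefix, int nSuffix){
    if( nPrefix<16 && nSuffix<8 ){
      p[0] = (char)((nPrefix<<3) | nSuffix);
      return 1;
    }else{
      int n = putVarint(p, nPrefix+128);
      return n + putVarint(p+n, nSuffix);
    }
  }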

This is only of moderate value, because we'll only save a maximum of a byte per term. But it's a free byte, and not too complicated at all to implement. Additionally, a while back I tested delta-encoding terms in interior nodes, and for most nodes it increased the space needed, because adjacent terms tended to have no shared prefix. It did save about 2% for the 64kdoc segment's height-1 interior nodes. This encoding might counter that.


store interior nodes inline

Currently, for various reasons the leaf nodes must occupy a contiguous range of block ids. Thus, the interior nodes at height 1 are accumulated in memory, then flushed to the next range of blocks (and so on up the tree, though height 1 is obviously the most interesting). The memory footprint is N/C, with C being really rather large. The 64kdoc segment in the Enron corpus has only 85 interior blocks total, implying a memory footprint under 170k. But the 1mdoc segment is going to need a couple of megabytes.

Since the blocks encode their height in the first byte, it would be easy to have segment merges skip interior nodes which were stored inline. This would only be of slight impact on merge performance, since interior blocks are only perhaps half a percent of total blocks. Then interior nodes could just be flushed as needed.
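
The check itself would be tiny; a sketch, assuming the height really is the leading byte of every block as described above:

  /* Blocks whose leading height byte is nonzero are interior nodes stored
  ** inline with the leaves; a segment merge can simply pass over them,
  ** since the leaves already carry all of the term and doclist data.
  */
  static int isLeafBlock(const char *pBlock, int nBlock){
    return nBlock>0 && pBlock[0]==0;
  }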

The problem is that interior nodes rely on their subtree nodes being contiguous. This would hold for the height 1 nodes (their leaf nodes would be contiguous), but the height 2 and above nodes would need to be encoded differently. The 64kdoc segment had 20k leaf nodes, implying that level-1 interior nodes would be about 230 blockids apart. Adding a delta-encoded blockid per term at level 2 and above would thus be perhaps 2 bytes per term, which might reduce the fanout there from 230 to 180 or so. That doesn't seem too bad.


smaller rowids in %_segments

The current system stores all blocks in a single %_segments table which is keyed by rowid. When we do a merge, the newly created blocks necessarily receive higher rowids than the blocks the merged data comes from, and those source blocks are eventually deleted. Since the new segment uses higher rowids, the deleted rowids cannot be reclaimed.

This can be finessed by using one segment table for segments in odd levels, another for even levels, and merging as soon as a level is full. For instance, if we have enough level-0 segments in the even table, we merge to a level-1 segment in the odd table, then delete the level-0 segments from the even table. Since they were at the highest rowids in the even table, those rowids can be reclaimed when we start generating more level-0 segments. The "merging as soon as full" clause means that when we merge level-1 segments into a level-2 segment, there won't be any higher rowids in the even table for level-0 segments.
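
A minimal sketch of the routing, with hypothetical table names; the point is just that, with the merge-as-soon-as-full rule, a merge always deletes the highest rowids in its source table:

  /* Segments at even levels live in one table, odd levels in the other.
  ** Merging level N (one table) into level N+1 (the other) deletes the
  ** source blocks, which were the source table's highest rowids, so those
  ** rowids can be reused by the next segments written at level N.
  */
  static const char *segmentsTableForLevel(int iLevel){
    return (iLevel & 1) ? "%_segments_odd" : "%_segments_even";
  }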

This isn't very useful for segment encoding - we only refer to blockids in the segment dir, and at the front of each interior node, so even a large segment will only have a couple hundred blockids encoded. It might be more useful in increasing the btree density for the %_segments table. In experiments with the Enron corpus, it looks like the impact is about a 1/10 drop in max(rowid), plus another 1/2 from splitting across tables.