As the fts2 code was developed, I kept notes about additional modifications to improve performance, allow further scaling, or make things smaller. Some of the ideas are sort of crazy. What follows is mostly a brain dump, and should not be taken as a set of specific development plans.


encode only the data needed to distinguish children in interior nodes

If we flush a leaf after "week", and the first term in the next leaf is "weekend", then the interior node would currently store "weekend". The trailing "nd" does not help distinguish things, though, so we could store just "weeke".
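A sketch of how the separator could be chosen (the helper name and calling convention here are made up, not actual fts2 code): take the shortest prefix of the new leaf's first term that still sorts after the old leaf's last term.

    /* Return the number of bytes of zNext (first term of the new leaf)
    ** that must be stored in the interior node so that the stored
    ** prefix still sorts after zPrev (last term of the leaf just
    ** flushed).  For zPrev="week", zNext="weekend" this returns 5,
    ** i.e. "weeke".
    */
    static int separatorLength(const char *zPrev, int nPrev,
                               const char *zNext, int nNext){
      int n = 0;
      while( n<nPrev && n<nNext && zPrev[n]==zNext[n] ) n++;
      /* One byte past the shared prefix suffices; since terms are
      ** distinct and sorted, zNext has at least n+1 bytes here.
      */
      return n+1;
    }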

An additional trick would be to force a break when a leaf node is larger than a threshold and the next term shares no prefix with the previous one. Then we only need to encode the first letter of the next term in the interior node. If that tightened up the interior nodes enough, it could drop an entire layer from the tree, which would be a big win so long as it didn't create so many new leaf nodes that the sqlite btree added a layer.

This might also apply if there's only a single shared prefix byte. Or maybe there could be two thresholds.
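The break heuristic might look something like this when deciding whether to start a new leaf (the threshold values and names are purely illustrative, not tuned or taken from fts2):

    /* Illustrative thresholds only. */
    #define LEAF_BREAK_HARD   2048  /* always break past this size      */
    #define LEAF_BREAK_SOFT0  1536  /* break here on a 0-byte prefix    */
    #define LEAF_BREAK_SOFT1  1800  /* break here on a 1-byte prefix    */

    /* Decide whether to start a new leaf before appending a term that
    ** shares nPrefix bytes with the previous term, given that the
    ** current leaf already holds nData bytes.
    */
    static int shouldBreakLeaf(int nData, int nPrefix){
      if( nData>=LEAF_BREAK_HARD ) return 1;
      if( nPrefix==0 && nData>=LEAF_BREAK_SOFT0 ) return 1;
      if( nPrefix==1 && nData>=LEAF_BREAK_SOFT1 ) return 1;
      return 0;
    }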


encode nPrefix and nSuffix in a single byte

Average term length is pretty small (call it five), and the average prefix and suffix are obviously smaller than that. In the first 64kdoc segment from loading Enron, almost all prefixes were 15 and under. Since leaf nodes contain sorted data, the suffixes also tended to be small.

So, we could encode nPrefix<16 and nSuffix<8 in a single byte. If either limit is exceeded, we could fall back to no prefix and encode 128+nTerm (or, we could encode a 0 byte, then nPrefix and nSuffix). (No, don't do the latter; the 0 wastes space. If nPrefix is the more likely of the two to exceed 128, encode nPrefix+128 then nSuffix, or vice versa. That way we're more likely to need only 3 bytes.)
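A sketch of what the encoder could look like, with a toy varint writer included so the fragment stands alone (the escape ordering shown - nPrefix+128 first, nSuffix second - is just one of the two orderings discussed above):

    /* Toy varint writer, 7 bits per byte, high bit set when more bytes
    ** follow.  Included only to make the sketch self-contained.
    */
    static int writeVarint(unsigned char *p, unsigned int v){
      int n = 0;
      while( v>0x7f ){
        p[n++] = (unsigned char)((v & 0x7f) | 0x80);
        v >>= 7;
      }
      p[n++] = (unsigned char)v;
      return n;
    }

    /* Encode the (nPrefix, nSuffix) header for one term.  If both fit,
    ** emit a single byte: nPrefix in the low 4 bits, nSuffix in the
    ** next 3 bits, high bit clear (so the value is always <128).
    ** Otherwise emit nPrefix+128 as a varint (always >=128, so a
    ** reader can tell the two forms apart from the first byte),
    ** followed by nSuffix as a varint.
    */
    static int writeTermHeader(unsigned char *p, int nPrefix, int nSuffix){
      int n;
      if( nPrefix<16 && nSuffix<8 ){
        p[0] = (unsigned char)(nPrefix | (nSuffix<<4));
        return 1;
      }
      n = writeVarint(p, (unsigned int)nPrefix+128);
      n += writeVarint(p+n, (unsigned int)nSuffix);
      return n;
    }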

This is only of moderate value, because we'll only save a maximum of a byte per term. But it's a free byte, and not too complicated at all to implement. Additionally, a while back I tested delta-encoding for interior nodes, and for most nodes it increased the space needed, because adjacent terms tended to have no shared prefix. It did save about 2% for the 64kdoc segment's height-1 interior nodes. This encoding might counter that.


store interior nodes inline

Currently, for various reasons the leaf nodes must be a contiguous range of block ids. Thus, the interior nodes at height 1 are accumulated in memory, then flushed to the next range of blocks (and so on up the tree, though height 1 is obviously most interesting). The memory footprint is N/C, with C being really rather large. The 64kdoc segment in the Enron corpus has only 85 interior blocks total, implying a memory footprint under 170k. But the 1mdoc segment is going to need a couple megabytes.

Since the blocks encode their height in the first byte, it would be easy to have segment merges skip interior nodes which were stored inline. This would have only a slight impact on merge performance, since interior blocks are only perhaps half a percent of total blocks. Then interior nodes could just be flushed as needed.
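Since leaf blocks start with a height of 0, the merge-side filter is about one line (the function is hypothetical, shown only to illustrate the check):

    /* Return nonzero if a block is a leaf (height 0) and should be fed
    ** into the merge, or zero if it is an inline interior node to be
    ** skipped.  The height is encoded in the first byte of the block.
    */
    static int isLeafBlock(const unsigned char *pBlock, int nBlock){
      return nBlock>0 && pBlock[0]==0;
    }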

The problem is that interior nodes rely on their subtree nodes being contiguous. This would hold for the height 1 nodes (their leaf nodes would be contiguous), but the height 2 and above nodes would need to be encoded differently. The 64kdoc segment had 20k leaf nodes, implying that level-1 interior nodes would be about 230 blockids apart. Adding a delta-encoded blockid per term at level 2 and above would thus be perhaps 2 bytes per term, which might reduce the fanout there from 230 to 180 or so. That doesn't seem too bad.
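One way a height-2-and-above entry might be laid out, reusing the toy writeVarint from the earlier sketch and plain unsigned ints for blockids (all names here are illustrative, not the current fts2 format):

    #include <string.h>

    /* Write one interior-node entry for a height>=2 node once child
    ** blocks are no longer contiguous: the child blockid, delta-encoded
    ** against the previous entry's child, followed by the term.  At
    ** level 2 the deltas are roughly the level-1 node spacing (~230
    ** blocks in the 64kdoc segment), so they usually cost about two
    ** varint bytes per term.
    */
    static int writeInteriorEntry(unsigned char *p,
                                  unsigned int iChild,
                                  unsigned int iPrevChild,
                                  const char *zTerm, int nTerm){
      int n = writeVarint(p, iChild - iPrevChild);  /* new: blockid delta */
      n += writeVarint(p+n, (unsigned int)nTerm);
      memcpy(p+n, zTerm, (size_t)nTerm);
      return n + nTerm;
    }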


smaller rowids in %_segments

The current system stores all blocks in a single %_segments table which is keyed by rowid. When we do a merge, the newly created blocks necessarily receive higher rowids than the blocks the merged data is coming from, and those source blocks are eventually deleted. Since the new segment uses higher rowids, the deleted rowids cannot be reclaimed.

This can be finessed by using one segment table for segments in odd levels, another for even levels, and merging as soon as a level is full. For instance, if we have enough level-0 segments in the even table, we merge to a level-1 segment in the odd table, then delete the level-0 segments from the even table. Since they were at the highest rowids in the even table, those rowids can be reclaimed when we start generating more level-0 segments. The "merging as soon as full" clause means that when we merge level-1 segments into a level-2 segment, there won't be any higher rowids in the even table for level-0 segments.
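Picking the table by level parity could be as simple as this (the table names and helper are hypothetical; today fts2 has a single %_segments table):

    #include <stdio.h>

    /* Format the segment-table name for a given level under the
    ** odd/even split: levels 0, 2, 4, ... go to <table>_segments_even
    ** and levels 1, 3, 5, ... go to <table>_segments_odd, so a full
    ** level always merges into the other table and the deleted source
    ** rowids sit at the top of their own table, ready to be reclaimed.
    */
    static void segmentTableName(char *zOut, size_t nOut,
                                 const char *zTable, int iLevel){
      snprintf(zOut, nOut, "%s_segments_%s",
               zTable, (iLevel%2)==0 ? "even" : "odd");
    }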

This isn't very useful for segment encoding - we only refer to blockids in the segment dir, and at the front of each interior node, so even a large segment will only have a couple hundred blockids encoded. It might be more useful in increasing the btree density for the %_segments table. In experiments with the Enron corpus, it looks like the impact is about a 1/10 drop in max(rowid), plus another factor of two from splitting across tables.