Small. Fast. Reliable.
Choose any three.
As the fts2 code was developed, I've been keeping notes about additional modifications to improve performance, or allow further scaling, or make things smaller. Some of the ideas are sort of crazy. What follows is mostly a brain-dump, and should not be taken to be specific development paths.


smaller rowids in %_segments

The current system stores all blocks in a single %_segments table which is keyed by rowid. When we do a merge, the newly created blocks are necessarily receiving a higher rowid than the blocks the merged data is coming from, which blocks are eventually deleted. Since the new segment uses higher rowids, the deletes rowids cannot be reclaimed.

This can be finessed by using one segment table for segments in odd levels, another for even levels, and merging as soon as a level is full. For instance, if we have enough level-0 segments in the even table, we merge to a level-1 segment in the odd table, then delete the level-0 segments from the even table. Since they were at the highest rowids in the even table, those rowids can be reclaimed when we start generating more level-0 segments. The "merging as soon as full" clause means that when we merge level-1 segments into a level-2 segment, there won't be any higher rowids in the even table for level-0 segments.

This isn't very useful for segment encoding - we only refer to blockids in the segment dir, and at the front of each interior node, so even a large segment will only have a couple hundred blockids encoded. It might be more useful in increasing the btree density for the %_segments table. In experiments with the Enron corpus, it looks like the impact is about a 1/10 drop in max(rowid), plus another 1/2 from splitting across tables.