Reduce overhead of bulk inserts
As a test, I modified fts2.c to collect multiple inserts with increasing rowids into in-memory structures, then flush them out as a single level-0 segment. When it flushes every 100 documents, performance is increased by up to 40%. This is even better than the gains from doing bulk inserts in the first place.
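As a sketch of what that buffering might look like (PendingDoc and flushPendingSegment are made-up names, not existing fts2 code):

  #include <sqlite3.h>

  #define FLUSH_EVERY 100            /* documents buffered before each flush */

  typedef struct PendingDoc {        /* hypothetical buffered document */
    sqlite3_int64 iDocid;            /* docids must arrive in increasing order */
    /* tokenized terms, positions, offsets, ... */
  } PendingDoc;

  static PendingDoc aPending[FLUSH_EVERY];
  static int nPending = 0;

  /* Hypothetical: write the buffered documents out as one level-0 segment. */
  static int flushPendingSegment(PendingDoc *aDoc, int nDoc){
    /* build leaf/interior nodes from aDoc[0..nDoc-1], append to %_segments */
    return SQLITE_OK;
  }

  static int bufferedInsert(const PendingDoc *pDoc){
    aPending[nPending++] = *pDoc;
    if( nPending<FLUSH_EVERY ) return SQLITE_OK;
    nPending = 0;
    return flushPendingSegment(aPending, FLUSH_EVERY);
  }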
Prefix-encode first terms
At this time, interior and leaf nodes fully-encode their first terms, and prefix-encode later terms. If we passed the key which decided which child to take down from the interior node, we could prefix-encode the first term against that. Thus, if we have runs of very long terms, we could save the initial hit from fully-encoding the term.
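A sketch of that encoding, assuming the parent's separator term gets passed down (writeVarint stands in for fts2's putVarint, and the buffer handling is simplified):

  #include <string.h>

  /* Minimal varint writer for the sketch. */
  static int writeVarint(unsigned char *p, unsigned int v){
    int n = 0;
    do{
      p[n++] = (unsigned char)((v & 0x7f) | (v>0x7f ? 0x80 : 0));
      v >>= 7;
    }while( v );
    return n;
  }

  /* Prefix-encode a node's first term against the separator term the parent
  ** used to route us to this node.  On the leftmost arc there is no such
  ** term, so zParent would be empty and this degenerates to the current
  ** full encoding (the problem discussed below). */
  static int encodeFirstTerm(unsigned char *pOut,
                             const char *zParent, int nParent,
                             const char *zTerm, int nTerm){
    int nPrefix = 0, n = 0;
    while( nPrefix<nParent && nPrefix<nTerm && zParent[nPrefix]==zTerm[nPrefix] ){
      nPrefix++;
    }
    n += writeVarint(pOut+n, (unsigned)nPrefix);
    n += writeVarint(pOut+n, (unsigned)(nTerm-nPrefix));
    memcpy(pOut+n, zTerm+nPrefix, (size_t)(nTerm-nPrefix));
    return n + (nTerm-nPrefix);
  }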
The main flaw is handling the leftmost term. The first child of an interior node has no term in the interior node. For interior nodes which are themselves not leftmost in their parents, the term would be the one passed down from the parent. But for the leftmost arc in the tree, there is no such term.
A solution would be to store the leftmost term of the tree at the root (in the segment directory). If this term is large, it could cause fragmentation of the segment directory, which would be negative for the entire tree. A solution to that would be to cap the length. Either way, it's possible/likely that most of the term will never be used in the first place, because it won't share much prefix with the first term of the next leaf node (and even less as you rise through the tree).
The best solution I've thought of would be to push the shared prefix from the first two terms of a node up. Long terms would thus be equitably distributed across the leftmost arc of the tree. This has the downside of making encoding much more complicated, since you can't encode the first term until you know the shared prefix from the parent node, etc. Probably not worthwhile at this time, but may be worthwhile if/when we start using a separate term directory (see a later idea on this list).
Integrate fts2 storage with sqlite btree
Convert %_segments to have fields for level, idx, and term0, and pull term0 out of the leaf nodes. In this form, interior nodes disappear entirely, as does the segment directory. Segment merges would be pretty much unchanged. This could be a substantial improvement when implementing incremental merging, since it removes a layer of management from the system. The main concern is whether the additional btree will slow things down too much - the test immediately before fts2 used parallel btrees, and was still 10x slower than fts2 ended up.
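Roughly, the schema shift would be something like this (column names illustrative, current schema simplified):

  /* Today: blocks keyed only by rowid; a separate segment directory and
  ** the interior nodes describe where things live. */
  static const char zCurrentSchema[] =
    "CREATE TABLE %_segments(block BLOB);";

  /* Proposed: let the sqlite btree do the interior-node work by keying
  ** each leaf block on (level, idx, term0). */
  static const char zProposedSchema[] =
    "CREATE TABLE %_segments("
    "  level INTEGER,"     /* segment level */
    "  idx INTEGER,"       /* which segment within the level */
    "  term0 TEXT,"        /* first term on this leaf block */
    "  block BLOB,"        /* leaf data, with term0 pulled out */
    "  PRIMARY KEY(level, idx, term0)"
    ");";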
Base docids per segment
Delta-encoding of docids kicks in strongly for longer doclists, but doesn't help much for short doclists. In the limit, a single-document segment for a high docid will spend much more overhead to encode identical docids. A solution would be to have a per-segment base docid which all doclists in that segment delta-encode from. For a single-document segment, this would make all leading docids 0, which could then be omitted.
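For a feel of the numbers (varintLen mirrors the usual 7-bits-per-byte encoding):

  /* Bytes needed to encode v as a varint. */
  static int varintLen(long long v){
    int n = 1;
    while( v>=128 ){ v >>= 7; n++; }
    return n;
  }

  /* A single-document segment for docid 12,345,678 currently spends
  ** varintLen(12345678) == 4 bytes on the leading docid of every doclist.
  ** With a per-segment iBaseDocid of 12,345,678 stored once, each leading
  ** docid becomes varintLen(0) == 1 byte, and an all-zero leading docid
  ** could be omitted entirely. */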
This has greater impact on segments with few documents. As segments grow in size, the average delta of leading docids from the base docid will increase. Also, this gain is naturally limited by how many doclists are in the segment in the first place. It's not entirely clear that this would be worth doing, given the complexities involved.
A variant of this idea would be to use segment-local docids. This has the additional advantage that the mapping could order documents descending by their term counts, so that documents with many terms would naturally encode with fewer varint bytes.
Term directory for frequency and other info
At some point, we're likely to add certain per-term information so that we can optimize queries. For instance, we might like to order AND merges from smallest doclist to largest. This may also allow for more efficient encoding. The term dictionary could be encoded similar to the current btree, but the delta-encoding should work better due to having longer runs of terms together. The leaf-node encoding would only need to store the termid of the first term, and the others can be assumed to increment, meaning leaves would store straight doclist data.
The gain from this is probably minimal, because it only helps if leaf nodes are very frequently broken, requiring full terms to be encoded. That doesn't seem to happen frequently relative to the absolute number of terms. This probably shows better gains as segments grow in size, due to standalone leaf nodes and greater density of terms. Regardless, I suspect that encoding frequency information may be more-or-less free.
Incremental merges
Except for level 0, when any level fills up we don't need to immediately merge it to the next level. We just need to have finished that merge before the next incoming merge hits. This means that when merging the 4kdoc segments into a 64kdoc segment, we can split it across 4k incremental steps.
Of course, that's probably not really reasonable, since it implies we're only processing a handful of terms per pass. Instead, we could have a setting describing how much effort per insert we're willing to spend on bookkeeping. Then we could split a big merge across a couple dozen inserts (or a couple hundred when doing the 64k->1M merge). Or it might be reasonable to do it in something like sqrt(N) steps: 256 steps for the merge to a 64kdoc segment, 1024 steps for the merge to a 1Mdoc segment.
NOTE: Given inline interior nodes, it might be reasonable to make the basic incremental merge increment be a single interior node's worth of leaves, in which case the incremental merge state is merely a stack of partial interior nodes. This always makes forward progress so long as single-document segments never require a full interior node.
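With that approach, the state that has to survive between steps could be as small as something like this (a sketch; the names are made up):

  #include <sqlite3.h>

  #define MAX_TREE_HEIGHT 16          /* far taller than any real segment */

  /* Sketch: state carried between incremental merge steps.  Each step emits
  ** one interior node's worth of leaves, so only the partially built
  ** interior node on each level above the leaves needs to be remembered. */
  typedef struct IncrMergeState {
    int iLevel;                       /* level being merged from */
    sqlite3_int64 iNextInputBlock;    /* where to resume reading input */
    struct {
      char *pData;                    /* partially built interior node */
      int nData;
    } aPartial[MAX_TREE_HEIGHT];      /* stack of partial interior nodes */
  } IncrMergeState;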
Background merges
If we were storing segments outside sqlite, then one solution for merging would be to run the merge in the background, while the foreground continues to use the old segments for queries. When the merge is complete, a transaction could update the segment dir, then the old data could be deleted. While the work done is the same, the merge is removed from blocking anything (always assuming it doesn't starve anyone for shared CPU or disk or memory resources).
I haven't been able to think of a way to phrase this within the sqlite database, but perhaps there's a way.
Developer influence on segment merges
Right now, segment merges are always forced events. So while the average might be a couple hundred inserts per second, that average includes some inserts which themselves take many seconds. That kind of variance isn't really acceptable.
One option would be an ability to tell the system to stop doing merges entirely until asked to continue. That has some pretty severe downsides, to my mind.
An option I like better is to retain the current system as the baseline, but allow the developer to ask us to do preemptive merges. So the lower merge levels will pretty much work as they do now (they have less variance anyhow), but the higher-level merges might happen earlier than otherwise. That would mean that so long as the developer keeps calling us, things will remain very tight, but if they forget, things will remain adequate.
Either way, this would probably manifest as a table function. Tuning parameters might also manifest as table functions.
Break doclists for a term across multiple leaf nodes
After loading in the first 64kdocs of the Enron dataset, with positions and offsets, the "enron" doclist is 3M. If this were extended to 1M docs, that could be expected to run to 50M. That's not acceptable.
Large doclists could be written to leaf nodes by introducing a new type of height-1 interior node. After encoding the first term, it would switch to encoding docid deltas. These deltas would let you find the subtree which contains hits for a given docid for this term.
This complicates the segment merging code quite a bit, but would allow us to control the maximum amount of data we're willing to consider at once. If we put, say, 50k of doclist data per leaf, a single interior node might be able to span 10M docs or so.
Also, this would allow queries to target specific docid ranges, useful if the optimizer has an infrequent term and a very frequent term. (The current code can't say "What are the hits for 'the' in docid X", it has to read the entire 'the' doclist.)
Note that querying this ends up looking a lot like doing a prefix query.
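A sketch of what such a node might hold once decoded (names made up; on disk the docids would be delta-encoded as described above):

  #include <sqlite3.h>

  /* Proposed height-1 node for an oversized doclist: after the term, it
  ** records the last docid covered by each child leaf, so a query can jump
  ** straight to the leaf that may contain a given docid. */
  typedef struct DoclistInteriorNode {
    sqlite3_int64 iStartBlockid;      /* blockid of the first child leaf */
    int nChild;
    sqlite3_int64 *aLastDocid;        /* aLastDocid[i]: last docid in child i */
  } DoclistInteriorNode;

  /* Return the index of the child leaf that may contain iDocid. */
  static int childForDocid(const DoclistInteriorNode *p, sqlite3_int64 iDocid){
    int i;
    for(i=0; i<p->nChild-1 && p->aLastDocid[i]<iDocid; i++){}
    return i;
  }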
Overnight I ran a callgrind profile of loading 100k docs from Enron. Two interesting numbers:
472,379,370,844  PROGRAM TOTALS
 91,726,422,328  fts2.c:getVarint
First, I have an improved merge strategy that converts an N^2 to an NlogN (by doing pairwise merges rather than using an accumulator). Both versions had a small constant, but it is an improvement. This drops about 1/4 of the getVarints (I believe the above already included this change).
Doing a full N-way merge function would additionally drop about half of the remaining overhead. That implies about 10%-15% gain.
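One way to realize the pairwise strategy (mergePair stands in for the real doclist merge and isn't defined here): instead of folding each new doclist into a single accumulator, which reprocesses the accumulated data over and over for ~N^2 total work, merge neighbors in rounds so each element is touched about log N times.

  typedef struct DocList DocList;
  DocList *mergePair(DocList *a, DocList *b);   /* stand-in, not defined here */

  /* Merge n doclists by halving the list each round. */
  static DocList *mergeAll(DocList **ap, int n){
    while( n>1 ){
      int i, j = 0;
      for(i=0; i+1<n; i+=2) ap[j++] = mergePair(ap[i], ap[i+1]);
      if( i<n ) ap[j++] = ap[i];                /* odd one carries forward */
      n = j;
    }
    return n ? ap[0] : 0;
  }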
Really interesting, though, would be to remove the need to actually decode. For instance, the current doclist elements are encoded like:
varint(iDocid) position and offset data varint(POS_END)
If, instead, they were encoded like:
varint(iDocid) varint(nPosData) position and offset data
then a merge could copy the position and offset data across without decoding it.
Beyond that, we could internally block doclists. When they exceed a threshold size, we could code them as:
varint(nData) varint(iEndDocidDelta) regular doclist encoding from here
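A reader could then seek within a big doclist without decoding the blocks it skips. A sketch, assuming that framing (readVarint mirrors fts2's getVarint):

  #include <sqlite3.h>

  static int readVarint(const unsigned char *p, sqlite3_int64 *pVal){
    sqlite3_int64 x = 0;
    int shift = 0, n = 0;
    do{
      x |= (sqlite3_int64)(p[n] & 0x7f) << shift;
      shift += 7;
    }while( p[n++] & 0x80 );
    *pVal = x;
    return n;
  }

  /* Return a pointer to the first block that may contain iTarget, skipping
  ** earlier blocks wholesale via their nData framing; 0 if past the end. */
  static const unsigned char *skipToBlock(const unsigned char *p,
                                          const unsigned char *pEnd,
                                          sqlite3_int64 iTarget){
    sqlite3_int64 iEndDocid = 0;
    while( p<pEnd ){
      const unsigned char *pBlock = p;
      sqlite3_int64 nData, iEndDelta;
      p += readVarint(p, &nData);
      p += readVarint(p, &iEndDelta);
      iEndDocid += iEndDelta;
      if( iEndDocid>=iTarget ) return pBlock;   /* decode just this block */
      p += nData;                               /* otherwise skip it whole */
    }
    return 0;
  }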
encode only the data needed to distinguish children in interior nodes
If we flush a leaf after "week", and the first term in the next leaf is "weekend", then the interior node would currently store "weekend". The trailing "nd" does not help distinguish things, though, so we could store just "weeke".
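Picking the separator is simple. A sketch, given the last term of the flushed leaf and the first term of the next leaf:

  /* Return how many bytes of zNext are needed to distinguish it from zPrev:
  ** the shared prefix plus one byte.  For "week" vs "weekend" this is 5,
  ** so the interior node stores "weeke" rather than "weekend". */
  static int shortestSeparator(const char *zPrev, int nPrev,
                               const char *zNext, int nNext){
    int n = 0;
    while( n<nPrev && n<nNext && zPrev[n]==zNext[n] ) n++;
    return n+1<nNext ? n+1 : nNext;
  }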
An additional trick would be to force a break when a leaf node is larger than a threshold, and we see a shared prefix of 0. Then we only need to encode the first letter of the next term in the interior node. If that tightened up the interior nodes enough, it would have potential to drop an entire layer from the tree, which would be a big win so long as it didn't create so many new leaf nodes that the sqlite btree added a layer.
This might also apply if there's only a single shared prefix byte. Or maybe there could be two thresholds.
encode nPrefix and nSuffix in a single byte
Average term length is pretty small (call it five), and the average prefix and suffix is obviously smaller than that. In the first 64kdoc segment from loading Enron, almost all prefixes were 15 and under. Since leaf nodes contain sorted data, the suffixes also tended to be small.
So, we could encode nPrefix<16 and nSuffix<8 in a single byte. If either is exceeded, we could fall back to no prefix and encode 128+nTerm (or, we could encode 0, then nPrefix and nSuffix). (No, don't do the latter, the 0 wastes space. If nPrefix is most likely of the two to exceed 128, encode nPrefix+128 then nSuffix, or vice versa. That way we're more likely to need only 3 bytes.)
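A sketch of that packing; the value returned is what gets written as a single varint, and the decode side just tests whether it is under 128:

  /* Pack nPrefix (4 bits) and nSuffix (3 bits) into one value <128 when
  ** they fit; otherwise fall back to "no prefix" and 128+nTerm, which is
  ** always >=128 and therefore unambiguous on decode. */
  static int prefixSuffixCode(int nPrefix, int nSuffix, int nTerm){
    if( nPrefix<16 && nSuffix<8 ) return nSuffix*16 + nPrefix;
    return 128 + nTerm;
  }

  static void decodePrefixSuffix(int v, int *pnPrefix, int *pnSuffix){
    if( v<128 ){
      *pnPrefix = v & 15;
      *pnSuffix = v >> 4;
    }else{
      *pnPrefix = 0;
      *pnSuffix = v - 128;      /* the full term length */
    }
  }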
This is only of moderate value, because we'll only save a maximum of a byte per term. But it's a free byte, and not too complicated at all to implement. Additionally, a while back I tested delta-encoding for interior nodes, and for most nodes it increased the space needed, because adjacent terms tended to have no shared prefix. It did save about 2% for the 64kdoc segment's height-1 interior nodes. This encoding might counter that.
UPDATE: A reasonable encoding is nSuffix*iScale+nPrefix when the result is <128 and nSuffix>0, nSuffix*128+nPrefix when nPrefix<128, and 0, nPrefix, nSuffix for other cases. Unfortunately, measurements show the results to be mixed. Very slight reduction in index size (0.4% to 1%). More interesting is that a certain number of the segment trees end up one level shorter, which I'm not sure how to quantify. Having a one-level-shorter btree 1% of the time is probably much more interesting than a 1% gain in storage size.
UPDATE: Another reasonable encoding would be to interleave the prefix and suffix into the even and odd bits, then encode that. This is pretty cute! It doesn't allow for variability in how many bits are allocated to the prefix versus the suffix, though, other than varying the ordering such that one item is granted 3 bits and the other 4. Might also be slower.
store interior nodes inline
Currently, for various reasons the leaf nodes must be a contiguous range of block ids. Thus, the interior nodes at height 1 are accumulated in memory, then flushed to the next range of blocks (and so on up the tree, though height 1 is obviously most interesting). The memory footprint is N/C, with C being really rather large. The 64kdoc segment in the Enron corpus has only 85 interior blocks total, implying a memory footprint under 170k. But the 1mdoc segment is going to need a couple megabytes.
Since the blocks encode their height in the first byte, it would be easy to have segment merges skip interior nodes which were stored inline. This would only be of slight impact on merge performance, since interior blocks are only perhaps half a percent of total blocks. Then interior nodes could just be flushed as needed.
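Since every block already starts with its height (leaves presumably carrying 0), the filter on the merge side is nearly trivial:

  /* Leaves have height 0 in their first byte; during a merge, any inline
  ** interior node (height > 0) can simply be skipped, since merges rebuild
  ** the interior structure anyway. */
  static int isLeafBlock(const unsigned char *pBlock, int nBlock){
    return nBlock>0 && pBlock[0]==0;
  }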
The problem is that interior nodes rely on their subtree nodes being contiguous. This would hold for the height 1 nodes (their leaf nodes would be contiguous), but the height 2 and above nodes would need to be encoded differently. The 64kdoc segment had 20k leaf nodes, implying that level-1 interior nodes would be about 230 blockids apart. Adding a delta-encoded blockid per term at level 2 and above would thus be perhaps 2 bytes per term, which might reduce the fanout there from 230 to 180 or so. That doesn't seem too bad.
smaller rowids in %_segments
The current system stores all blocks in a single %_segments table which is keyed by rowid. When we do a merge, the newly created blocks necessarily receive higher rowids than the blocks the merged data comes from, and those older blocks are eventually deleted. Since the new segment uses higher rowids, the deleted blocks' rowids cannot be reclaimed.
This can be finessed by using one segment table for segments in odd levels, another for even levels, and merging as soon as a level is full. For instance, if we have enough level-0 segments in the even table, we merge to a level-1 segment in the odd table, then delete the level-0 segments from the even table. Since they were at the highest rowids in the even table, those rowids can be reclaimed when we start generating more level-0 segments. The "merging as soon as full" clause means that when we merge level-1 segments into a level-2 segment, there won't be any higher rowids in the even table for level-0 segments.
This isn't very useful for segment encoding - we only refer to blockids in the segment dir, and at the front of each interior node, so even a large segment will only have a couple hundred blockids encoded. It might be more useful in increasing the btree density for the %_segments table. In experiments with the Enron corpus, it looks like the impact is about a 1/10 drop in max(rowid), plus another 1/2 from splitting across tables.