*** 2,7 ****
--- 2,17 ----
  ----
+ *Base docids per segment*
+
+ Delta-encoding of docids pays off for longer doclists, but doesn't help much for short ones. In the limit, a single-document segment for a high docid spends considerable overhead encoding the identical docid in every doclist. A solution would be to have a per-segment base docid which all doclists in that segment delta-encode from. For a single-document segment, this would make all leading docids 0, which could then be omitted.
+
+ This has greater impact on segments with few documents. As segments grow in size, the average delta of leading docids from the base docid will increase. Also, this gain is naturally limited by how many doclists are in the segment in the first place. It's not entirely clear that this would be worth doing, given the complexities involved.
+
+ A variant of this idea would be to use segment-local docids. This has the additional advantage that the mapping could order documents descending by their term counts, so that documents with many terms would naturally encode with fewer varint bytes.
+
+ ----
+
  *Term directory for frequency and other info*

  At some point, we're likely to add certain per-term information so that we can optimize queries. For instance, we might like to order AND merges from smallest doclist to largest. This may also allow for more efficient encoding. The term dictionary could be encoded similarly to the current btree, but the delta-encoding should work better due to having longer runs of terms together. The leaf-node encoding would only need to store the termid of the first term, and the others can be assumed to increment, meaning leaves would store straight doclist data.
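
As a rough illustration of the per-segment base docid idea described above, the following Python sketch counts the bytes spent on docids in a segment, with and without a shared base. It is not taken from any existing implementation: the function names are hypothetical, the varint is assumed to carry 7 payload bits per byte, and the base is assumed to be the smallest docid in the segment, stored once. Leading deltas of 0 are still counted as one byte each here, so omitting them, as suggested above, would save more.

```python
def varint_len(value):
    """Bytes needed for a varint with 7 payload bits per byte (assumed format)."""
    if value < 0:
        raise ValueError("only non-negative integers can be encoded")
    length = 1
    while value >= 0x80:
        value >>= 7
        length += 1
    return length


def doclist_docid_bytes(docids, base=0):
    """Docid bytes for one doclist: the first docid is stored as a delta
    from `base`, each later docid as a delta from its predecessor."""
    total = 0
    prev = base
    for docid in sorted(docids):
        total += varint_len(docid - prev)
        prev = docid
    return total


def segment_docid_bytes(doclists, use_base):
    """Total docid bytes over all doclists in a segment.

    With use_base=True, the smallest docid in the segment is stored once
    as the base (hypothetical layout) and every doclist delta-encodes
    its leading docid from it.
    """
    if not use_base:
        return sum(doclist_docid_bytes(d) for d in doclists)
    base = min(min(d) for d in doclists)
    return varint_len(base) + sum(doclist_docid_bytes(d, base) for d in doclists)


if __name__ == "__main__":
    # A single-document segment: docid 1,000,000 appears in 50 doclists.
    single_doc_segment = [[1_000_000] for _ in range(50)]
    print(segment_docid_bytes(single_doc_segment, use_base=False))  # 150
    print(segment_docid_bytes(single_doc_segment, use_base=True))   # 53
```

In this example each docid costs 3 bytes without a base, against 1 byte per doclist plus 3 bytes for the base itself. The saving shrinks as a segment covers a wider range of docids, which matches the caveat above about larger segments.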