*** 2,7 **** --- 2,50 ---- ---- + Overnight I ran a callgrind profile of loading 100k docs from Enron. Two interesting numbers: + + <html><pre> + 472,379,370,844 PROGRAM TOTALS + 91,726,422,328 fts2.c:getVarint + </pre></html> + + Ick. Adding up all of the related stuff, I estimate that about 30% of the total is in doclist decoding. That's actually good, it means there's room for improvement :-). + + First, I have an improved merge strategy that converts an N^2 to an NlogN (by doing pairwise merges rather than using an accumulator). Both versions had a small constant, but it is an improvement. This drops about 1/4 of the getVarints (I believe the above already included this change). + + Doing a full N-way merge function would additionally drop about half of the remaining overhead. That implies about 10%-15% gain. + + Really interesting, though, would be to remove the need to actually decode. For instance, the current doclist elements are encoded like: + + <html><pre> + varint(iDocid) + position and offset data + varint(POS_END) + </pre></html> + + This could instead be encoded as: + + <html><pre> + varint(iDocid) + varint(nPosData) + position and offset data + </pre></html> + + For short position lists, this would result in identical space usage and always save decoding overhead. For long position lists, this would increase space usage a bit, but would have much greater savings in overhead. I estimate that this would cut another order of magnitude out of the decoding overhead. + + Beyond that, we could internally block doclists. When they exceed a threshold size, we could code them as: + + <html><pre> + varint(nData) + varint(iEndDocidDelta) + regular doclist encoding from here + </pre></html> + + iEndDocidDelta would be added to the first docid to get the last docid from the next nData bytes. In many cases, this information is enough to let us skip the block. With 1k blocks, this would cut another 2 orders of magnitude from the decoding overhead. + + ---- + *encode only the data needed to distinguish children in interior nodes* If we flush a leaf after "week", and the first term in the next leaf is "weekend", then the interior node would currently store "weekend". The trailing "nd" does not help distinguish things, though, so we could store just "weeke".