*** 2,7 ****
--- 2,17 ----
  ----
+ *encode nPrefix and nSuffix in a single byte*
+ 
+ Average term length is pretty small (call it five), and the average prefix and suffix are obviously smaller than that. In the first 64kdoc segment from loading Enron, almost all prefixes were 15 and under. Since leaf nodes contain sorted data, the suffixes also tended to be small.
+ 
+ So, we could encode nPrefix<16 and nSuffix<8 in a single byte. If either is exceeded, we could fall back to no prefix and encode 128+nTerm (or, we could encode 0, then nPrefix and nSuffix).
+ 
+ This is only of moderate value, because we'll save at most a byte per term. But it's a free byte, and not at all complicated to implement.
+ 
+ ----
+ 
  *store interior nodes inline*
  
  Currently, for various reasons the leaf nodes must be a contiguous range of block ids. Thus, the interior nodes at height 1 are accumulated in memory, then flushed to the next range of blocks (and so on up the tree, though height 1 is obviously the most interesting). The memory footprint is N/C, with C being really rather large. The 64kdoc segment in the Enron corpus has only 85 interior blocks total, implying a memory footprint under 170k. But the 1mdoc segment is going to need a couple of megabytes.
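
A minimal sketch of the single-byte encoding described above (4 bits of prefix, 3 bits of suffix), using the "encode 0, then nPrefix and nSuffix" fallback. The function names and varint helpers are illustrative stand-ins rather than the real fts code, and it assumes terms within a node are distinct, so nSuffix is never 0 and the byte value 0 is free to act as the escape:

    /* Illustrative only: compact per-term header of (nPrefix<<3)|nSuffix,
    ** falling back to 0 followed by two varints when either field is too
    ** large.  Assumes nSuffix >= 1, so 0 never occurs in the compact form.
    */
    #include <assert.h>
    #include <stdio.h>

    typedef unsigned char u8;

    /* Toy varint: 7 bits per byte, high bit set means "more bytes follow". */
    static int putVarint(u8 *p, int v){
      int n = 0;
      do{
        p[n++] = (u8)((v & 0x7f) | 0x80);
        v >>= 7;
      }while( v );
      p[n-1] &= 0x7f;
      return n;
    }
    static int getVarint(const u8 *p, int *pv){
      int n = 0, shift = 0, v = 0;
      u8 c;
      do{
        c = p[n++];
        v |= (int)(c & 0x7f) << shift;
        shift += 7;
      }while( c & 0x80 );
      *pv = v;
      return n;
    }

    /* Encode the per-term header; returns bytes written. */
    static int putTermHeader(u8 *p, int nPrefix, int nSuffix){
      assert( nSuffix>0 );
      if( nPrefix<16 && nSuffix<8 ){
        p[0] = (u8)((nPrefix<<3) | nSuffix);   /* always in 1..127 */
        return 1;
      }else{
        int n = 0;
        p[n++] = 0;                            /* escape marker */
        n += putVarint(p+n, nPrefix);
        n += putVarint(p+n, nSuffix);
        return n;
      }
    }

    /* Decode the per-term header; returns bytes consumed. */
    static int getTermHeader(const u8 *p, int *pnPrefix, int *pnSuffix){
      if( p[0] ){
        *pnPrefix = p[0]>>3;
        *pnSuffix = p[0]&0x07;
        return 1;
      }else{
        int n = 1;
        n += getVarint(p+n, pnPrefix);
        n += getVarint(p+n, pnSuffix);
        return n;
      }
    }

    int main(void){
      u8 buf[16];
      int nPrefix, nSuffix, n;

      n = putTermHeader(buf, 12, 3);         /* common case: one byte */
      getTermHeader(buf, &nPrefix, &nSuffix);
      printf("compact: %d byte(s), prefix=%d suffix=%d\n", n, nPrefix, nSuffix);

      n = putTermHeader(buf, 40, 200);       /* rare case: escape + varints */
      getTermHeader(buf, &nPrefix, &nSuffix);
      printf("escape:  %d byte(s), prefix=%d suffix=%d\n", n, nPrefix, nSuffix);
      return 0;
    }

In the common case this costs one byte per term instead of two varints, and the escape path costs at most one extra byte over the current scheme, which matches the "free byte" observation above.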