Small. Fast. Reliable.
Choose any three.
The Shadow Pager is a design for a replacement to pager.c -- the objective is to extend SQLite with the ability to run read and write transactions concurrently. In particular, this design provides the following:

- Read transactions are never blocked by any other transactions

- Write transactions are never blocked by Read transactions

So, the only blocking that takes place is that Write transactions block other Write transactions.

The cost of this new freedom for read transactions is data duplication. Write transactions always write to free disk pages, so the written data cannot interfere with read transactions. The space taken by the old version(s) of the data cannot be reclaimed until the read transaction(s) begun before the new versions were committed complete. So, the amount of data duplication is application dependent, but is expected to be small in practice.

Note that since the prior version of all written data remains in the database, there is no need for a journal file. So in the best case, with quick read transactions and therefore low data duplication, the shadow pager will use the same disk space as the present journaling SQLite implementation.

See the attached pdf for more details.

If Tatu Ylonen's web page is unavailable, a cached version of his referenced paper is available here.

An archived copy of the shadows project web pages and the various papers referred to in this proposal can be found at: http://web.archive.org/web/20030622220243/http://www.cs.hut.fi/~ylo/shadows/shadows.html


By drh on 2004-Apr-14:
I have three issues with the shadow pager idea, in no particular order:

  1. The algorithm is less than 17 years old. That means that for all I know, it could be patented. Perhaps there is a submarine patent. Who knows. I prefer not to expose SQLite to patent risk. The current implementation uses 17+ year old technology exclusively.

  2. The shadow pager assumes that a one-sector disk write is an atomic operation. Is that true of modern disks? I don't know but I'm guessing not. When a power failure occurs, the voltage to the disk controller begins to decay. At some point, the voltage drops below the point where the disk controller can function. If that point happens to be when the first 1/3rd of the sector has been written, it seems to me that the next 2/3rds of the second would be unchanged, or perhaps overwritten with garbage. The current implementation might lose the entire database file if a power lose occurs during the write of an i-node on an unjournalled file system, but on a filesystem with at least meta-data journalling, a partial sector write does not corrupt the database, as far as I am aware.

  3. Unless I missed something, the shadow pager allows a single write to occur while a read is ongoing, but not two consecutive writes. For example, the following is allowed:

                         |-- write ---|
             |--------------------------- read ----------------------|

                         |-- write ---|      |--- write ---|
             |--------------------------- read ----------------------|

Ben Carlyle's proposal (or variations thereof) allows a write operation to start while reads are still ongoing. The write cannot commit until all the reads finish. But this still gives you some of what the shadow pager offers. Specifically:

                         |-- write ---|
             |-------- read -------|

The current implemention requires that all reads end before the write begins. Like this:

                           |-- write ---|
             |--- read ---|

The current plan is to stay conservative and implement a modified version of the Carlyle proposal in version 3.0.

Attachments:

  • shadow.pdf 55188 bytes added by anonymous on 2004-Mar-31 03:50:15 UTC.