Small. Fast. Reliable.
Choose any three.
The Shadow Pager is a design for a replacement to pager.c -- the objective is to extend SQLite with the ability to run read and write transactions concurrently. In particular, this design provides the following:

- Read transactions are never blocked by any other transactions

- Write transactions are never blocked by Read transactions

So, the only blocking that takes place is that Write transactions block other Write transactions.

The cost of this new freedom for read transactions is data duplication. Write transactions always write to free disk pages, so the written data cannot interfere with read transactions. The space taken by the old version(s) of the data cannot be reclaimed until the read transaction(s) begun before the new versions were committed complete. So, the amount of data duplication is application dependent, but is expected to be small in practice.

Note that since the prior version of all written data remains in the database, there is no need for a journal file. So in the best case, with quick read transactions and therefore low data duplication, the shadow pager will use the same disk space as the present journaling SQLite implementation.

See the attached pdf for more details.

If Tatu Ylonen's web page is unavailable, a cached version of his referenced paper is available here.

An archived copy of the shadows project web pages and the various papers referred to in this proposal can be found at: http://web.archive.org/web/20030622220243/http://www.cs.hut.fi/~ylo/shadows/shadows.html


By drh on 2004-Apr-14:
I have three issues with the shadow pager idea, in no particular order:

  1. The algorithm is less than 17 years old. That means that for all I know, it could be patented. Perhaps there is a submarine patent. Who knows. I prefer not to expose SQLite to patent risk. The current implementation uses 17+ year old technology exclusively.

      Kent used these algorithms in 1985 [J Kent, H Garcia Molina, and J Chung "An experimental evalution of crash recovery mechanisms" In ACM PODS pages 113-121, 1985; J M Kent "Performance and Implementation Issues in Database Crash Recovery" PhD thesis, Princeton University, 1985] The ideas can be traced back through System R [J Gray, P McJones, M Blasgen, B Lindsay, R Lorie, T Price, F Putzolu, and I Traiger, "The recovery manager of the System R database manager", Computing Surveys 13(2), 1981] and [R A Lorie "Physical integrity in a large segmented database" ACM Transactions on Database Systems 2(1), 1977] -- e (2004-04-14)

  2. The shadow pager assumes that a one-sector disk write is an atomic operation. Is that true of modern disks? I don't know but I'm guessing not. When a power failure occurs, the voltage to the disk controller begins to decay. At some point, the voltage drops below the point where the disk controller can function. If that point happens to be when the first 1/3rd of the sector has been written, it seems to me that the next 2/3rds of the second would be unchanged, or perhaps overwritten with garbage. The current implementation might lose the entire database file if a power lose occurs during the write of an i-node on an unjournalled file system, but on a filesystem with at least meta-data journalling, a partial sector write does not corrupt the database, as far as I am aware.

      Assuming that an SQLite page is bigger than a disk sector (typically 512 bytes), and that disk sectors do not span SQLite pages (true if both are power of 2 sized) then any partially changed sector will only occur to a free page, or to page 1. I mention in the attached pdf that it might be a good idea to "write ahead log" writes to page 1. Note that if either of the above assumptions is not true, the present SQLite will have the same database corruption issues on power failure. -- e (2004-04-14)

  3. Unless I missed something, the shadow pager allows a single write to occur while a read is ongoing, but not two consecutive writes. For example, the following is allowed:

                         |-- write ---|
             |--------------------------- read ----------------------|

                         |-- write ---|      |--- write ---|
             |--------------------------- read ----------------------|

Ben Carlyle's proposal (or variations thereof) allows a write operation to start while reads are still ongoing. The write cannot commit until all the reads finish. But this still gives you some of what the shadow pager offers. Specifically:

                         |-- write ---|
             |-------- read -------|

The current implemention requires that all reads end before the write begins. Like this:

                           |-- write ---|
             |--- read ---|

The current plan is to stay conservative and implement a modified version of the Carlyle proposal in version 3.0.


By bbrandys on 2004-08-01:

My proposition : look at Firebird RDBMS at http://firebird.sourceforge.net and also http://www.ibphoenix.com
It is using multigenerational database engine for years . There are some technical FAQ articles on those sites, describing in detail something called here "shadow pager" (multiversioning database)


By kevin lyda on 2005-03-04:

two suggestions:

it would be nice to have gcc idiot check the _printf functions. there's an __attribute__ method to do this, but it gets confused by the %q and %Q format codes. w/o changes to gcc it's not possible to do. if i pestered the gcc folks - and i don't know how successful i'd be - would there be any interest in modifying the headerfiles to define these attributes:

    int sqlite_exec_printf(sqlite *db,
                           char *sqlfmt,
                           int (*xCallback)(void*,int,char**,char**),
                           void *pArg,
                           char **errmsg,
                           ...)  __attribute__((format(printf, 2, 6)));

it really does catch a fair number of silly and yet painful errors.

second suggestion:

i sometimes would like to log all my sql calls - but using my own logging functions. i'd also like it to be something i can turn on and off at run time. using version 2.8.15 i made the following change:

    --- sqlite/src/main.c   13 Jan 2005 10:54:45 -0000      1.1
    +++ sqlite/src/main.c   4 Mar 2005 13:01:45 -0000
    @@ -463,6 +463,9 @@
     const char sqlite_encoding[] = "iso8859";
     #endif

    +/* logger function */
    +void (*sqlite_logger_func)(const char *);
    +
     /*
     ** Open a new SQLite database.  Construct an "sqlite" structure to define
     ** the state of this database and return a pointer to that structure.
    @@ -635,6 +638,9 @@
       int nCallback;

       if( zSql==0 ) return SQLITE_OK;
    +  if (sqlite_logger_func)
    +      (*sqlite_logger_func)(zSql);
    +
       while( rc==SQLITE_OK && zSql[0] ){
         pVm = 0;
         rc = sqlite_compile(db, zSql, &zLeftover, &pVm, pzErrMsg);

    --- sqlite/src/sqlite.h.in      13 Jan 2005 10:54:46 -0000      1.1
    +++ sqlite/src/sqlite.h.in      4 Mar 2005 13:09:07 -0000
    @@ -59,6 +59,11 @@
     typedef struct sqlite sqlite;

     /*
    +** you can log all sql statements by assigning this variable a function
    +*/
    +extern void (*sqlite_logger_func)(const char *);
    +
    +/*
     ** A function to open a new sqlite database.
     **
     ** If the database does not exist and mode indicates write

Dan: Have you seen the similar sqlite3_trace() API in SQLite 3? Can't remember if that was in 2.8.15 or not. Probably not.
    Dan: ah, yes, i see it now. it is in 2.8.15. the docs don't mention it - though searching for trace in the source was a pretty obvious step.

    the format string change would still be nice. let me know if there's interest in it.


SQLite storage in a compressed format. I am using SQLite in an embedded platform with limited storage. This seems difficult to implement with read-write. However, I am just recording data (write once).


by Mike Sample 2005-09-01

Stream-style r/w access to BLOBs and large text values would be nice.

by Anonymous 2007-06-28

incremental BLOB API is implemented in SQLite 3.4.0


2006-03-31

A nice feature would be a function that tests wether a sting is a valid sql-query. The complete function only looks wether there is a ";" at the end.

To tell if a string is valid SQL you need to parse it and verify all the referenced tables and columns exist etc. This is exactly what sqlite3_prepare does. You can write your own function to do this as shown below. It prepares the statement, finalizes the generated statement if any, and checks that the entire SQL string was parsed. If it is valid SQL it return SQLITE_OK, otherwise you will get an sqlite error code.

    int sql_valid(sqlite3 *db, const char *sql)
    {
        int len;
        sqlite3_stmt *s;
        const char *t;
        int err;

        if (!db || !sql)
            return SQLITE_ERROR;

        len = strlen(sql);

        err = sqlite3_prepare(db, sql, len, &s, &t);
        if (s)
            sqlite3_finalize(s);
        if (err)
            return err;
        if (t != sql + len)
            return SQLITE_ERROR
        return SQLITE_OK;
    }


2006-05-20

Binary date handling instead of text based. For example using 32-bit size of date is 4 bytes instead of 10 (2006-02-02) the result is 6 bytes saved

You can currently store your dates as unix timestamps (i.e integer count of seconds since Jan 1, 1970). The time and date functions at http://www.sqlite.org/cvstrac/wiki?p=DateAndTimeFunctions show how convert these values to time and/or date strings for display as well as other formats such as Julian day numbers.


2006-06-29 example:

select * from tblproject where projectid in (3,1,2) order by field(projectid,3,1,2)

sorting with my own rules (function from mysql)


2006-10-29 Is possible to upload precompiled binary dll for PocketPC (ARM)?


2007-01-29 Now that CHECK constraints are enforced, there should be a standard, stable way to

possibly via a new PRAGMA. This allows programs to more completely


Attachments:

  • shadow.pdf 55188 bytes added by anonymous on 2004-Mar-31 03:50:15 UTC.