Consider a database schema such as the following:

    CREATE TABLE node(
       id INTEGER PRIMARY KEY,
       name TEXT
    );
    CREATE INDEX node_idx ON node(name);
    CREATE TABLE edge(
       orig INTEGER REFERENCES node,
       dest INTEGER REFERENCES node,
       PRIMARY KEY(orig, dest)
    );
    CREATE INDEX edge_idx ON edge(dest,orig);

This database defines a directed graph with the ability to store a name at each node. Now consider a query against this database:

    SELECT *
      FROM edge AS e,
           node AS n1,
           node AS n2
     WHERE n1.name = 'alice'
       AND n2.name = 'bob'
       AND e.orig = n1.id
       AND e.dest = n2.id;

This query asks for all information about edges that go from nodes labeled "alice" to nodes labeled "bob".
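As a concrete way to experiment with the schema and query, here is a minimal sketch using Python's built-in sqlite3 module; the tiny three-node graph is invented purely for illustration:

```python
import sqlite3

# In-memory database using the schema from the article.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE node(
       id INTEGER PRIMARY KEY,
       name TEXT
    );
    CREATE INDEX node_idx ON node(name);
    CREATE TABLE edge(
       orig INTEGER REFERENCES node,
       dest INTEGER REFERENCES node,
       PRIMARY KEY(orig, dest)
    );
    CREATE INDEX edge_idx ON edge(dest,orig);
""")

# A tiny example graph: alice -> bob, alice -> carol.
db.executemany("INSERT INTO node(id, name) VALUES (?, ?)",
               [(1, "alice"), (2, "bob"), (3, "carol")])
db.executemany("INSERT INTO edge(orig, dest) VALUES (?, ?)",
               [(1, 2), (1, 3)])

rows = db.execute("""
    SELECT *
      FROM edge AS e,
           node AS n1,
           node AS n2
     WHERE n1.name = 'alice'
       AND n2.name = 'bob'
       AND e.orig = n1.id
       AND e.dest = n2.id
""").fetchall()
# rows holds one tuple per alice -> bob edge, columns in FROM-clause
# order: e.orig, e.dest, n1.id, n1.name, n2.id, n2.name
```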

The query optimizer in SQLite has essentially two choices for how to implement this query. Pseudocode demonstrating these two choices is as follows:

Option 1:

    foreach n1 where n1.name='alice' do:
      foreach n2 where n2.name='bob' do:
        foreach e where e.orig=n1.id and e.dest=n2.id do:
          return n1.*, n2.*, e.*
        end
      end
    end

Option 2:

    foreach n1 where n1.name='alice' do:
      foreach e where e.orig=n1.id do:
        foreach n2 where n2.id=e.dest and n2.name='bob' do:
          return n1.*, n2.*, e.*
        end
      end
    end

The same indices are used to speed up every loop in both options. The only difference between these two query plans is the order in which the loops are nested. You could generate other query plans by interchanging n1 and n2, but the results would be equivalent, so we will only consider the two query plans above.

So which query plan is better? It turns out that the answer depends on what kind of data is found in the node and edge tables.

Let the number of alice nodes be M and the number of bob nodes be N. Consider two scenarios. In the first scenario, M and N are both 2 but there are thousands of edges on each node. In this case, option 1 is preferred. With option 1, the inner loop checks for the existence of an edge between a pair of nodes and outputs the result if found. But because there are only 2 alice nodes and 2 bob nodes, the inner loop only has to run 4 times and the query is very quick. Option 2 would take much longer here. The outer loop of option 2 only executes twice, but because there are a large number of edges leaving each alice node, the middle loop has to iterate many thousands of times. It will be much slower. So in the first scenario, we prefer to use option 1.

Now consider the case where M and N are both 3500. Alice nodes are abundant. But suppose each of these nodes is connected by only one or two edges. In this case, option 2 is preferred. With option 2, the outer loop still has to run 3500 times, but the middle loop only runs once or twice for each outer loop and the inner loop will only run once for each middle loop, if at all. So the total number of iterations of the inner loop is around 7000. Option 1, on the other hand, has to run both its outer loop and its middle loop 3500 times each, resulting in over 12 million iterations of the middle loop. Thus in the second scenario, option 2 is roughly 1750 times faster than option 1.

So you can see that depending on how the data is structured in the table, either query plan 1 or query plan 2 might be better. How does SQLite decide?

If you do not give SQLite any hints, it will choose option 1. This is the crux of ticket #1414. In that ticket there is a situation similar to the second scenario. By default, SQLite chooses option 1, which for the particular data set described in ticket #1414 is very slow. But if the ANALYZE command is run in order to gather statistics, SQLite chooses option 2 instead, which is fast in comparison.

Query Plans

In SQLite, a join is always executed as a series of nested loops. Prior to version 3.2.3, the order of the tables in the FROM clause determined the nesting of the loops. The first table in the FROM clause became the outer loop and the last table became the inner loop and all intermediate tables appeared in order as middle loops. In this way, the SQL programmer had complete control over the query plan.

Beginning with version 3.2.3, the SQLite query optimizer has the ability to reorder tables in the FROM clause if it thinks doing so will result in a faster query. SQLite uses a greedy algorithm to determine loop nesting order. When developing a query plan, the SQLite optimizer looks first for a table in the join that requires the least amount of work to be processed. This table becomes the outer loop. Then SQLite looks for the next easiest table to process, which becomes the next loop, and so forth. In the event of a tie, the order of the tables in the original FROM clause becomes the tie breaker.

To determine how much work is required to process a table, SQLite considers such factors as the availability of indices, whether or not the use of indices will obviate the need to sort, and how selective an index is. Some indices might reduce a search from a million rows down to a few thousand rows. Other indices might reduce the search down to a single row. The latter indices would be preferred, of course. The purpose of the ANALYZE command is to gather statistics on how selective an index is so that the optimizer can make an informed guess about which indices will reduce the search to thousands of rows versus one or two rows.

Beginning in version 3.2.6, there is an experimental command that will display a synopsis of the query plan chosen. Recall that by prepending the keyword EXPLAIN to any SQLite statement you can get a listing of the virtual machine program that implements the statement. The new experimental feature is to prepend three keywords: EXPLAIN QUERY PLAN. With this three-word prefix, the result is a succinct description of which tables are scanned and which indices are used to help with the scan. From this it is much easier to deduce the query plan than by trying to decipher a potentially long and cryptic VM program.
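For example, EXPLAIN QUERY PLAN can be run from Python's built-in sqlite3 module as sketched below. The exact wording of the output has changed across SQLite versions, so treat the detail strings as illustrative; what is stable is that each loop of the join produces one row of output, listed from outermost to innermost:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE node(id INTEGER PRIMARY KEY, name TEXT);
    CREATE INDEX node_idx ON node(name);
    CREATE TABLE edge(orig INTEGER REFERENCES node,
                      dest INTEGER REFERENCES node,
                      PRIMARY KEY(orig, dest));
    CREATE INDEX edge_idx ON edge(dest,orig);
""")

# One result row per loop of the join, outermost first.
plan = db.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM edge AS e, node AS n1, node AS n2
     WHERE n1.name = 'alice' AND n2.name = 'bob'
       AND e.orig = n1.id AND e.dest = n2.id
""").fetchall()
for row in plan:
    # The last column is the human-readable detail,
    # e.g. "SEARCH n1 USING COVERING INDEX node_idx (name=?)"
    print(row[-1])
```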

Manual Control Of Query Plans

SQLite provides the ability for advanced programmers to exercise control over the query plan chosen by the optimizer. One method for doing this is to fudge the ANALYZE results. All results from the ANALYZE command are stored in a special table named SQLITE_STAT1. This table can only be created by running the ANALYZE command, but once the table exists, knowledgeable programmers can alter its contents using standard INSERT, UPDATE, and DELETE commands in order to fake out the query optimizer. For a program that uses an SQLite database as its application file format, one development approach would be to construct a large database containing typical data and run the ANALYZE command to gather statistics. Then modify the program so that when it creates new SQLite databases, it runs the ANALYZE command during schema creation in order to create the SQLITE_STAT1 table, then loads the SQLITE_STAT1 table with values saved from trial runs during development. In that way, statistics from large working data sets can be preloaded into newly created application files.
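A minimal sketch of that development approach, using Python's built-in sqlite3 module with two in-memory databases standing in for the trial-run database and a freshly created application file:

```python
import sqlite3

SCHEMA = """
    CREATE TABLE node(id INTEGER PRIMARY KEY, name TEXT);
    CREATE INDEX node_idx ON node(name);
"""

# Development time: a database loaded with representative data.
dev = sqlite3.connect(":memory:")
dev.executescript(SCHEMA)
dev.executemany("INSERT INTO node(name) VALUES (?)",
                [("alice",)] * 1000 + [("bob",)] * 2)
dev.execute("ANALYZE")
saved = dev.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()

# Deployment time: a fresh, empty database.  Running ANALYZE creates
# the sqlite_stat1 table; the saved statistics are then copied in so
# the optimizer plans as if typical data were already present.
app = sqlite3.connect(":memory:")
app.executescript(SCHEMA)
app.execute("ANALYZE")
app.execute("DELETE FROM sqlite_stat1")
app.executemany("INSERT INTO sqlite_stat1(tbl, idx, stat) VALUES (?,?,?)",
                saved)
app.commit()
```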

Another approach to manually controlling the query plan is to use some peculiar syntax on joins that disables the table reordering optimization. If you use the keyword CROSS in a join, then the two tables connected by that join will not be reordered. So in the following query, the optimizer is free to reorder the tables of the FROM clause any way it sees fit:

    SELECT *
      FROM node AS n1,
           edge AS e,
           node AS n2
     WHERE n1.name = 'alice'
       AND n2.name = 'bob'
       AND e.orig = n1.id
       AND e.dest = n2.id;

But in the following logically equivalent formulation of the query, the substitution of "CROSS JOIN" for the "," means that the order of tables must be N1, E, N2.

    SELECT *
      FROM node AS n1 CROSS JOIN
           edge AS e CROSS JOIN
           node AS n2
     WHERE n1.name = 'alice'
       AND n2.name = 'bob'
       AND e.orig = n1.id
       AND e.dest = n2.id;

Hence, in the second form, the query plan must be option 2. Note that you must use the keyword CROSS in order to disable the table reordering optimization. INNER JOIN, NATURAL JOIN, JOIN, and other similar combinations work just like a comma join in that the optimizer is free to reorder tables as it sees fit. (Table reordering is also disabled on an outer join, but that is because outer joins are not associative or commutative. Reordering tables in outer joins changes the result.) The use of the CROSS keyword to disable the table reordering by the optimizer is a planned feature for SQLite 3.2.6.
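In a version of SQLite that honors the CROSS keyword, EXPLAIN QUERY PLAN can be used to confirm that the nesting order is pinned. The sketch below (again using Python's built-in sqlite3 module) checks that n1 is the outer loop; the exact detail text varies by SQLite version:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE node(id INTEGER PRIMARY KEY, name TEXT);
    CREATE INDEX node_idx ON node(name);
    CREATE TABLE edge(orig INTEGER REFERENCES node,
                      dest INTEGER REFERENCES node,
                      PRIMARY KEY(orig, dest));
    CREATE INDEX edge_idx ON edge(dest,orig);
""")

# CROSS JOIN forbids reordering, so the loops must nest as n1, e, n2.
plan = db.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM node AS n1 CROSS JOIN edge AS e CROSS JOIN node AS n2
     WHERE n1.name = 'alice' AND n2.name = 'bob'
       AND e.orig = n1.id AND e.dest = n2.id
""").fetchall()
details = [row[-1] for row in plan]  # outermost loop first
```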


PostgreSQL apparently uses the same trick (an explicit CROSS JOIN) to disable table reordering. See http://www.postgresql.org/docs/8.1/interactive/explicit-joins.html