SQLite not using covering indices when available

(1.1) By franztesca on 2024-07-29 15:17:45 edited from 1.0 [link] [source]

I'm trying to optimize my application by adding some covering indexing for important queries that we run frequently. Some of these indices would also include a string-comparison expression, which should remove the need of string compare as the result of the comparsison should be already in the index. However, it seems that no matter what, SQLite will not use the available indices unless forced with INDEXED BY.

Is there a rational reason why SQLite would do a row lookup (+ expression evaluation) instead of using the available covering index?

This is a playground (can be executed on the sqlite fiddle) similar to my setup:

CREATE TABLE IF NOT EXISTS test(id PRIMARY KEY, value1, value2);
CREATE INDEX IF NOT EXISTS test_index_1 ON test(id, value1, value2 = 'test_value');
CREATE INDEX IF NOT EXISTS test_index_2 ON test(id, value1, value2);

REPLACE INTO test VALUES ('a', 10, 'test_value'), ('b', 'abc', 'test_value'), ('c', 'abc', 'hello');

EXPLAIN QUERY PLAN SELECT id FROM test WHERE id = 'abc' AND value1 = 42 AND value2 = 'test_value';
EXPLAIN QUERY PLAN SELECT id FROM test INDEXED BY test_index_1 WHERE id = 'abc' AND value1 = 42 AND value2 = 'test_value';
EXPLAIN QUERY PLAN SELECT id FROM test INDEXED BY test_index_2 WHERE id = 'abc' AND value1 = 42 AND value2 = 'test_value';

and the output is

QUERY PLAN
`--SEARCH test USING INDEX sqlite_autoindex_test_1 (id=?)
QUERY PLAN
`--SEARCH test USING INDEX test_index_1 (id=? AND value1=?)
QUERY PLAN
`--SEARCH test USING COVERING INDEX test_index_2 (id=? AND value1=? AND value2=?)

Why isn't the expression in test_index_1 used even if the query has the same exact expression? Why doesn't the query use either test_index_1 or test_index_2 by default?

Thanks to anyone who can clarify!

(2) By drh on 2024-07-29 15:21:57 in reply to 1.0 [source]

SQLite chooses the index by picking the one that is estimated to give the lowest total runtime. So the reason why it chooses the PRIMARY KEY is because it believes that will be fastest.

The PRIMARY KEY is a unique index after all. It will only return a single row. The secondary indexes are not unique. If you make test_index_1 UNIQUE, the query planner will choose it instead of the PRIMARY KEY.

As far as SQLite sees, the test_index_1 index might return multiple rows. It does not understand that it must a unique index because it contains the PRIMARY KEY. That is not something that has come up before, nor something that I've every thought to try to optimize.

(3) By franztesca on 2024-07-29 16:20:25 in reply to 2 [link] [source]

Ah, that makes sense, thanks a lot!

However, there is still the second point unresolved: why isn't test_index_1 (the expression-based) a valid covering index? Even after adding the UNIQUE constraint, SQLite will refuse to choose test_index_1 (which should be a better index than test_index_2) and will not use test_index_1 as a covering index even when forced.

QUERY PLAN
`--SEARCH test USING COVERING INDEX test_index_2 (id=? AND value1=? AND value2=?)
QUERY PLAN
`--SEARCH test USING INDEX test_index_1 (id=? AND value1=?)
QUERY PLAN
`--SEARCH test USING COVERING INDEX test_index_2 (id=? AND value1=? AND value2=?)

Any idea why?

(4) By drh on 2024-07-29 16:28:54 in reply to 3 [link] [source]

The query planner is just not smart enough to deal with that right now. You can work around the limitation by writing:

  SELECT id
    FROM test
   WHERE id='abc'
     AND value1=42
     AND (value2='test value')=TRUE;

In word: make the constraint an equality constraint rather than a bare boolean value.

(5) By franztesca on 2025-01-18 13:49:42 in reply to 2 [link] [source]

Sorry for re-opening this old discussion, but I found a similar case that is not solved by setting the covering index to unique: the query planner will choose the primary key in any case. This is reproducible with the following example:

CREATE TABLE IF NOT EXISTS person(id TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS company(id TEXT PRIMARY KEY, profitable); 
CREATE TABLE IF NOT EXISTS personCompany(personID, companyID, PRIMARY KEY (personID, companyID));

# This covering index is there to avoid the row lookup when we want to just get the profitable value 
CREATE UNIQUE INDEX IF NOT EXISTS companyProfitableCovering ON company(id, profitable);

# With subquery and no INDEXED BY
EXPLAIN QUERY PLAN SELECT id, (
  SELECT COUNT(*) FROM personCompany
  WHERE personID = person.id AND (
    SELECT profitable FROM company
    WHERE id = personCompany.companyID
  )
) FROM person;

# With subquery and INDEXED BY
EXPLAIN QUERY PLAN SELECT id, (
  SELECT COUNT(*) FROM personCompany
  WHERE personID = person.id AND (
    SELECT profitable FROM company INDEXED BY companyProfitableCovering
    WHERE id = personCompany.companyID
  )
) FROM person;

# With join and no INDEXED BY
EXPLAIN QUERY PLAN SELECT person.id, COUNT(company.id) FROM person
LEFT JOIN personCompany on person.id = personID
LEFT JOIN company ON company.id = companyID
WHERE profitable
GROUP BY person.id;

# With join and INDEXED BY
EXPLAIN QUERY PLAN SELECT person.id, COUNT(company.id) FROM person
LEFT JOIN personCompany on person.id = personID
LEFT JOIN company INDEXED BY companyProfitableCovering ON company.id = companyID
WHERE profitable
GROUP BY person.id;

The 4 queries print the following plans:

# With subquery and no INDEXED BY
QUERY PLAN
|--SCAN person
`--CORRELATED SCALAR SUBQUERY 2
   |--SEARCH personCompany USING COVERING INDEX sqlite_autoindex_personCompany_1 (personID=?)
   `--CORRELATED SCALAR SUBQUERY 1
      `--SEARCH company USING INDEX sqlite_autoindex_company_1 (id=?)

# With subquery and INDEXED BY
QUERY PLAN
|--SCAN person
`--CORRELATED SCALAR SUBQUERY 2
   |--SEARCH personCompany USING COVERING INDEX sqlite_autoindex_personCompany_1 (personID=?)
   `--CORRELATED SCALAR SUBQUERY 1
      `--SEARCH company USING COVERING INDEX companyProfitableCovering (id=?)

# With join and no INDEXED BY
QUERY PLAN
|--SCAN person USING COVERING INDEX sqlite_autoindex_person_1
|--SEARCH personCompany USING COVERING INDEX sqlite_autoindex_personCompany_1 (personID=?) LEFT-JOIN
`--SEARCH company USING INDEX sqlite_autoindex_company_1 (id=?)

# With join and INDEXED BY
QUERY PLAN
|--SCAN person USING COVERING INDEX sqlite_autoindex_person_1
|--SEARCH personCompany USING COVERING INDEX sqlite_autoindex_personCompany_1 (personID=?) LEFT-JOIN
`--SEARCH company USING COVERING INDEX companyProfitableCovering (id=?)

Indicating that the covering index is there and usuable, but not chosen by the planner until forced with INDEXED BY. In our application we deal with lots of records and want to optimize as much as possible reading speed, for which we though about using such a covering index. However, it doesn't work as expected.

Any clues on what we could be doing wrong?

Thank you!

(6) By drh on 2025-01-18 14:07:01 in reply to 5 [link] [source]

We ... want to optimize as much as possible reading speed,

Have you actually measured suboptimal performance, or are you just guessing based on what you see in the EXPLAIN QUERY PLAN output? Do you have an test case (with test data) that we can run and time to see suboptimal performance? Can you please share your test data?

(7) By franztesca on 2025-01-18 16:21:31 in reply to 6 [link] [source]

In our case, the query without covering index takes on average ~2300ms, where the same query with the covering index (forced by INDEXED BY) takes on average ~800ms.

The dataset in our test case (taken from a real-world user case) has the following number of entries (in respect to the example above):

person: ~5000 rows
company: ~150k rows
personCompany: ~670k rows (on average each company has ~4 people)

The base query with the subquery generates the following .scanstat on:

QUERY PLAN
|--SCAN person                                                                                  (loops=1 rows=5353)
`--CORRELATED SCALAR SUBQUERY 2                                                                 (loops=5353)
   |--CORRELATED SCALAR SUBQUERY 1                                                              (loops=672858)
   |  `--SEARCH company USING INDEX sqlite_autoindex_company_1 (id=?)                           (loops=672858 rows=672858)
   `--SEARCH personCompany USING COVERING INDEX sqlite_autoindex_personCompany_1 (personID=?)   (loops=5353 rows=672858)

By increasing the cache size to 4GB (enough to hold the whole DB in memory), the timing goes to ~1400ms for the query without covering index and ~620ms for the same query with the covering index (forced by INDEXED BY).

If you find it useful, I can send you an stripped-down/anonymized version of the database.

(11.1) By drh on 2025-01-24 19:05:51 edited from 11.0 in reply to 7 [link] [source]

Please do send the stripped-down/anonymized version of the database.

In the test script below, I'm getting essentially identical timings regardless of whether or not the companyProfitableCoverging index is used:

CREATE TABLE person(id TEXT PRIMARY KEY);
WITH RECURSIVE cnt(n) AS (VALUES(1) UNION ALL SELECT n+1 FROM cnt WHERE n<10000)
  INSERT INTO person(id) SELECT n FROM cnt;
CREATE TABLE company(id TEXT PRIMARY KEY, profitable);
WITH RECURSIVE cnt(n) AS (VALUES(1) UNION ALL SELECT n+1 FROM cnt WHERE n<100)
  INSERT INTO company(id, profitable) SELECT n, n%2 FROM cnt;
CREATE TABLE personCompany(personID,companyID,PRIMARY KEY(personID,companyID));
WITH RECURSIVE c1(x) AS (VALUES(1) UNION ALL SELECT x+1 FROM c1 WHERE x<1000),
               c2(y) AS (VALUES(1) UNION ALL SELECT y+1 FROM c2 WHERE y<100)
  INSERT INTO personCompany(personID,companyID) SELECT x, y FROM c1, c2;
CREATE UNIQUE INDEX companyProfitableCovering ON company(id, profitable);

.echo on
.mode count
.timer on

/* With subquery and no INDEXED BY */
SELECT id, (
  SELECT COUNT(*) FROM personCompany
   WHERE personID = person.id
     AND (
           SELECT profitable 
             FROM company
            WHERE id = personCompany.companyID
         )
) FROM person;
/* With subquery and INDEXED BY */
SELECT id, (
  SELECT COUNT(*) FROM personCompany
   WHERE personID = person.id
     AND (
            SELECT profitable
              FROM company INDEXED BY companyProfitableCovering
             WHERE id = personCompany.companyID
         )
) FROM person;

/* With join and no INDEXED BY */
SELECT person.id, COUNT(company.id)
  FROM person
  LEFT JOIN personCompany on person.id = personID
  LEFT JOIN company ON company.id = companyID AND profitable
 GROUP BY person.id;
/* With join and INDEXED BY */
SELECT person.id, COUNT(company.id)
  FROM person
  LEFT JOIN personCompany on person.id = personCompany.personID
  LEFT JOIN company INDEXED BY companyProfitableCovering
                    ON company.id = personCompany.companyID AND profitable
 GROUP BY person.id;

(12) By anonymous on 2025-01-24 19:44:01 in reply to 11.1 [link] [source]

I am cringing reading what this query appears to be doing. Are you tracking people based on profitability to a company? seems like it's awfully close to the lines regarding the license terms of sqlite.

(now walking away quickly, everybody carry on! - sure there will be blowback, but it needed to be said)

(13) By drh on 2025-01-24 19:48:07 in reply to 12 [link] [source]

I believe the example database has been greatly simplified from a real-world case and the simplifications distort the semantics significantly. I would not be alarmed.

(8) By drh on 2025-01-18 18:45:22 in reply to 5 [link] [source]

Why not just drop the "companyProfitableCovering" index and make the "company" table a WITHOUT ROWID table. I think that will resolve any performance issues you are having, with the added benefit of making your database smaller. (You have your data, and I don't, so you will have to be the one to verify the previous sentence.) For the schema you show, you would do well, I think, to make all three tables WITHOUT ROWID.

(9) By franztesca on 2025-01-19 10:28:30 in reply to 8 [link] [source]

The provided schema is a minimal reproducible example. In practice our schema has more columns for both person and company, company in particular has 40 columns, so each row is quite big. IDs are standard uuid strings. I tried moving both tables to WITHOUT ROWID (and even just moving moving company), the performance greatly deteriorates: the same query now takes around ~10500ms. The query plan seems as expected:

QUERY PLAN
|--SCAN person                                                                                  (loops=1 rows=5353)
`--CORRELATED SCALAR SUBQUERY 2                                                                 (loops=5353)
   |--CORRELATED SCALAR SUBQUERY 1                                                              (loops=672858)
   |  `--SEARCH company USING PRIMARY KEY (id=?)                                                (loops=672858 rows=672858)
   `--SEARCH personCompany USING COVERING INDEX sqlite_autoindex_personCompany_1 (personID=?)   (loops=5353 rows=672858)

So probably not a viable option for us.

(10) By slavin on 2025-01-19 14:03:14 in reply to 1.1 [link] [source]

Please use the ANALYZE command on that database, then check to see whether it changes either times or query plans.