how to retrieve 'n' rows from a LEFT JOIN

(1) By punkish on 2024-07-28 20:48:17 [link] [source]

This is related to my earlier post to which @SeverKetor helpfully provided very useful guidance.

If I have two tables, a and b, such that each row in a can have zero or many rows in b, how do I retrieve 'n' unique rows from a and all the related rows from b? A naive query like so won't work

SELECT 
    a.id, a.col1, a.col2,
    b.col1, b.col2
FROM
    a LEFT JOIN 
    b ON a.b_id = b.id
WHERE
    a.col1 = ?
GROUP BY
    a.id
ORDER BY a.id
LIMIT 30 OFFSET 0

for the moment, let's forget about the potential inefficiency of LIMIT ? OFFSET ?

The above query will get me 30 rows but the rows from table a (will/may) not be unique so when I group the query by a.id, I will likely get fewer than 30 rows.

If I am retrieving rows based on a WHERE clause on table a, I could retrieve 'n' rows from that table, and then retrieve the corresponding rows from table b and combine them. This is cumbersome but would work.

-- query1
SELECT 
    a.id, a.col1, a.col2
FROM
    a 
WHERE
    a.col1 = ?
ORDER BY a.id
LIMIT 30 OFFSET 0

-- query2
SELECT 
    b.col1, b.col2
FROM b
WHERE
    b.id IN (
        SELECT a.b_id 
        FROM a 
        WHERE a.col1 = ?
        ORDER BY a.id
        LIMIT 30 OFFSET 0
    )

But what if I have to perform the query with a WHERE clause on table b? Then I have to perform a JOIN on the two tables to get the rows from table a

SELECT 
    a.id, a.col1, a.col2,
    b.col1, b.col2
FROM
    a LEFT JOIN 
    b ON a.b_id = b.id
WHERE
    b.col1 = ? AND b.col2 = ?
GROUP BY
    a.id
ORDER BY a.id
LIMIT 30 OFFSET 0

I am back to my problem of very likely getting fewer than 30 unique rows of table a.

Seems like this should be a common problem with a simple solution. Hopefully I am overthinking it and someone can set me right.

(2) By SeverKetor on 2024-07-28 21:03:56 in reply to 1 [link] [source]

I was about to start thinking about how to write the query when I came up with the easier idea: just stop fetching rows once you hit 30. Unless you're using wrapper in another language that forces you to get all rows at once, I'm not sure there'd be an issue here.

(3) By _blgl_ on 2024-07-28 21:09:33 in reply to 1 [link] [source]

You can join subqueries.

select a30.id, a30.col1, a30.col2, b.col1, b.col2
from
    (select * from a
         where col1 = ?
         limit 30) a30
    left join b on a30.b_id = b.id
    where b.col1 = ? and b.col2 = ?;

(4) By punkish on 2024-07-28 21:29:19 in reply to 3 [source]

This seems ingeniously simple and do-able. I just tested it and it works. Of course, I get repeats of rows from table a whenever there are more than one related rows in table b, but I can group them in post-processing when I package them as JSON, ending up with 30 unique rows from table a.

Many thanks for this.

(5) By punkish on 2024-07-28 21:31:05 in reply to 2 [link] [source]

I am working with nodejs. I believe there might be a way to iterate over the records and stop after 30. But I think Bo's solution in this thread might be more elegant. In any case, I will try out both and go with the most performant one. Thanks for the idea.

(6.2) By punkish on 2024-07-29 10:54:51 edited from 6.1 in reply to 4 [link] [source]

A final update below for future readers:

first update: after a few more tests:

Bo's query is much faster, but the answer seems incorrect. The query first finds 30 rows from the first table where the first set of criteria are satisfied, then for that 30, it finds the related rows from the second table. The resulting answer has fewer rows than I expect.

SeverKetor's method of iterating and counting is much slower, but seems to query the entire set from the relationship. As a result, I get more rows, more in line with what I expect.

In both cases, I have to post-process the result to group the related rows so I get only 30 unique rows from the first table and the related rows from the second table.

Of course, it is possible that I am cross-eyed and am doing something wrong. I will double-triple check my code tomorrow and see if I can find the error of my ways.

In either case, thanks for both ideas.

final update

SeverKetor's method produces the correct result esp. when I have a WHERE clause on the second table. If the WHERE clause is only on the first table, both methods produce identical results and Bo's method is faster, but that is also because the result set is smaller (and thus, incomplete). Iteration is easier to implement no matter what kind of WHERE clause as I have to program for only one set of conditions.

I am using nodejs and better-sqlite3 which does have an iterate() method, so this works quite well.