Friday, March 21, 2025

SQL ANTI JOIN vs NOT IN vs NOT EXISTS in Rails and ActiveRecord

SQL is not my specialty but from time to time it is necessary to pull off something more elaborate.

My way to explain ANTI JOIN is to find records in a table that lack an associated record in another.

Lets imagine we have a table "Messages" and a table "MessageRecipients". We want to find out which Messages do not have any recipient records or if they have, that all of them are marked as deleted.

But lets first look at a simpler way to achieve this

Using NOT IN

 Look at this relatively easy to understand scope

```
scope :stale, -> do
  where.not(id: MessageRecipient.not_deleted.select(:message_id))
end

```

The `non_deleted` scope results in `deleted_at IS NULL`, not important to include here.

This results in the following SQL:

```
SELECT `messages`.* FROM `messages` WHERE `messages`.`id` NOT IN (SELECT `message_recipients`.`message_id` FROM `message_recipients` WHERE `message_recipients`.`deleted_at` IS NULL)
```

It is a correct query but doesn't allow MySQL to realize it's purpose is anti-join. Thus  the explanation is this:

```
mysql> EXPLAIN SELECT `messages`.* FROM `messages` WHERE `messages`.`id` NOT IN

(SELECT `message_recipients`.`message_id` FROM `message_recipients` WHERE `messa
ge_recipients`.`deleted_at` IS NULL);
+----+--------------+--------------------+------------+--------+----------------
---------------------------------+---------------------+---------+--------------
-----------------+----------+----------+-------------------------+
| id | select_type  | table              | partitions | type   | possible_keys  
                                | key                 | key_len | ref          
                | rows     | filtered | Extra                   |
+----+--------------+--------------------+------------+--------+----------------
---------------------------------+---------------------+---------+--------------
-----------------+----------+----------+-------------------------+
|  1 | SIMPLE       | messages           | NULL       | ALL    | NULL           
                                | NULL                | NULL    | NULL         
                | 21464936 |   100.00 | NULL                    |
|  1 | SIMPLE       | <subquery2>        | NULL       | eq_ref | <auto_distinct_
key>                             | <auto_distinct_key> | 9       | system_enterp
rise.messages.id |        1 |   100.00 | Using where; Not exists |
|  2 | MATERIALIZED | message_recipients | NULL       | ALL    | index_message_r
ecipients_on_message_id_and_kind | NULL                | NULL    | NULL         
                | 21481299 |   100.00 | Using where             |
+----+--------------+--------------------+------------+--------+----------------
---------------------------------+---------------------+---------+--------------
-----------------+----------+----------+-------------------------+
3 rows in set, 1 warning (0.00 sec)
```


In the TREE format that is

```
| -> Nested loop antijoin  (cost=46.1e+12 rows=461e+12)

   -> Table scan on messages  (cost=2.3e+6 rows=21.5e+6)
   -> Single-row index lookup on <subquery2> using <auto_distinct_key> (message
_id=messages.id)  (cost=4.32e+6..4.32e+6 rows=1)
       -> Materialize with deduplication  (cost=4.32e+6..4.32e+6 rows=21.5e+6)
           -> Filter: (message_recipients.message_id is not null)  (cost=2.17e+
6 rows=21.5e+6)
               -> Filter: (message_recipients.deleted_at is null)  (cost=2.17e+
6 rows=21.5e+6)
                   -> Table scan on message_recipients  (cost=2.17e+6 rows=21.5
e+6)
```

You see that it needs a materialized view. `message_recipients` is scanned without an index.

But lets try another better understood construct.

Using NOT EXISTS

 This is how we can define this with baby_squeel

```
scope :stale, -> do
    where.has { not_exists(MessageRecipient.not_deleted.select(:id).where.has { message_id == BabySqueel[:messages].id }) }
end

```

The resulting SQL is

```
SELECT `messages`.* FROM `messages` WHERE NOT EXISTS(SELECT `message_recipients`.`id` FROM `message_recipients` WHERE `message_recipients`.`deleted_at` IS NULL AND `message_recipients`.`message_id` = `messages`.`id`)
```

This has exactly the same result and exactly the same optimizer explanation so still not ideal. MySQL is documented to convert NOT EXISTS in anti-join but in this case, perhaps because of the additional `deleted_at` condition, it doesn't do so.

Lets assume that a future version of MySQL might be able to optimize this properly. In the mean time though, let us help it but instead

Using LEFT OUTER JOIN

We can define the scope again with the help of baby_squeel

```
scope :stale, -> do
joining { recipients.on((recipients.message_id == id) & recipients.sift(:not_deleted)).outer }.
      where.has { BabySqueel[:message_recipients].id == nil }
```

Basically we LEFT OUTER JOIN the `recipients` association which has a class `MessageRecipient`. We only JOIN ON recipients which are `not_deleted`. But then we filter only records that do not have a joined message recipient.

That's how MySQL recognizes that it can use an antijoin.

We use `BabySqueel[:message_recipients].id` instead of `recipients.id` because it translates to something wrong, maybe a bug or a deficiency of baby_squeel.

The resulting SQL is

```
SELECT `messages`.* FROM `messages` LEFT OUTER JOIN `message_recipients` ON
message_recipients`.`message_id` = `messages`.`id` AND `message_recipients`.`deleted_at` IS
NULL WHERE `message_recipients`.`id` IS NULL
```

Now optimizer explanation is

```
+----+-------------+--------------------+------------+------+-------------------
------------------------------+-------------------------------------------------
+---------+-------------------------------+----------+----------+---------------
----------+
| id | select_type | table              | partitions | type | possible_keys     
                             | key                                             
| key_len | ref                           | rows     | filtered | Extra         
         |
+----+-------------+--------------------+------------+------+-------------------
------------------------------+-------------------------------------------------
+---------+-------------------------------+----------+----------+---------------
----------+
|  1 | SIMPLE      | messages           | NULL       | ALL  | NULL              
                             | NULL                                            
| NULL    | NULL                          | 21465027 |   100.00 | NULL          
         |
|  1 | SIMPLE      | message_recipients | NULL       | ref  | index_message_reci
pients_on_message_id_and_kind | index_message_recipients_on_message_id_and_kind
| 8       | system_enterprise.messages.id |        1 |    10.00 | Using where; N
ot exists |
+----+-------------+--------------------+------------+------+-------------------
------------------------------+-------------------------------------------------
+---------+-------------------------------+----------+----------+---------------
----------+
2 rows in set, 1 warning (0.00 sec)
```

In TREE format

```
-> Filter: (message_recipients.id is null)  (cost=9.87e+6 rows=2.16e+6)
   -> Nested loop antijoin  (cost=9.87e+6 rows=2.16e+6)
       -> Table scan on messages  (cost=2.3e+6 rows=21.5e+6)
       -> Filter: (message_recipients.deleted_at is null)  (cost=0.252 rows=0.1
01)
           -> Index lookup on message_recipients using index_message_recipients
_on_message_id_and_kind (message_id=messages.id)  (cost=0.252 rows=1.01)
```

It now uses index instead of a table scan on `message_recipients`. And overall SIMPLE select types, everything much simpler. On my server and data, the query works twice as fast as the other ones.

Excuse me for poor formatting. I hopefully get back to fix it later. I thought it may save someones time.

No comments:

Post a Comment