Subqueries: the new strategy for "NULL IN (SELECT …)"

I hope this is my last post about this topic. It seems we’ve resolved all of the issues and I’ll now describe the user-visible consequences.
To recall, we’re talking about subquery predicates in form

 (oe1, ..., oeN) [NOT] IN (SELECT ie1, ..., ieN FROM ... )

that are located in a context where it matters if the predicate’s result is NULL or FALSE. The name “oe” stands for “outer expression”, ie stands for “inner expression”.

MySQL evaluates queries “from outside to inside”, i.e. we first get the values of (oe1, .. oeN) and then we run the subquery and capture the rows it produces. An apparent and very useful optimization is to “inform” the subquery that we’re looking only for rows that have “ie1=oe1”, “ie2=oe2” and so on. This is done by injecting appropriate equalities into subquery’s WHERE (or HAVING) clause. That is,

 (oe1, ..., oeN) [NOT] IN (SELECT ie1, ..., ieN FROM ... )

becomes

  EXISTS (SELECT 1 /* ie1, ..., ieN */
           FROM ... WHERE subquery_where AND
                          oe1=ie1 AND
                          oe2=ie2 AND
                          ...
                          oeN=ieN))

However, this conversion is only valid if we ignore possible NULL values. If some of the iek can be NULL, then we need to use oek=iek OR iek IS NULL instead. I’ve covered this case in detail here in NULL problem in the right part section.

Correct handling of cases where some oek IS NULL requires more radical changes. We’ve just made those changes and here they are:

The new strategy

According to SQL’s interpretation of NULL as “unknown value”,

  NULL IN (non-empty-list-of-some-values) = NULL
  NULL IN () = FALSE

So, when we want to evaluate

  NULL IN (SELECT ie FROM ... )

we need to run the SELECT and see if it will produce any rows. Note that we need to run the original SELECT, without any injected equalities mentioned in the previous section.

On the other hand, it is absolutely essential to have

  not_null_oe IN (SELECT ie FROM ...)

converted to

  EXISTS (SELECT 1 /* ie1 */ FROM ... WHERE ie1=not_null_oe ...)

If we don’t do this, subqueries will be terribly slow. We’ve solved this “inject or not inject” dilemma by wrapping the injected conditions into triggers. A subquery

  (oe1, ..., oeN) [NOT] IN (SELECT ie1, ..., ieN FROM ... )

is converted into

  EXISTS (SELECT 1 /* ie1, ..., ieN */
           FROM ... WHERE subquery_where AND
                          trigcond(oe1=ie1) AND
                          trigcond(oeN=ieN) AND
                          ...
         )

where each trigcond(X) is a special “magic” function defined as:

  trigcond(X) := X    when the "linked" outer expression oe_i is not NULL
  trigcond(X) := TRUE when the "linked" outer expression oe_i is NULL

Equalities that are wrapped into trigcond() function are not first class predicates for the query optimizer. Most optimizations cannot deal with predicates that may be turned on and off at query execution time, so they assume any trigcond(X) to be unknown function and ignore it. At the moment, triggerered equalities can be used by those optimizations:

  1. Ref-optimizer: trigcond(X=Y [OR Y IS NULL]) can be used to construct ref, eq_ref or ref_or_null table accesses.
  2. Index lookup-based subquery execution engines: trigcond(X=Y) can be used to construct unique_subquery or index_subquery access.
  3. Table condition generator: if the subquery is a join of several tables, triggered condition will be checked as soon as possible.

When the optimizer uses triggered condition to create some kind of index lookup-based access (#1 and #2 in the above list), it must have a strategy for the case when the condition is turned off. This “Plan B” strategy is always the same – do a full table scan. In EXPLAIN the plan B shows up as “Full scan on NULL key” in the “Extra” column:

mysql> explain select t1.col1, t1.col1 IN (select t2.key1 from t2 where t2.col2=t1.col2) from t1
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: t1
        ...
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: t2
         type: index_subquery
possible_keys: key1
          key: key1
      key_len: 5
          ref: func
         rows: 2
        Extra: Using where; Full scan on NULL key

And if you run EXPLAIN EXTENDED …; SHOW WARNINGS you can see the triggered condition:

*************************** 2. row ***************************
  Level: Note
   Code: 1003
Message: select `test`.`t1`.`col1` AS `col1`,<in_optimizer>(`test`.`t1`.`col1`,<exists>(<
index_lookup>(<cache>(`test`.`t1`.`col1`) in t2 on key1 checking NULL where (`test`.`t2`.
`col2` = `test`.`t1`.`col2`) having trigcond(<is_not_null_test>(`test`.`t2`.`key1`))))) AS
`t1.col1 IN (select t2.key1 from t2 where t2.col2=t1.col2)` from `test`.`t1`

Performance implications

The first apparent implication is that NULL IN (SELECT …) now may cause full table scans (slow!) where it previously did not. This is the price to pay for correct results.
For multi-table subqueries the execution of NULL IN (SELECT …) is going to be particularly slow because the join optimizer doesn’t optimize for the case when outer expression is NULL. It assumes that subquery evaluations with NULL on the left side are very rare, even if there is statistics that says otherwise

On the other hand, if you have left expression that may be NULL but actually never is, you will not lose any speed.

The practical hints are

  • A column must be declared as NOT NULL if it really is. This is important for the other parts of the query optimizer too.
  • If you don’t really need the correct NULL/FALSE answer, you can easily avoid the slow execution path: just replace
         oe IN (SELECT ie FROM ...)
    

    with

         (oe IS NOT NULL) AND (oe IN (SELECT ie FROM ...))
    

    and NULL IN (SELECT …) will never be evaluated because MySQL stops evaluating AND parts as soon as the answer is clear.

The goal of this new strategy was to improve compliance and not speed. However we’ve had an intent to not make anything unneccessarily slow. If something became slower for you please file a bug, perhaps we’ll be able to do something about it.

1 Comment

  1. Hi Sergey.

    Really thanks for good articles.
    I have two question about NULL NOT IN (subquery) pattern query.

    1. Eventually, “NULL NOT IN (subquery)” where condition never generate result row.
    So I couldn’t understand why MySQL use trigcond function.
    2. In “Performance implications”:
    >> The first apparent implication is that NULL IN (SELECT …) now may cause full table scans (slow!)
    >> where it previously did not. This is the price to pay for correct results.
    In my test query in which subquery scans full table of 7M rows.
    But The query is not getting slow down.

    ————————————————–
    Here’s my test scenario
    ————————————————–
    CREATE TABLE `c` (
    `cid` bigint(20) NOT NULL,
    `rid` bigint(20) NOT NULL,
    `mid` bigint(20) NOT NULL,
    PRIMARY KEY (`cid`),
    KEY `ix_rid` (`rid`),
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
    –> This table contains 7 Million rows

    CREATE TABLE `r` (
    `rid` bigint(20) DEFAULT NULL
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

    INSERT INTO r VALUES (1), (2), (NULL);

    EXPLAIN select sql_no_cache r.rid
    from r
    where r.rid is null or r.rid not in (select rid from c);

    EXPLAIN select sql_no_cache r.rid
    from r
    where r.rid not in (select rid from c);

    Above two queries’ execution plain is the same like below.
    +—-+——————–+———+—————-+—————+—————+———+——+——+————————————————-+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +—-+——————–+———+—————-+—————+—————+———+——+——+————————————————-+
    | 1 | PRIMARY | r | ALL | NULL | NULL | NULL | NULL | 3 | Using where |
    | 2 | DEPENDENT SUBQUERY | c | index_subquery | ix_rid | ix_rid | 8 | func | 1 | Using index; Using where; Full scan on NULL key |
    +—-+——————–+———+—————-+—————+—————+———+——+——+————————————————-+

    And also performance is the same.
    And MySQL status variables also said “Handler read not so many record” in “Handler_read_rnd_next”
    Do they run “select * from c limit 1” for “NULL NOT IN (subquery)”‘s subquery ?
    What do I wrong?

    Thanks.

    Like

Leave a Reply to Tony Cancel reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s