More MySQL 6.0 news: next subquery optimizations WL pushed

Three days ago I’ve finally managed to push the code for WL#3985 “Subquery optimization: smart choice between semi-join and materialization” into MySQL 6.0. I missed the clone-off date so it won’t be in the upcoming MySQL 6.0.9 release, the only way to get it before the next 6.0.10 release is from the lp:mysql-server/6.0 bazaar repository.

What’s new in the push

Before WL#3985, 6.0’s subquery optimization had these three deficiencies:

  1. For semi-join (see cheatsheet for definition) subqueries, you had to make a choice between having the optimizer use materialization or all other strategies. The default behavior was not to use materialization, you could only get it by setting a server variable to disable all other strategies.
  2. The choice among other strategies (FirstMatch, DuplicateWeedout, LooseScan) wasn’t very intelligent – roughly speaking, the optimizer would first pick a join order as if there were only inner joins, and then remember that some of them are actually semi-joins and try to find how it can resolve semi-joins with the picked join order.
  3. Materialization only worked in the outer-to-inner fashion, that is, if you got a query like
    select * from people where name in (select owner from aircraft) it would still scan the people and make lookups into a temporary table of aircraft owners. It was not possible to make it scan the temptable of aircraft owners and make lookups into people.

WL#3985 fully addresses #1 and #2, and partially addresses #3. That is, now

  • Semi-join subqueries can use Materialization in an inner-to-outer fashion
  • Join optimizer is aware of existence of semi-joins and makes a fully automatic, cost-based choice between FirstMatch, DuplicateWeedout, LooseScan, inner-to-outer and outer-to-inner variants of Materialization.

This is expected to be a considerable improvement. The most common class of subqueries,
SELECT ... WHERE expr IN (SELECT ... w/o GROUPing/UNIONs/etc) AND ...
is now covered by a reasonably complete set of execution strategies and the optimizer is expected to have the capability to choose a good strategy for every case.

Possible gotchas, and we’re looking for input

I can’t state that the subquery optimizer does have the capability to pick a good plan because we haven’t done much experiments with the subquery cost model yet. We intend to do some benchmarking, but will also very much appreciate any input on how does the subquery optimizer behave for real-world queries. The code should be reasonably stable now – there are only three known problems, all of which are not very popular edge cases:

  • LEFT JOINs. You may get wrong query results when the subquery or parent subquery use left joins.
  • “Grandparent” correlation. A query with a semi-join child subquery which has a semi-join grandchild subquery which refers to a column in the top-level select may produce wrong query plans/results under certain circumstances.
  • Different datatypes. You may get wrong query results of queries that have col1 IN (SELECT col2) where col1 and col2 are of different types (which should not happen too often in practice)

If you have subqueries with LEFT JOINs, please let us know also, because so far all LEFT JOIN+subquery cases we have were generated by the random query generator, certain properties of MySQL codebase make it difficult to make outer joins work with semi-joins, and if we don’t get any real-world LEFT JOIN examples, chances are we will disable subquery optimizations if there’s LEFT JOIN in the parent select, or in the subquery, or in either case.

Slides: New subquery optimizations in MySQL 6.0 (new revision)

I’m now at MySQL Conference and have just finished my New Subquery Optimizations in MySQL 6.0 talk. If you’re eager to get the slides, here is the link to the pdf file and a mirror.

The talk has the same name as this MySQL University session but the slides have been thoroughly re-worked, and also there is new content:

  • More detailed explanation of subquery handling in 4.1/5.x
  • More reliable/justified benchmark numbers
  • Observations about subquery demographics
  • An attempt at comparative analysis of how MySQL’s subquery strategies compare to PostgreSQL’s
    • And a slide #28 that explains why we’re looking at PostgreSQL all the time

Have a nice viewing. And, if you get subqueries that should be fast but aren’t, we’re now actively looking at every subquery performance bug we get and will appreciate any input.

Correlated semi-join subqueries and PostgreSQL

The work on subquery optimizations goes on, and we’re already curious how the assortment of new 6.0 subquery optimizations compares to what is found in other DBMSes. MySQL’s new strategies are not DBMS theory breakthroughs, similar ideas have been implemented in other DBMSes, but when you take this list of strategies and put them into this kind of optimizer, the result has no direct equivalents in any other DBMS (or at least we believe so).

The first thing we did was to take a look at PostgreSQL as it is easily available and seems to have at least decent subquery handling (or even better than decent, I have not heard much complaints). And the first interesting difference was handling of correlated subqueries. With exception of materialization, MySQL’s new subquery strategies do not care if the subquery is correlated or not.

For example, let’s look at table pullout: First let’s try an uncorrelated subquery:

mysql> explain select * from tbl1 where col1 in (select primary_key from tbl2 where col2  show warnings\G
...
Message: select `test2`.`tbl1`.`col1` AS `col1`,`test2`.`tbl1`.`col2` AS `col2` from (`test2`.`tbl2`) join `test2`.`tbl1` 
where ((`test2`.`tbl1`.`col1` = `test2`.`tbl2`.`primary_key`) and (`test2`.`tbl2`.`col2` < 20))

and we see that it has been converted into an inner join (indicated by the part marked in red). Now let’s make the subquery correlated:

mysql> explain select * from tbl1 where col1 in (select primary_key from tbl2 where
    -> col2  show warnings\G
...
Message: select `test2`.`tbl1`.`col1` AS `col1`,`test2`.`tbl1`.`col2` AS `col2` from (`test2`.`tbl2`) join `test2`.`tbl1`
where ((`test2`.`tbl2`.`primary_key` = `test2`.`tbl1`.`col1`) and (`test2`.`tbl2`.`col2` = `test2`.`tbl1`.`col2`) and
 (`test2`.`tbl1`.`col2` < 20))

The query execution plans are different (because of the extra equality) but we see that subquery was converted into join in both cases.

Now let’s try PostgreSQL: first, uncorrelated subquery:

psergey=# explain select * from tbl1 where col1 in (select primary_key from tbl2 where col2   Index Scan using tbl1_col1 on tbl1  (cost=0.00..56.66 rows=600 width=8)
   ->  Index Scan using tbl2_pkey on tbl2  (cost=0.00..503.25 rows=12 width=4)
         Filter: (col2 < 20)

Ok, a decent plan. Now, let’s make the subquery correlated:

psergey=# explain select * from tbl1 where col1 in (select primary_key from tbl2 where
psergey(# col2   Seq Scan on tbl2  (cost=0.00..200.00 rows=1 width=4)
           Filter: ((col2 < 20) AND (col2 = $0))

and we can see that PostgreSQL starts using the “naive” approach, where the subquery predicate is evaluated using a direct, straightforward fashion and is not used for any optimizations.

So the first impression is that MySQL kinda wins this round. But wait, if you look at user subquery case analysis here, you’ll see that the vast majority of semi-join subqueries are uncorrelated. I wonder if PG folks at some point in the past have made a conscious decision not to bother optimizing correlated subqueries.

For uncorrelated subqueries, it’s different. Our first impression is that PostgreSQL has a powerful tool – hash join and its subquery variants, which it uses to get decent performance in wide variety of cases. MySQL does not have hash join (Hi Timour? Kaamos?), but the new subquery optimizations can have kinda-similar behavior. It is not clear yet whether they’ll be a good subquery hash join substitute. I intend to post the analysis here once I figure it out. Stay tuned.

Slides: New subquery optimizations in MySQL 6.0

A bunch of (hopefully) self-explanatory slides about new subquery optimizations in MySQL 6.0 is available here (UPDATE: here’s a working link). The slides are from this MySQL University session, so there was an audio stream but there were some difficulties with it and it is not available now.

If you miss the audio, the next opportunity to hear about this topic is the New Subquery Optimizations in MySQL 6.0 session I’m giving at the MySQL User Conference on April, 15th.

If you don’t miss the audio, please check out the mentioned session anyway – we already have more content than there are in the slides, and I expect to get even more till the MySQL UC.

 

Observations about subquery use cases

As I wrote earlier, we (me and Ranger) have done an assessment of the impact of new 6.0 subquery optimizations. First, we’ve done a search for performance issues in MySQL’s bug database. The raw list is here, and the here is a summary chart:

Search for customer issues in MySQL’s issue tracker system has produced a similar picture (raw list):

Search in DBT-{1,2,3,4} open source benchmark suites produced a radically different result though (raw data is here):

The samples are small but I think one can already conclude that the benchmarks and user cases have different kinds of subqueries. The cause of the discrepancy is not clear. Are people’s applications different from what the benchmarks simulate? If yes, what is the benchmark that simulates the apps our users are running?

A dumb search for subqueries in random open source applications using grep didn’t produce much. Again, no idea why, either I was looking at the wrong applications, or it could be that applications that use subqueries do not have SQL statements embedded in their code so one can’t find them with grep, or something else.

So far the outcome is that it’s nice that the new optimizations [will] capture a good fraction of real world cases, but there is no benchmark we could run to check or demonstrate this. Well, I’d still prefer this situation over the the opposite.

Subquery optimizations in MySQL 6.0

I’ve finally found time to put together a couple of wiki pages describing what work we’re doing for subquery optimizations in MySQL 6.0:

  • The Subquery_Works page has an overview of what we’ve done so far, we’re doing now and our future plans.
  • The 6.0 Subquery Optimization Cheatsheet has a short, easy-to-read description of what has been released in MySQL 6.0.3. That should be enough if you just want to get the new version and see if your subquery is/can be faster now or not. If you want to know more details, the nearest occasion is my MySQL University session on February, 28.
  • I’ve done a quick preliminary assessment of the impact of new optimizations. The raw results and first conclusions are here on the 6.0 Subquery Optimization Benchmarks page. We will be doing more analysis and I intend to post more observations based on the already collected data, stay tuned.

That’s everything I have on subqueries so far, but as I’ve already mentioned more is on the way.

Subqueries: the new strategy for "NULL IN (SELECT …)"

I hope this is my last post about this topic. It seems we’ve resolved all of the issues and I’ll now describe the user-visible consequences.
To recall, we’re talking about subquery predicates in form

 (oe1, ..., oeN) [NOT] IN (SELECT ie1, ..., ieN FROM ... )

that are located in a context where it matters if the predicate’s result is NULL or FALSE. The name “oe” stands for “outer expression”, ie stands for “inner expression”.

MySQL evaluates queries “from outside to inside”, i.e. we first get the values of (oe1, .. oeN) and then we run the subquery and capture the rows it produces. An apparent and very useful optimization is to “inform” the subquery that we’re looking only for rows that have “ie1=oe1”, “ie2=oe2” and so on. This is done by injecting appropriate equalities into subquery’s WHERE (or HAVING) clause. That is,

 (oe1, ..., oeN) [NOT] IN (SELECT ie1, ..., ieN FROM ... )

becomes

  EXISTS (SELECT 1 /* ie1, ..., ieN */
           FROM ... WHERE subquery_where AND
                          oe1=ie1 AND
                          oe2=ie2 AND
                          ...
                          oeN=ieN))

However, this conversion is only valid if we ignore possible NULL values. If some of the iek can be NULL, then we need to use oek=iek OR iek IS NULL instead. I’ve covered this case in detail here in NULL problem in the right part section.

Correct handling of cases where some oek IS NULL requires more radical changes. We’ve just made those changes and here they are:

The new strategy

According to SQL’s interpretation of NULL as “unknown value”,

  NULL IN (non-empty-list-of-some-values) = NULL
  NULL IN () = FALSE

So, when we want to evaluate

  NULL IN (SELECT ie FROM ... )

we need to run the SELECT and see if it will produce any rows. Note that we need to run the original SELECT, without any injected equalities mentioned in the previous section.

On the other hand, it is absolutely essential to have

  not_null_oe IN (SELECT ie FROM ...)

converted to

  EXISTS (SELECT 1 /* ie1 */ FROM ... WHERE ie1=not_null_oe ...)

If we don’t do this, subqueries will be terribly slow. We’ve solved this “inject or not inject” dilemma by wrapping the injected conditions into triggers. A subquery

  (oe1, ..., oeN) [NOT] IN (SELECT ie1, ..., ieN FROM ... )

is converted into

  EXISTS (SELECT 1 /* ie1, ..., ieN */
           FROM ... WHERE subquery_where AND
                          trigcond(oe1=ie1) AND
                          trigcond(oeN=ieN) AND
                          ...
         )

where each trigcond(X) is a special “magic” function defined as:

  trigcond(X) := X    when the "linked" outer expression oe_i is not NULL
  trigcond(X) := TRUE when the "linked" outer expression oe_i is NULL

Equalities that are wrapped into trigcond() function are not first class predicates for the query optimizer. Most optimizations cannot deal with predicates that may be turned on and off at query execution time, so they assume any trigcond(X) to be unknown function and ignore it. At the moment, triggerered equalities can be used by those optimizations:

  1. Ref-optimizer: trigcond(X=Y [OR Y IS NULL]) can be used to construct ref, eq_ref or ref_or_null table accesses.
  2. Index lookup-based subquery execution engines: trigcond(X=Y) can be used to construct unique_subquery or index_subquery access.
  3. Table condition generator: if the subquery is a join of several tables, triggered condition will be checked as soon as possible.

When the optimizer uses triggered condition to create some kind of index lookup-based access (#1 and #2 in the above list), it must have a strategy for the case when the condition is turned off. This “Plan B” strategy is always the same – do a full table scan. In EXPLAIN the plan B shows up as “Full scan on NULL key” in the “Extra” column:

mysql> explain select t1.col1, t1.col1 IN (select t2.key1 from t2 where t2.col2=t1.col2) from t1
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: t1
        ...
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: t2
         type: index_subquery
possible_keys: key1
          key: key1
      key_len: 5
          ref: func
         rows: 2
        Extra: Using where; Full scan on NULL key

And if you run EXPLAIN EXTENDED …; SHOW WARNINGS you can see the triggered condition:

*************************** 2. row ***************************
  Level: Note
   Code: 1003
Message: select `test`.`t1`.`col1` AS `col1`,<in_optimizer>(`test`.`t1`.`col1`,<exists>(<
index_lookup>(<cache>(`test`.`t1`.`col1`) in t2 on key1 checking NULL where (`test`.`t2`.
`col2` = `test`.`t1`.`col2`) having trigcond(<is_not_null_test>(`test`.`t2`.`key1`))))) AS
`t1.col1 IN (select t2.key1 from t2 where t2.col2=t1.col2)` from `test`.`t1`

Performance implications

The first apparent implication is that NULL IN (SELECT …) now may cause full table scans (slow!) where it previously did not. This is the price to pay for correct results.
For multi-table subqueries the execution of NULL IN (SELECT …) is going to be particularly slow because the join optimizer doesn’t optimize for the case when outer expression is NULL. It assumes that subquery evaluations with NULL on the left side are very rare, even if there is statistics that says otherwise

On the other hand, if you have left expression that may be NULL but actually never is, you will not lose any speed.

The practical hints are

  • A column must be declared as NOT NULL if it really is. This is important for the other parts of the query optimizer too.
  • If you don’t really need the correct NULL/FALSE answer, you can easily avoid the slow execution path: just replace
         oe IN (SELECT ie FROM ...)
    

    with

         (oe IS NOT NULL) AND (oe IN (SELECT ie FROM ...))
    

    and NULL IN (SELECT …) will never be evaluated because MySQL stops evaluating AND parts as soon as the answer is clear.

The goal of this new strategy was to improve compliance and not speed. However we’ve had an intent to not make anything unneccessarily slow. If something became slower for you please file a bug, perhaps we’ll be able to do something about it.

Subqueries: NULLs and IN/=ANY problem fixed

A while ago I wrote about problem with NULLs and IN/=ANY subqueries MySQL had. I was completely correct when I wrote that the fix won’t be simple. It took 3 bug entries (BUG#8804, BUG#24085, BUG#24127), several review iterations by Igor, and the final patch is around 2,300 lines long.

The good news is that this patch solves the problem completely, and it is already in the 5.0.36 tree. The documentation is not yet updated, doing that is now on my todo. There is quite a lot to document: we’ve had to introduce “switchable access methods”, where a table accessed using ref is sometimes accessed using full table scan. (for the impatient: no, the new access method is not not called ref_or_null_or_all :-), it is still ref[_or_null] but with “Full scan on NULL key” in the “Extra” column).

I’ll post here when the docs become available.

IN/=ANY Subqueries NULL woes

NULL values are suprisingly hazardous for MySQL’s way of optimizing IN subqueries. MySQL takes the ‘=’ from ‘=ANY’ and tries to push it down into the subquery, essentially changing IN subquery into EXISTS subquery. That is, a subquery like

outer_expr IN (SELECT inner_expr FROM ... WHERE subq_where)

is converted to:

EXISTS (SELECT 1 FROM ... WHERE subq_where AND outer_expr=inner_expr)

and then MySQL can use the pushed-down equality to limit number of rows it has to look through when running the subquery. This “pushdown” works as long as outer_expr and inner_expr can’t be NULL or you don’t care whether subquery result is NULL or FALSE (MySQL figures that you don’t care if the subquery is a part of OR or AND expression in the WHERE clause).When you start caring about producing correct NULL or FALSE results, problems start coming from left and right, literally:

NULL problem in the right part

Suppose outer_expr is not a NULL value but the subquery doesn’t produce a row
such that outer_expr=inner_expr.
Then “non_null_outer_expr IN (SELECT …)” evaluates to

  • FALSE, if the SELECT produces only non-null values, or produces nothing
  • NULL, if the SELECT produces (among others) a row with inner_expr IS NULL.

The “look for rows with inner_expr=outer_expr” approach is not valid anymore. You have to use a more complicated “look for rows with inner_expr=outer_expr, and if nothing is found, look for rows with inner_expr IS NULL”. Roughly speaking, the subquery can be converted to:

EXISTS (SELECT 1 FROM ... WHERE subq_where AND
(outer_expr=inner_expr OR
inner_expr IS NULL))

The need to evaluate this is the reason why MySQL has ‘ref_or_null’ access method:

mysql> EXPLAIN
-> SELECT outer_expr IN (SELECT t2.maybe_null_key
->                       FROM t2, t3 WHERE ...)
-> FROM t1;
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: t1
...
*************************** 2. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: t2
type: ref_or_null
possible_keys: maybe_null_key
key: maybe_null_key
key_len: 5
ref: func
rows: 2
Extra: Using where; Using index
...

Subqery-specific access methods, unique_subquery and index_subquery, too have or-null variants. They are not visible in EXPLAIN output, you need to use EXPLAIN EXTENDED and look through the produced warnings:

mysql> EXPLAIN EXTENDED
-> SELECT outer_expr IN (SELECT maybe_null_key FROM t2) FROM t1;
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: t1
...
*************************** 2. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: t2
type: index_subquery
possible_keys: maybe_null_key
key: maybe_null_key
key_len: 5
ref: func
rows: 2
Extra: Using index
2 rows in set, 1 warning (0.00 sec)

mysql> show warnings;
*************************** 1. row ***************************
Level: Note
Code: 1003
Message: select (`test`.`t1`.`outer_expr`,
(((`test`.`t1`.`outer_expr`) in t2 on
maybe_null_key checking NULL))) AS `outer_expr IN (SELECT
maybe_null_key FROM t2)` from `test`.`t1`

The “OR … IS NULL” makes things slightly more complicated (and some optimizations within the subquery become impossible), but generally this is tolerable.

NULL hazard in the left part

The situation is much worse when you get NULL for outer_expr. According to NULL handling rules, NULL IN (SELECT inner_expr …) should evaluate to

  • FALSE, if the SELECT produces no rows.
  • NULL, if the SELECT produces any rows.

We’ll need to be able to check if the SELECT has produced any rows at all, so we can’t add “outer_expr = inner_expr” into the subquery. This is a problem, because lots of real world subqueries become very slow if we don’t push the equality down. At the moment, MySQL’s choice is to be fast rather than correct, and you may get FALSE where you should have gotten NULL. There seems to be no easy solution – essentially you need to have two different ways to execute the subquery depending on what is the value of left_expr is.

Me, Igor and Evgen are working on a fix. Hopefully I’ll be able post a resolution update soon.