More on cost-based choice between subquery Materialization and IN->EXISTS

In my previous post, I shared my finding that MySQL 5.6.7 does not make a cost-based choice between Materialization and IN-to-EXISTS strategies for subqueries.

It turns out I was wrong. As Guilhem Bichot has blogged here, he has implemented cost-based choice between Materialization and IN->EXISTS in MySQL 5.6.7. Igor Babaev also wrote about the topic, and covered the reasons I didn’t see the feature – it isn’t mentioned in the documentation, development process at Oracle is quite closed, and the feature didn’t work for a basic example that I have tried.

Let’s try to focus on the technical part of it. Looking at the source code, I see that MySQL and MariaDB’s implementations try to achieve a similar effect, but they are different – neither one is a copy of the other. This makes it interesting to do a comparison.

First, there is the example from my previous blog post:

MySQL [test]> explain select col1, col1 in (select col2 from one_k_rows_tbl) from one_row_tbl;
+----+-------------+----------------+------+---------------+------+---------+------+------+-------+
| id | select_type | table          | type | possible_keys | key  | key_len | ref  | rows | Extra |
+----+-------------+----------------+------+---------------+------+---------+------+------+-------+
|  1 | PRIMARY     | one_row_tbl    | ALL  | NULL          | NULL | NULL    | NULL |    1 | NULL  |
|  2 | SUBQUERY    | one_k_rows_tbl | ALL  | NULL          | NULL | NULL    | NULL | 1000 | NULL  |
+----+-------------+----------------+------+---------------+------+---------+------+------+-------+

Here, the upper select reads only one row, which means that the subquery predicate will need to be evaluated one time. Now, if you need to execute the subquery one time, the question is which of the following is cheaper:

  1. just execute it (IN->EXISTS), or
  2. execute it, and dump its output into a temporary table with a unique key (Materialization)

Apparently, the the first should be cheaper, yet MySQL 5.6.7 chooses the second. I’ve filed BUG#67511 about this, and posted there some ideas about the reason why the poor plan was chosen.

The second difference was found by Igor Babaev through experiment, and I confirmed it by reading the code. Suppose, there is a join, and a subquery:

SELECT *
  FROM Country, City
  WHERE City.Country = Country.Code AND
        Country.SurfaceArea  10 AND
        (Country.Name IN (select Language from CountryLanguage where Percentage > 50) OR
         Country.name LIKE '%Island%');

and query plan is as follows (this is output from MySQL 5.6):

+----+-------------+-----------------+-------+---------------------+------------+---------+-----------------+------+--------------------------+
| id | select_type | table           | type  | possible_keys       | key        | key_len | ref             | rows | Extra                    |
+----+-------------+-----------------+-------+---------------------+------------+---------+-----------------+------+--------------------------+
|  1 | PRIMARY     | Country         | ALL   | PRIMARY,SurfaceArea | NULL       | NULL    | NULL            |  207 | Using where              |
|  1 | PRIMARY     | City            | ref   | Country             | Country    | 3       | j1.Country.Code |    7 | NULL                     |
|  2 | SUBQUERY    | CountryLanguage | range | Percentage,Language | Percentage | 4       | NULL            |  168 | Using where; Using index |
+----+-------------+-----------------+-------+---------------------+------------+---------+-----------------+------+--------------------------+

Here, the subquery is attached to table Country, that is, it is executed 207 times. MariaDB will calculate the number correctly. MySQL on the other hand, will assume the subquery is evaluated as many times as many record combinations are there in the parent select, in this example 207 * 7 = 1449 times. As the above EXPLAIN shows, MySQL 5.6.7 will use Materialization (reasonable choice if you expect 1449 subquery re-executions). MariaDB will use a correct estimate of 207 subquery re-executions and will pick IN->EXISTS. Query plans are already different. The tables in this example are small, so there’s no measurable difference in execution speed. I suppose one could build a bigger example, where the speed difference can be actually observed.

Finally, the 3rd difference is that

  • MariaDB will optimize the subquery before the outer query
  • MySQL will optimize the subquery after the outer query

This doesn’t have any apparent user-visible effects currently, yet Timour Katchaounov has spent substantial amount of time to make MariaDB optimize the subquery first. Why did he do it? I’ll cover in subsequent blog posts.

Leave a Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s