EXPLAIN UPDATE/DELETE/INSERT in MySQL and MariaDB

MySQL 5.6 added support for EXPLAIN INSERT/UPDATE/DELETE. MySQL 5.7 made some improvements to the EXPLAIN code. Also, 5.7 added support for EXPLAIN FOR CONNECTION, which looks very similar to MariaDB’s SHOW EXPLAIN. I was working on putting EXPLAIN INSERT/UPDATE/DELETE into MariaDB 10.0, so I’ve dug up some details and thought I’d share them:

EXPLAIN UPDATE/DELETE

EXPLAIN UPDATE looks like regular EXPLAIN SELECT:

MySQL [dbt3sf1]> explain update orders set o_orderpriority='1-URGENT' where o_orderDATE='1998-07-21';
+----+-------------+--------+-------+---------------+---------------+---------+-------+------+-------------+
| id | select_type | table  | type  | possible_keys | key           | key_len | ref   | rows | Extra       |
+----+-------------+--------+-------+---------------+---------------+---------+-------+------+-------------+
|  1 | SIMPLE      | orders | range | i_o_orderdate | i_o_orderdate | 4       | const |  628 | Using where |
+----+-------------+--------+-------+---------------+---------------+---------+-------+------+-------------+

EXPLAIN PARTITIONS shows partitions. EXPLAIN EXTENDED shows the “filtered” column, but doesn’t produce a warning with the query after rewrites.

In MySQL 5.7, select_type changes from SIMPLE to UPDATE.

MySQL [dbt3sf1]> explain update orders set o_orderpriority='3-MEDIUM' where o_orderDATE='1998-07-21';
+----+-------------+--------+-------+---------------+---------------+---------+-------+------+-------------+
| id | select_type | table  | type  | possible_keys | key           | key_len | ref   | rows | Extra       |
+----+-------------+--------+-------+---------------+---------------+---------+-------+------+-------------+
|  1 | UPDATE      | orders | range | i_o_orderdate | i_o_orderdate | 4       | const |  628 | Using where |
+----+-------------+--------+-------+---------------+---------------+---------+-------+------+-------------+

For multi-table UPDATEs (or DELETEs), select_type=UPDATE (or DELETE) is shown only for the tables that are being updated.

Inside, there are some issues. Best-release-ever still suffers from an old MySQL design where the execution path taken by EXPLAIN is different from the execution path taken by the actual query. If you have the same logic in two places, it goes out of sync in no time, and I have already found one example: EXPLAIN shows “Using join buffer” where the execution doesn’t use it (filed as BUG#70553). A shameless plug: in MariaDB 10.0, we use one execution path, so bugs like this one are not possible.

Another peculiarity is Using temporary. You can run a query like this:

MySQL [dbt3sf1]> explain update orders set o_orderDATE='1998-07-22' where o_orderDATE='1998-07-21';
+----+-------------+--------+-------+---------------+---------------+---------+-------+------+------------------------------+
| id | select_type | table  | type  | possible_keys | key           | key_len | ref   | rows | Extra                        |
+----+-------------+--------+-------+---------------+---------------+---------+-------+------+------------------------------+
|  1 | UPDATE      | orders | range | i_o_orderdate | i_o_orderdate | 4       | const |  628 | Using where; Using temporary |
+----+-------------+--------+-------+---------------+---------------+---------+-------+------+------------------------------+

and see Using temporary. This looks weird: why would this query need a temporary table? Running the UPDATE and checking Created_tmp_tables shows that indeed no temporary tables were created. The reason for “Using temporary” is that the UPDATE modifies columns in the index that is used to find rows. When that happens, the server needs to collect record ids in a buffer before doing any updates; the records are collected in a buffer, not in a temporary table. We in MariaDB decided to denote this with Using buffer instead:

MariaDB [dbt3sf1]> explain update orders set o_orderDATE='1998-07-22' where o_orderDATE='1998-07-21';
+------+-------------+--------+-------+---------------+---------------+---------+------+------+---------------------------+
| id   | select_type | table  | type  | possible_keys | key           | key_len | ref  | rows | Extra                     |
+------+-------------+--------+-------+---------------+---------------+---------+------+------+---------------------------+
|    1 | SIMPLE      | orders | range | i_o_orderdate | i_o_orderdate | 4       | NULL |  628 | Using where; Using buffer |
+------+-------------+--------+-------+---------------+---------------+---------+------+------+---------------------------+
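
One can check the Created_tmp_tables claim mentioned above like this (a sketch; note that in some versions SHOW STATUS itself materializes its result in a temporary table and may bump the counter by one):

```sql
FLUSH STATUS;  -- reset the session status counters
UPDATE orders SET o_orderDATE='1998-07-22' WHERE o_orderDATE='1998-07-21';
-- The rowid buffer is not a temporary table, so the counter should not
-- grow because of the UPDATE itself:
SHOW SESSION STATUS LIKE 'Created_tmp_tables';
```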

EXPLAIN INSERT

This one is weird – INSERT doesn’t have a query plan, so why would one need EXPLAIN for it? One could argue that EXPLAIN INSERT ... SELECT makes sense, as the SELECT part has a query plan. However, by that logic, EXPLAIN CREATE TABLE … SELECT should have been supported as well, but it isn’t.

In MySQL 5.6, EXPLAIN INSERT seems to be totally useless:

MySQL [dbt3sf1]> explain insert into orders (o_custkey) values ('32423');
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra          |
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
|  1 | SIMPLE      | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL | No tables used |
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+

In MySQL 5.7, it shows the table that we insert into:

MySQL [dbt3sf1]> explain insert into orders (o_custkey) values ('32423');
+----+-------------+--------+------+---------------+------+---------+------+------+-------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows | Extra |
+----+-------------+--------+------+---------------+------+---------+------+------+-------+
|  1 | INSERT      | orders | ALL  | NULL          | NULL | NULL    | NULL | NULL | NULL  |
+----+-------------+--------+------+---------------+------+---------+------+------+-------+

“type=ALL” is misleading, because no full table scan will take place. The only scenario when EXPLAIN INSERT can be useful is when one is doing an INSERT into partitioned table. In that case, EXPLAIN PARTITIONS will show the result of partition pruning. Note that partition pruning for INSERTs works in a rather peculiar way – depending on the number of rows inserted, it may choose not to perform partition pruning and will show that all partitions will be used.
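
As an illustration, here is a sketch (the table and partitioning scheme are made up for this example; the exact pruning behavior depends on the server version):

```sql
-- Hypothetical partitioned table, made up for this example
CREATE TABLE orders_p (
  o_orderkey INT,
  o_orderDATE DATE
) PARTITION BY RANGE (YEAR(o_orderDATE)) (
  PARTITION p1997 VALUES LESS THAN (1998),
  PARTITION p1998 VALUES LESS THAN (1999),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- For a single-row INSERT, the "partitions" column should list only p1998;
-- for multi-row INSERTs the server may skip pruning and list all partitions.
EXPLAIN PARTITIONS INSERT INTO orders_p VALUES (1, '1998-07-21');
```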

I think this post is getting long, so I will talk about SHOW EXPLAIN and EXPLAIN FOR CONNECTION in a separate post.

EXPLAIN FORMAT=JSON vs regular EXPLAIN

Valeriy has mentioned that the MySQL manual has no documentation for the EXPLAIN FORMAT=JSON feature. A bug about this has been open since September 2012. I had spent some time looking at EXPLAIN FORMAT=JSON, so I thought I would share my findings.

The first blog post about EXPLAIN FORMAT=JSON was made by Sheeri. She listed these features of FORMAT=JSON:

  • attached_condition and implicit casts
  • used_key_parts field

In addition to those, I was able to find:

  • Pushed index condition information
  • More information about sorting/grouping
  • Subquery attachment points information

Let’s look at these in greater detail:

Pushed index condition information

This is very similar to conditions attached to tables – instead of “Using index condition” one can see the pushed condition itself:

explain format=json select * from Country where Name like 'p%' and Population > 3*1000*1000;
  "query_block": {
    "select_id": 1,
    "table": {
      "table_name": "Country",
      "access_type": "range",
      "possible_keys": [
        "Name",
        "Population"
      ],
      "key": "Name",
      "used_key_parts": [
        "Name"
      ],
      "key_length": "52",
      "rows": 12,
      "filtered": 100,
      "index_condition": "(`world`.`Country`.`Name` like 'p%')",
       "attached_condition": "(`world`.`Country`.`Population` > (((3 * 1000) * 1000)))" 
    }
  }

This is very useful. I wish the condition were printed without the over-quoting of every identifier.

More information about sorting/grouping

The old “Using temporary; Using filesort” was replaced with ordering_operation and grouping_operation nodes. These nodes always have a using_filesort attribute, and may optionally have a using_temporary_table attribute.

explain format=json select * from City order by Population desc limit 1;
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "ordering_operation": {
      "using_filesort": true,
      "table": {
        "table_name": "City",
        "access_type": "ALL",
        "rows": 4249,
        "filtered": 100
      }
    }
  }
}

This provides slightly more information than the old “Using temporary; Using filesort” line. In the following example, one can see that “Using temporary” is used to handle GROUP BY, while the filesort is used to handle ORDER BY:

explain format=json
select sum(population) as city_ppl, Country
from City
group by Country 
order by city_ppl desc;
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "ordering_operation": {
      "using_filesort": true,
      "grouping_operation": {
        "using_temporary_table": true,
        "using_filesort": false,
        "table": {
          "table_name": "City",
          "access_type": "ALL",
          "rows": 4409,
          "filtered": 100
        }
      }
    }
  }
}

It is also possible to encounter ordering_operation with using_filesort:false, which means that the optimizer relies on the query plan to produce rows in the required order.
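
For example, ordering by a column at the start of a usable index should produce such a plan (a sketch against the world database; I have not re-checked the exact JSON shape on every version):

```sql
-- City.ID is the primary key; reading rows in index order satisfies
-- ORDER BY, so no filesort should be needed
EXPLAIN FORMAT=JSON SELECT * FROM City ORDER BY ID LIMIT 10;
-- expected shape: "ordering_operation": { "using_filesort": false, ... }
```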

Subquery attachment points

EXPLAIN FORMAT=JSON shows where the subqueries are attached to queries. In tabular form, you see something like this:

+----+--------------------+-----------+------+---------------+------+---------+------+------+-------------+
| id | select_type        | table     | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+--------------------+-----------+------+---------------+------+---------+------+------+-------------+
|  1 | PRIMARY            | Country   | ALL  | Name          | NULL | NULL    | NULL |  165 | Using where |
|  3 | SUBQUERY           | satellite | ALL  | NULL          | NULL | NULL    | NULL | 1041 | NULL        |
|  2 | DEPENDENT SUBQUERY | City      | ALL  | NULL          | NULL | NULL    | NULL | 4409 | Using where |
+----+--------------------+-----------+------+---------------+------+---------+------+------+-------------+

and it is not apparent how subqueries are attached to queries. For basic queries, this is not a problem – just look at the query. However, if the query has VIEWs or FROM clause subqueries, it may be difficult to recognize the query plan structure. In any case, with EXPLAIN FORMAT=JSON, subqueries look like this:

*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "table": {
      "table_name": "Country",
      "access_type": "range",
      "possible_keys": [
        "Name"
      ],
      "key": "Name",
      "used_key_parts": [
        "Name"
      ],
      "key_length": "52",
      "rows": 2,
      "filtered": 100,
      "index_condition": "(`world`.`Country`.`Name` in ('France','Germany'))"
    },
    "select_list_subqueries": [
      {
        "dependent": true,
        "cacheable": false,
        "query_block": {
          "select_id": 2,
          "table": {
            "table_name": "City",
            "access_type": "ALL",
            "rows": 4409,
            "filtered": 100,
            "attached_condition": "(`world`.`City`.`Country` = `world`.`Country`.`Code`)"
          }
        }
      }
    ]
  }
}

which makes it more apparent where the subquery is attached and how often it is re-executed.

My personal impressions

EXPLAIN FORMAT=JSON looks like a project that was started with good intentions but wasn’t finished. I’ve found a couple of bugs just by playing around (1, 2). A number of things look out of place:

  • “Range checked for each record” still displays “index map: 0xNNNN”, leaving one to do binary arithmetic with index numbers to figure out which indexes will be checked
  • We’ve got used_key_parts. Is there any need to print key_length anymore?
  • index_merge is still described with notation like key=“intersect(key1,key2)”, and key_length=“5,5” (where’s the used_key_parts for it?). This notation was the only way to fit the data into tabular format. With JSON, it would be natural to represent individual index scans as separate JSON objects
  • If a subquery is part of WHERE clause, attached_condition will print the whole subquery. That is, subquery’s WHERE condition is printed both in the parent query and in the subquery. This is redundant and annoying.
  • Items with default values are generally not printed. For example, possible_keys=NULL is not printed. However, using_filesort:false is printed, and so is filtered:100, which means no filtering happened. Why?
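
For the first item, decoding the index map is mechanical; here is a sketch of doing it in SQL itself (assuming bit 0 of the map corresponds to the first index of the table, which is how I read the tabular output):

```sql
-- Sketch: which indexes are in index map 0x13? (0x13 = binary 10011)
SELECT seq AS index_no
FROM (SELECT 1 AS seq UNION ALL SELECT 2 UNION ALL SELECT 3
      UNION ALL SELECT 4 UNION ALL SELECT 5) numbers
WHERE (0x13 >> (seq - 1)) & 1;
-- returns 1, 2 and 5: the first, second and fifth index of the table
```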

Overall, I don’t quite understand what EXPLAIN FORMAT=JSON was targeted at. If EXPLAIN FORMAT=JSON was targeted at humans, I would expect a more user-friendly output. Don’t print redundant information, don’t do redundant quoting, do a better job at indenting the JSON output.
If EXPLAIN FORMAT=JSON was targeted at machines, I would expect it to be fully documented. How can one expect somebody to write programs to process a data format that has no documentation?

MySQL 5.6, InnoDB statistics, and testcases for optimizer

One of the new features in MySQL 5.6 is persistent table statistics. The word “persistent” makes you think that table statistics will be very stable from now on.
This is generally true, but there are some exceptions. There is a particular case where InnoDB statistics may change with no warning, even when there is no user activity on the server. Consider the following pattern (which nearly all bug or feature testcases follow):

CREATE TABLE t1 (...) ENGINE=INNODB;
INSERT INTO t1 VALUES (1),(2),(3),(4),(5);
EXPLAIN SELECT * FROM t1;
EXPLAIN SELECT * FROM t1;

With MySQL 5.6, the following is possible:

  • the first EXPLAIN will show that the optimizer expects table t1 to have 6 rows
  • the second EXPLAIN will show that the optimizer expects table t1 to have 5 rows

For me, this was a big surprise. After all, table t1 never had 6 rows. If there was some rounding which rounded 5 up to 6, why did the number go back down to 5 again?

I debugged it, and discovered that

  1. A multi-row INSERT will increment the #records-in-table statistic once for each row it has inserted
  2. There is also dict_stats_thread() in storage/innobase/dict/dict0stats_bg.cc, which re-calculates table statistics in the background.

And what happens is:

  1. INSERT inserts a record
  2. dict_stats_thread() comes and re-calculates the statistics (the new record is counted)
  3. INSERT increments the statistics by one (because it has inserted a record on step#1)

This way, right after the INSERT has finished, the table statistics will report more records than the table actually has. The number will go back down within a second (because dict_stats_thread() will re-calculate the statistics again). This slight and temporary inaccuracy should not be a problem for production environments. It is a big deal for test cases in the mysql-test suite, though. Test cases are special:

  1. Test datasets are often small. Even an off-by-one difference may cause a different query plan to be chosen. For optimizer bugs, this means that the testcase might be no longer hitting the bug it was made for.
  2. Test cases often include EXPLAIN output in the .result file. EXPLAIN output has a “rows” column, and the mysql-test-run framework will interpret an off-by-one difference as a test failure.

I’ll repeat myself: the above two properties mean that with MySQL 5.6, it is not possible to write optimizer+InnoDB testcases the way we used to write them before. I’ve observed it in practice – testcases will fail sporadically, depending on how quickly your system is running.

But wait, what about Oracle? How did they manage to develop MySQL 5.6 with this? I was really interested, too, so I dug into the revision history. Oracle developers’ approach was to run ANALYZE TABLE after filling the table. It took them 33 patches to put ANALYZE TABLE everywhere it was needed (examples: 1, 2, 3, 4, …) but they did it.

The take-away is: if you’re working on a testcase for an optimizer bug in MySQL 5.6 – put an ANALYZE TABLE statement after all the INSERTs but before the SELECTs. This will make your testcase stable.
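
Applied to the pattern from the beginning of this post, the stable version looks like this (the column definition is made up for the example):

```sql
CREATE TABLE t1 (a INT) ENGINE=INNODB;
INSERT INTO t1 VALUES (1),(2),(3),(4),(5);
ANALYZE TABLE t1;  -- recompute persistent statistics deterministically
EXPLAIN SELECT * FROM t1;  -- "rows" should now stably show 5
```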

For the record, we in MariaDB 10.0 did not take the 33 patches with ANALYZE TABLE statements. For now, we have decided to run mysql-test testsuite with --skip-innodb-stats-persistent option. Judging from the EXPLAIN outputs in the tests, that has achieved the same goal (same query plans), and the goal was achieved with just one patch. 🙂

Slides from Percona Live talks: optimizer tutorial and Cassandra Storage Engine

I’ve put online the slides for the two talks that I co-presented at the Percona Live conference:

The tutorial tries to cover most areas of the optimizer, with focus being on optimization of complex parts of SQL like joins. It also shows how recent additions in MySQL and MariaDB (PERFORMANCE_SCHEMA, SHOW EXPLAIN, EXPLAIN FORMAT=JSON, optimizer_trace) can be useful when diagnosing optimizer problems.

The Cassandra talk starts with an introduction into Cassandra and Cassandra Storage Engine, and then proceeds to discuss the intended use cases for Cassandra Storage Engine. This is an important discussion, because there is a variety of reasons one might want to access Cassandra’s data from SQL, and Cassandra Storage Engine addresses only certain kinds of needs.

No, Sheeri, MySQL 5.6 does not optimize subqueries away

Sheeri wrote a blog post that claims that “IN Subqueries in MySQL 5.6 Are Optimized Away” and uses that as a basis to conclude that subquery optimizations in MySQL 5.6 are superior to MariaDB’s.
The claim is incorrect (and so is the conclusion). As a person who has coded both of the mentioned optimizations, FirstMatch and semi-join materialization, I think I need to write about this.

Sheeri wrote:

  1. “So MariaDB recognizes the subquery and optimizes it. But it is still optimized as a subquery”
  2. “In MySQL 5.6, the subquery is actually optimized away”

The first statement is somewhat true. The second one is not. I’ll try to explain. The example subquery Sheeri used was:

SELECT title FROM film WHERE film_id IN (SELECT film_id FROM film_actor)

Its meaning is “find films that have actors in table film_actor”. It is not possible to “optimize away” the subquery here – no more than it is possible to take the expression “2*3” and optimize away the “*3” part of it. The subquery affects the result of the whole query. You can’t remove it.

What the optimizer (both in MariaDB and in MySQL 5.6) does here is convert the subquery into a semi-join. A semi-join is basically a “subquery in the WHERE clause” (read the link for more details), and it gives the optimizer more possible choices.
The semi-join can be seen in EXPLAIN EXTENDED output. In both MariaDB and MySQL, one can see:

... select `film`.`title` AS `title`
    from `sakila`.`film` semi join (`sakila`.`film_actor`) ...
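
To see this yourself, run EXPLAIN EXTENDED and then SHOW WARNINGS; the rewritten query appears as a Note (a sketch against the sakila database):

```sql
EXPLAIN EXTENDED
SELECT title FROM film WHERE film_id IN (SELECT film_id FROM film_actor);
SHOW WARNINGS;
-- the Note should contain:
--   ... from `sakila`.`film` semi join (`sakila`.`film_actor`) ...
```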

But what about the different query plans? They do not show superiority of one optimizer over the other. As indicated in the documentation, MariaDB supports the FirstMatch strategy that was chosen by MySQL 5.6. Likewise, MySQL 5.6 supports the semi-join materialization strategy that was picked by MariaDB. I suspect different query plans were chosen because MariaDB and MySQL use different cost formulas. It is not possible to tell whose formulas are better, because both query plans finish in 0.01 seconds, and the tables are very small.

Which means, this example doesn’t allow one to conclude that MySQL’s subquery optimizations are superior to MariaDB’s (or vice versa). QED.

Addendum. The query plan used by MySQL 5.6 will read *exactly* the same rows from the tables as MySQL 5.1/5.5 (which have no subquery optimizations) would read. The best one can deduce here is that MySQL 5.6’s subquery optimizations are not making things worse.

New optimization in MariaDB 10.0: EXISTS-to-IN subquery rewrite

MariaDB 10.0 has got another new feature: Sanja Byelkin has pushed the EXISTS-to-IN rewrite feature. It is an optimization targeted at EXISTS subqueries. The idea behind it is simple. EXISTS subqueries often have the form:

EXISTS (SELECT ...  FROM ... WHERE outer_col=inner_col AND inner_where)

where outer_col=inner_col is the only place where the subquery has references to outside. In this case, the subquery can be converted into an uncorrelated IN:

outer_col IN (SELECT inner_col FROM ... WHERE inner_where)
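
For instance (the person/book schema here is hypothetical; a variant of it appears again below):

```sql
-- Correlated EXISTS form:
SELECT * FROM person
WHERE EXISTS (SELECT 1 FROM book
              WHERE book.author = person.name AND book.year > 2000);

-- After the EXISTS-to-IN rewrite, equivalent to the uncorrelated:
SELECT * FROM person
WHERE person.name IN (SELECT author FROM book WHERE book.year > 2000);
```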

The conversion opens new opportunities for the optimizer. A correlated EXISTS subquery has only one execution strategy. An uncorrelated IN subquery has two:

  1. re-run the subquery every time the predicate is evaluated (the same as with EXISTS)
  2. Materialize the subquery output into a temporary table. Then, evaluation of the subquery predicate is reduced to making a lookup in that temporary table

The optimizer is able to make a cost-based choice whether to do #1 or #2. But wait, there is more: if the subquery is at top level of the WHERE clause, like

SELECT * FROM person WHERE EXISTS(SELECT... FROM book WHERE book.author=person.name) AND ...

then the subquery is converted into a semi-join, which has even more execution strategies, including those that will use the subquery to “drive” access to tables in the parent select. In the above example, EXISTS-to-IN allows the optimizer to first access the books, and then for each book access its author. When your database has a few books and lots of people, this execution order can give a big speedup.

Now, a little context. The EXISTS->IN rewrite is useful, but it is not a revolutionary breakthrough. PostgreSQL has a similar optimization. They introduced it in release 8.4.1, which happened in July 2009. IIRC, MySQL/Sun also had a task for adding an EXISTS->IN rewrite, also around 2009. I don’t know what the fate of that task was. It is definitely not in MySQL 5.6, and Google doesn’t find the worklog entry – I suppose because it has been made private. MariaDB’s implementation was developed from scratch here at Monty Program Ab. Now that we have it in MariaDB 10.0, I’m wondering which other important subquery cases we have not yet covered.

Looking at MySQL 5.6's optimizer: EXPLAIN UPDATE

MySQL 5.6 adds support for EXPLAIN UPDATE. This is a useful feature, so we want to have it in MariaDB 10.0, too. Besides that, MariaDB 10.0 has the SHOW EXPLAIN feature, and we want it to work for UPDATE commands, too.

Now, a bit of code history. Why didn’t MySQL have EXPLAIN UPDATE from the start, like other database systems? To the uninformed, the lack of EXPLAIN UPDATE looks like simple laziness. After all, everyone who has read a database textbook can imagine that the code should have this form:

run_update_query(UpdateQuery q) {
   QueryPlan qp= optimize_query(q);
   run_query_plan(qp);
}

and adding EXPLAIN UPDATE is a matter of adding another function:

run_explain_update(UpdateQuery q) {
   QueryPlan qp= optimize_query(q);
   print_update_query_plan(qp);
}

print_update_query_plan(QueryPlan qp)
{
  // print the plan for UPDATE.
}

Seems like a task for an intern. The problem is that MySQL’s code is not structured this way. There is no point in time where all the decisions about how to run the UPDATE command have been made and stored in some data structure, while query execution hasn’t started yet. The code basically is structured like this:

mysql_update()
{
  typename part_of_query_plan;

  ...do something...
  part_of_query_plan=...;
    if (all done)
      return;
  ...do something ...

  if (...)
  {
    typename another_part_of_query_plan;
    ...do something else...
    another_part_of_query_plan= ...
  }

  typename yet_another_part_of_query_plan= ...;
  ...
}

It is not trivial to pull all of the query plan choices out of this. Oracle’s optimizer team had two possible options:

  1. Re-write UPDATE handling code to use the textbook approach.
  2. Keep the current code structure, and inject “if (running_explain) {...}” at many locations.

#1 would be a minor revolution. It would introduce new code that is run for every UPDATE query. New code may have bugs. It may cause query plans to change, and not always for the better.
#2 is conservative. It would keep the old structure in place, and would require less work. The result won’t be very impressive, though – there will be a single piece of code that handles both UPDATE and EXPLAIN UPDATE, with lots of “if (running_explain) {...}” all over it.

I guess the choice depends on your deadlines, what other changes are in flight, etc. Oracle’s team chose to do #2. However, when I tried playing with it, I found

  • a query plan that has changed since 5.5 (BUG#67638)
  • a wrong query plan – EXPLAIN doesn’t match the execution (BUG#67637)

I’m not sure if BUG#67638 is a problem. Maybe, it is expected because of the changes in the cost model. However, if the change was expected anyway, why did they choose to use the conservative solution for EXPLAIN UPDATE? And if they did choose a conservative solution for EXPLAIN UPDATE, why do we still get bugs like BUG#67637?

These questions are not just a matter of curiosity. We at MariaDB need to apply the patch and make it work with SHOW EXPLAIN. Do we wait for Oracle to fix the above bugs, or fix them ourselves? Do we stick to their EXPLAIN UPDATE implementation (and keep applying their fixes), or forget it and roll our own? Decisions, decisions…