Subquery optimizations in MySQL 6.0

I’ve finally found time to put together a couple of wiki pages describing what work we’re doing for subquery optimizations in MySQL 6.0:

  • The Subquery_Works page has an overview of what we’ve done so far, we’re doing now and our future plans.
  • The 6.0 Subquery Optimization Cheatsheet has a short, easy-to-read description of what has been released in MySQL 6.0.3. That should be enough if you just want to get the new version and see if your subquery is/can be faster now or not. If you want to know more details, the nearest occasion is my MySQL University session on February, 28.
  • I’ve done a quick preliminary assessment of the impact of new optimizations. The raw results and first conclusions are here on the 6.0 Subquery Optimization Benchmarks page. We will be doing more analysis and I intend to post more observations based on the already collected data, stay tuned.

That’s everything I have on subqueries so far, but as I’ve already mentioned more is on the way.

Pluggable storage engine interface needs to support table name resolution hooks

I’ve started some attempts at coding ha_trace storage engine I’ve mentioned earlier. One of the first things that became apparent was that I needed a way to put a hook into table name resolution code so I can wrap tables into ha_trace objects.

The need to intercept table name resolution and do something other than looking at the .frm files is not unique to call trace engine:

  • Remote storage engines would benefit also:
    • NDB has a whole chunk of code that ships .frm files from one connected mysqld instance to another. It doesn’t hook into name resolution; it ships table definitions proactively, which could be nuisance if you use a new mysqld node to just connect and run a few queries
    • Federated relies on the user to issue CREATE TABLE statement that matches remote table’s DDL. There are no checks that table defintions match, you can do “interesting” things like omitting columns or having index definitions that don’t match the indexes on the remote table. I think this has little or no value, I’d more appreciate an approach where I provide remote server/database/table name and the server figures out the DDL of the remote table itself.
  • Storage engines that have their own data dictionary could relieve MySQL from the need to use .frm files. I remember a discussion about somebody who had lots (tens of thousands iirc) of tables and was running into issues because of the need to open all those .frm files.

I think the above justifies adding name resolution hooks into storage engine interface. I don’t have enough time/cause to implement it, my intent is to do tracing, not name resolution. If you think you’ll also benefit from the name resolution hook, please post a note here or at, perhaps we could collect enough voices to get it implemented.

Bugfix: DATE vs DATETIME comparisons are now sargable again

We (me coding, Igor and Evgen reviewing) have fixed BUG#32198. According to Support/Bugs team, the bug was causing pain to a number of people, so I thought it deserved a post. The problem was as follows:

  • Before MySQL 5.0.42, comparisons in form
      date_column CMP datetime_const

    were interpreted as comparisons of DATEs (CMP is one of =, <, > , <=, or >=). The time part of datetime_const was ignored.

  • In 5.0.42 we’ve fixed it to perform in a way that is closer to the SQL standard: the comparisons are now resolved by comparing values as DATETIMEs. Our fault was that “date_column CMP datetime_const” stopped being sargable. It turned out there quite a few people who had queries like
      SELECT ... WHERE date_column < NOW() ... 

    and those queries became awfully slow.

  • In 5.0.54, We’ve fixed BUG#32198 and made “date_column CMP datetime_const” sargable again. Apologies to everyone who was affected.

What keeps me concerned is that both old and new interpretation of DATE/DATETIME comparisons are not compatible with PostgreSQL. PostgreSQL does something strange: comparison of DATE column with DATETIME constant is interpreted as DATETIME comparison, while comparison with NOW() seems to be interpreted as comparison of DATEs:

MySQL PostgreSQL
mysql> select now();
| now()               |
| 2007-12-16 18:24:12 |
1 row in set (0.01 sec)
test=# select now();
 2007-12-16 18:23:47.197907+03
(1 row)


mysql> select a from tbl where date_col=now();
Empty set (0.02 sec)
test=# select a from tbl where date_col=now();
(0 rows)


mysql> select a from tbl
    -> where date_col='2007-12-16 18:24:12';
Empty set (0.00 sec)
test=# select a from tbl
test-# where date_col='2007-12-16 18:24:12';
(1 row)

bummer. Why is explicitly specified DATETIME value handled differently from NOW()?

I don’t know what is the logic behind this (or is this a bug in PostgreSQL?) If you do, please let me know.

In my SQL code, I think I’m going to play it safe and always explicitly cast to either DATE or DATETIME.

Slides and audio: How MySQL Handles ORDER BY, GROUP BY, and DISTINCT

I haven’t posted anything for a while, but that doesn’t mean that nothing was happening around MySQL Optimizer. In fact, exactly the reverse was true – so many thing were happening that I didn’t have time to blog.

I’ll try to make up for the lost coverage. Let’s start from smaller things. On Nov, 1 I’ve had a MySQL University session titled How MySQL Handles ORDER BY, GROUP BY, and DISTINCT.

The session page has slides and audio. If you’re ready to spend 1 hour on MySQL’s ways of doing ordering, you can listen to the audio. Otherwise, you can just scroll through the slides. They are reasonably self-contained and explain MySQL’s ORDER BY and GROUP BY handling in easy-to-digest colorful pictures.

How MySQL executes ORDER BY

In last couple of weeks there has been a tide of ORDER/GROUP BY-related optimization bugs, where I was the fixer or the reviewer. This wasn’t an easy job because there is no sane description of how GROUP BY/ORDER BY handling is supposed to work.

To figure it out, I had to write an explanation of how it works. The first part is about ORDER BY. Hopefully there will be subsequent parts that will show how GROUP BY is related to ORDER BY and how it works.

Available means to produce ordered sequences

MySQL has two methods to produce ordered streams.

The first is to use a “range“, “ref” or “index” access method over an ordered index. For versions up to 5.1, those access methods naturally return records in the index order, so we get ordering for free (the exception is NDB engine which needs to do merge-sort when it gets data from several storage nodes). In MySQL 5.2, MyISAM and InnoDB have MultiRangeRead optimization which breaks the ordering. We have a draft of how to make it preserve the ordering, but at the moment MRR is simply disabled whenever ordering is required.

The second, catch-all method, is to use the filesort algorithm. In a nutshell, filesort() does quicksort on chunks of data that fit into its memory and then uses mergesort approach to merge the chunks. The amount of memory available to filesort() is controlled by @@sort_buffer_size variable. if the sorted data doesn’t fit into memory (i.e. there is more than one chunk), filesort uses a temporary file to store the chunks.

Source data for filesort() always comes from one table. If there is a need to sort data from several tables, MySQL will first collect the data into a temporary table and then invoke filesort() for that temporary table. I don’t know the true reason for this. Codewise, filesort() wants to pull its source data using something like source.get_next_record() function, while join and union runtime produce their using result.put_next_record()-type calls, so maybe the temporary table is there only to resolve this push/pull mismatch and will go away once we get decent cursors.

filesort() has two modes of operation:

  1. Mode 1: the sorted elements contain all required columns of the source table. The result of the sorting is a linear sequence of output tuples, there is no need to access the source table after the sort is done.
  2. Mode 2: sort <sort_key, rowid> pairs and produce a sequence of rowids which one can use to get source table’s rows in the required order (but this will be essentially hit the table in random order and is not very fast)

Mode 1 is used whenever possible. Mode is used when mode1 is not applicable. This is the case when the sorted tuples have blobs or variable-length columns (TODO: check w/ evgen). Unfortunately, the EXPLAIN output provides no clue about which mode is used, so you’ll have to manually check for blobs in the output columns list.

Executing join to produce ordered stream

At the moment MySQL has three ways to run a join and produce ordered output:

Method EXPLAIN output
Use index-based access method that produces ordered output no mention of filesort
Use filesort() on 1st non-constant table “Using filesort” in the first row
Put join result into a temporary table and use filesort() on it “Using temporary; Using filesort” in the first row

Now I’ll cover those three methods in more detail. The first method can be used when the first non-const table in the join order has an index that matches the ORDER BY list. In this case we can use the index, and the ordering will “spread over” other tables onto the output. Here is a swimlane diagram of this process, where different columns represent different values of the ORDER BY expression:
This method is preferred over the other two as it requires no additional sorting steps.

The second method can be used when all ORDER BY elements refer to the first table in the join order. In this case, we can filesort() the first table and then proceed to execute the join:
Here filesort() may operate either in Mode 1 or in Mode 2. One may wonder why this is limited to doing filesort() after the first table. After all, we could do it after the second table as well – produce (tbl1, tbl2) record combinations, put them into temporary table, sort, etc. The expectation is perfectly reasonable but alas, MySQL will not even consider such query plans.

The last, the catch-all method is to write the entire join output into the temporary table and then invoke filesort:

I have an easier time recalling those three strategies when there are pictures, hopefully they’ll help you too. That’s all for today, in next posts I’ll cover the topics of how ordering affects join optimization and interoperates with GROUP BY.

An idea: create ha_trace tracing storage engine

Our exprience in solving query optimizer problems shows that a good chunk of optimization problems are incorrect choice of join order. The most frequent causes of the problems are

  1. Table engine returns index statistics or records-in-range numbers that are very far from reality;
  2. The WHERE clause has high-selectivity conditions for one table and low-selectivity conditions for the other. The optimizer is not aware of this and chooses a poor join order;
  3. Bugs or shortcomings in the optimizer or the table engine;

At the moment investigation of those kinds of problems is hard and time-consuming:

  • There is no easy way to find out what the storage engine has returned to the optimizer from records-in-range calls. One has to manually repeat the steps taken by equality propagation, constant table detection and other query rewrites, construct the table’s condition and run the corresponding single-table EXPLAIN;
  • You have to construct and run separate queries to check if the table/index statistics matches the actual distribution of the data examined by the query (far-off statistics values for skewed data distribution is a frequent problem);
  • There is no easy way to obtain information about the selectivity of the table’s condition. You either need a debugger or, in some cases, you can manage with doing non-trivial calculations over the values of Handler_read_XXX counters.

A completely different problem is that now some storage engines are provided by other parties than MySQL, and we’ll need to be able to find out who is at fault when something doesn’t work with a 3rd-party engine.

Having spent some time thinking about both of the above, I’ve got an idea that both can be resolved by a single solution – write a tracing storage engine.

The ha_trace engine

The engine will be used this way:

  1. You take the tables that are involved in the problematic query and “wrap” them in ha_trace tables:
    CREATE TABLE trace_tbl1 (... copy the DDL ...) ENGINE=TRACE UNDERLYING_TBL='tbl1' 

    The trace tables look and can be queried exactly like their underlying tables

  2. You run the problematic query againist the wrapper tables. The trace engine intercepts all storage engine calls and either just records them or accumulates some summary information.
  3. You obtain the trace results by querying some special $trace$results table.

The trace engine can do many practically useful things:

Statistics checker mode
This mode will check if the engine statistics numbers are any good. ha_trace will intercept statistics and record retrieval API calls and compile a report on how much were the differences between statistics and the observed table data. It could also build histogram of the observed data, which would give a clue about whether running ANALYZE TABLE would help.

Call trace mode
This is similar to what ltrace(1) does. The advantage over ltrace is that ha_trace knows the meaning of the calls and can produce more meaningful output. In particular, it can provide smarter call filtering and compression of repetitive call patterns.

Call trace logs can be used to check if the performed calls are compliant with the storage engine API and find out who is the violator, SQL Layer or the engine.

Query trace mode
This is similar to call trace but more high level. It will provide a view of how query was executed, giving information about how many rows were scanned in each table and so forth. We’ll need to use some kind of compression to reduce the size of the collected data. I’m thinking of building a regular expression/automaton with counters in the edges, so the collected data looks like this:

  t1.rnd_init(TRUE /*do sequential scan*/)= 0
  1000 * {
      t1.rnd_next()= 0
      t2.index_read(...)= {0 - 700 times, EOF=300 times}
      20 (min=1,max=200) * { t2.index_next_same() }

If you haven’t already figured out how the EXPLAIN of this query looks, here it is:

explain select * from t1, t2 where t2.key=t1.col;
| id | select_type | table | type | possible_keys | key  | key_len | ref     | rows | Extra |
|  1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL    | 1000 |       |
|  1 | SIMPLE      | t2    | ref  | a             | a    | 5       | t1.key1 |   22 |       |

The trace shows a lot of useful information:
* In 300 of 1000 cases, table t2 had no matching records
* When there were matches, there were on average 20 matches
* There was an off-the-scale case when there were 200 matching rows
And you can just see all this, without the doing arithmetic exercises over ‘Handler_read_%’ status variables.

Whether this is the right solution

Wrapper engine is the natural solution for troubleshooting problems between storage engines and the SQL layer (think of ltrace, strace, ODBC trace, etc etc). It is a reasonably good query trace solution too:

  • Per-table and per-query tracing is a fine level of granularity that is big step ahead of counters
  • ha_trace adds exactly zero overhead when it is not enabled
  • Works for all current and future storage engines
  • ha_trace is probably the only tracing solution that doesn’t require making changes all over the code.

Having written all this in one sitting (well, also some subway standing), I’ll now try to spend the remainder of this weekend in a less geeky way.

EXPLAIN: "Using join cache" renamed to "Using join buffer"

As soon as I’ve finished writing this post about “Using join cache”, it was apparent that “Using join cache” is poor wording. First, the corresponding server variable is called @@join_buffer_size, not join cache size, and second, there is really no cache involved.

We’ve had a discussion about how this should be called. Some sources use term Block nested-loops join but we’ve settled on “Using join buffer“. Another change is that we’ve decided to move the note one line down to the table that “does the buffering”. As a result, what was this

mysql> explain select * from t1, t2 where t1.col < 10 and t2.col < 'bar';
| id | select_type | table | type  |     | Extra                         |
|  1 | SIMPLE      | t1    | range |     | Using where; Using join cache |
|  1 | SIMPLE      | t2    | range |     | Using where                   |

became this:

mysql> explain select * from t1, t2 where t1.col < 10 and t2.col < 'bar';
| id | select_type | table | type  |     | Extra                          |
|  1 | SIMPLE      | t1    | range |     | Using where                    |
|  1 | SIMPLE      | t2    | range |     | Using where; Using join buffer |

The change was pushed into 5.1.19 tree. MySQL 5.1.18 is the only version that shows “Using join cache”. I’ve also updated the original explanation post to display the final variants of EXPLAIN output.

Partition pruning tip: Use comparisons of column, not of partitioning function value

A customer issue has drawn my attention to this this pecularity: if partitioning is done by the value of some function, then partition pruning module will make use of comparisons of the partitioning column(s), but not of comparisons of the value of the partitioning function. Here is an example:

  PARTITION p0 VALUES LESS THAN ( TO_DAYS('2007-01-01') ),
  PARTITION p1 VALUES LESS THAN ( TO_DAYS('2007-02-01') ),
| id | select_type | table | partitions | type |
|  1 | SIMPLE      | t1    | p1         | ALL  |

| id | select_type | table | partitions     | type |
|  1 | simple      | t1    | p0,p1,p2,p3,p4 | all  |

This also applies to cases where partitioning function is a function of multiple columns. This gives us this tip:

For partition pruning to work, the WHERE clause must use comparisons of columns used by partitioning function, not of the result value of the partitioning function.

This doesn’t look like a totally natural limitation, but that is what is in effect for the time being.

The impact of this limitation and whether it can be lifted

The limitation can be easily worked around in cases where partitioning function is unary and monotonic, i.e where one can easily convert back and forth between comparison of partitioning function:

  monotonic_part_func(const1) < monotonic_part_func(part_col) < monotonic_part_func(const2)

and comparison of partitioning column:

const1 < part_col < const2

Partition pruning module can do such conversions, too, so it is not difficult to make MySQL handle such cases automatically. The case with non-monotonic or n-ary partitioning function is more difficult. Partition pruning uses range analysis over column ordering, that is, it analyzes the WHERE clause, converts it to a disjoint list of intervals like:

(a < part_col < b) OR (part_col = c) OR ...

and then finds which partitions those intervals fall into. We could switch to part_func(part_col) ordering, i.e. collect intervals like

(a < part_func(part_col) < b) OR (part_func(part_col) = c) OR ...

but this will make it hard to handle predicates like “a < part_col < b” (remember I’m talking about non-monotonic case, where X < Y does not mean that part_func(X) < part_func(Y)).

We could do two partition pruning passes, one with part_col ordering and the other with part_func(part_col) ordering, but that is slow, ugly, and will not handle cases like

part_col=const1 OR part_func(part_col)=const2

We could stop doing the range analysis altogether and operate on sets of used partitions, that is, recursively process the WHERE clause using a set of rules like this:

# AND/OR recursion
partitions(cond1 AND cond2) := intersect(partitions(cond1), partitions(cond2))
partitions(cond1 OR cond2) := union(partitions(cond1), partitions(cond2))

# Various atomic predicates
partitions(part_col = C) := { partition_no(part_func(C)) }
partitions(part_col < C) := ...

but that will be slow, especially when there are many partitions. Also this technique will work poorly (or get very complicated) in cases where partitioning function uses several columns, and comparisons of those columns are on different AND/OR branches in the WHERE clause.

Neither solution is prefect, and each of them adds some overhead. It seems we’ll need to collect some user input before we decide what (if anything) we should do.

Use of join buffer is now visible in EXPLAIN

* s/Using join cache/Using join buffer/, changed to show the final variants of EXPLAIN output as described here
* s/join_buff_size/join_buffer_size/

Starting from 5.1.18, EXPLAIN output may show “Using join cache“, like in this example:

mysql> explain select * from t1, t2 where t1.col < 10 and t2.col < 'bar';
| id | select_type | table | type  |     | Extra                          |
|  1 | SIMPLE      | t1    | range |     | Using where                    |
|  1 | SIMPLE      | t2    | range |     | Using where; Using join buffer |

The join cache is actually not a new feature. It has been available in MySQL at least since version 4.0, and for all this time it has remained invisible and undocumented. The only thing that indicated its presense was the @@join_buffer_size server variable.

We’re trying to gradually make EXPLAIN show more information. Georgi Kodinov was fixing BUG#27531 and has used that occasion to make join buffering show up in EXPLAIN output.

If you already know how MySQL’s join buffering works, that’s all the news. If not, the remainder of this post has a hopefully readable explanation of how join buffering works and when it is used.

How join buffering works

Let’s start with regular Nested Loops Join. Suppose we have a join query

select * from t1, t2, t3 where t2.key1=t1.col1 and t3.key1<40;

and the query plain is like shown in this EXPLAIN output:

    | table | type  | possible_keys | key  | key_len | ref          | rows | Extra       |
    | tbl1  | ALL   | NULL          | NULL | NULL    | NULL         |   10 |             |
    | tbl2  | ref   | key1          | key1 | 5       | db.tbl1.col1 |    2 | Using where |
    | tbl3  | range | key1          | key1 | 5       | NULL         |   40 | Using where |

When no join buffering is used, the query will be executed as follows:

  for each record t1rec in table tbl1
    for each record t2rec in tbl2 such that t2rec.key1=t1rec.col
      for each record t3rec in tbl3 such that t3rec.key1<40
        pass the (t1rec, t2rec, t3rec) row combination to output;

Graphically the execution flow can be depicted as follows (yellow are the table scans, blue are the table rows):


From the code and picture we see that:

  • Table tbl2 is scanned several times, but each scan accesses a different part of the table
  • Table tbl3 is scanned many times, and all performed scans are identical

It is apparent that the second and the third scans of table tbl3 bring no new information and can be removed. We do not have to re-scan tbl3 for any row combination from tables tbl1, tbl2. Instead, we could accumulate a back of such row combination, and then do one tbl3 scan for all of them. And this is what join buffering is.

In pseudo-code, the execution will look as follows:

  for each record t1rec in table tbl1
    for each record t2rec in tbl2 such that t2rec.key1=t1rec.col
      put (t1rec, t2rec) into the buffer
      if (buffer is full)

  flush_buffer() {
    for each record t3rec in tbl3 such that t3rec.key1<40
      for each record in the buffer
        pass (t1rec, t2rec, t3rec) row combination to output;
    empty the buffer;

And graphically it will look as follows:


The EXPLAIN output will be as follows:

explain select * from t1,t2,t3 where t2.key1 = t1.col1 and t3.key1<40;
    | table | type  | possible_keys | key  | key_len | ref          | rows | Extra                          |
    | t1    | ALL   | NULL          | NULL | NULL    | NULL         |   10 |                                |
    | t2    | ref   | key1          | key1 | 5       | test.t1.col1 |    2 | Using where                    |
    | t3    | range | key1          | key1 | 5       | NULL         |   40 | Using where; Using join buffer |

In this example join buffering is used for one table, but it can be used for several tables as well. MySQL uses join buffering whenever it can, access to some table tbl_x will be bufferred if

  • The SELECT does not have an ORDER BY clause
  • We’re not at top level “select” of a multi-table UPDATE
  • tbl_x is accessed using an “independent” access method: ALL, index, range, or index_merge.
  • tbl_x is not inner w.r.t. some outer join

The server variable @@join_buffer_size specifies how much MySQL should allocate for each buffer. That is, if two tables use buffering, MySQL will allocate two buffers of @@join_buffer_size bytes each.

Thoughts on partitioning optimizations

I’m supposed to be working on subquery optimization but I can’t get BUG#26630 out of my head.

The bug test case is small and easy to understand: looking at the EXPLAIN:

create table tbl_test( ... ) partition by range(num) (...);

explain partitions
select * from tbl_co c straight_join tbl_test t where t.num=c.num and reg=8;
  | table | partitions  | type | possible_keys | key  | key_len | ref  | rows | Extra       |
  | c     | NULL        | ALL  | NULL          | NULL | NULL    | NULL |   17 | Using where |
  | t     | p0,p1,p2,p3 | ALL  | NULL          | NULL | NULL    | NULL |   17 | Using where |

we see that for table t no partition pruning is performed. While it is apparent that MySQL could do this (pieces in red font indicate what’s missing):

  for each row in table c
     $p= <find partition that has rows such that t.num = c.num>;
     for each row in table t, within partition $p
       if (where clause matches)
         pass row combination to output;

Partition Selection does something similar, but it will work only if there is an INDEX(t.num) and the optimizer choses to access table t via ref(t.num=c.num). This doesn’t hold in our case, so all partitions will be scanned every time.

Ok, now on to the general thoughts. MySQL has two kinds of partitioning optimizations:

  1. Partition pruning is performed statically, i.e. we look at the query and infer a set of partitions that do not need to be accessed, no matter which query execution plan will be used.
  2. Partition selection is the opposite: when we’re executing the query, and do an index lookup on
      tbl.keypart1=const1 AND tbl.keypart2=const2 AND ... AND tbl.keypartN=constN

    we check if those equalities allow us to determine one single partition that needs to be accessed. If yes, we access only that partition.

That is, we have

Partition pruning
  • static (can use predicates that depend only on this table)
  • thorough predicate analysis (can use >, <, BETWEEN, etc)
Partition selection
  • dynamic (can use predicates like tbl.x=tbl2.y)
  • can use equalities only

This dualism has its direct counterpart in MySQL table access methods. If we ignore “special” methods like fulltext and loose index scan, we have

  • “static” access methods (can use predicates that depend only on this table)
  • thorough predicate analysis
ref-family methods
  • dynamic (can use predicates like tbl.x=tbl2.y)
  • can use equalities only

In fact, Partition Pruning is performed by creating a fake “index description”, running range/index_merge analyzer and then analyzing the obtained ranges.

Partition Selection could have been implemented in a similar way by re-using the ref-family methods analyzer: we could create a fake index description, run the analyzer, and then check if there are any potential ref accesses that use that index. If we have a ref-access candidate on

  tbl.partition_col = some_expresion(some_tables)

then we will know that we only need to access one partition, and we’ll know how to find out which. This solution is better than Partition Selection because

  • The requirement that we use an index that covers all partitioned columns will be lifted
  • The optimizer will know that only one partition will be accessed (currently this is discovered only at runtime) and will be able to take this into account
    • and it will be easy to show it in EXPLAIN, too
  • Architecturally, everything will look very nice:
    Engine User
    range/index_merge analyzer partition pruning
    ref-family analyzer partition selection

Having written all this I now realize that the example of BUG#26630 will probably not be resolved – it will still have to use all partitions because of the automatic unconditional use of “join buffering”. Well, hopefully the reporter does not really intend to run cross joins and has some will be satisfied with ability to use indexes that do not cover all partitioned fields.


Now a small rant. It seems all this is an example of Conway’s Law:

  1. Partition pruning was designed/implemented by people who were familiar with range/index_merge analyzer. Hence the reuse.
  2. Partition selection was designed/implemented by people who I beleive were not familiar with ref-family analyzer. Hence, no reuse. They were familiar with table handler interface and so partition selection went into the handler interface.