Slides: New subquery optimizations in MySQL 6.0

A bunch of (hopefully) self-explanatory slides about new subquery optimizations in MySQL 6.0 is available here (UPDATE: here’s a working link). The slides are from this MySQL University session, so there was an audio stream but there were some difficulties with it and it is not available now.

If you miss the audio, the next opportunity to hear about this topic is the New Subquery Optimizations in MySQL 6.0 session I’m giving at the MySQL User Conference on April, 15th.

If you don’t miss the audio, please check out the mentioned session anyway – we already have more content than there are in the slides, and I expect to get even more till the MySQL UC.


Observations about subquery use cases

As I wrote earlier, we (me and Ranger) have done an assessment of the impact of new 6.0 subquery optimizations. First, we’ve done a search for performance issues in MySQL’s bug database. The raw list is here, and the here is a summary chart:

Search for customer issues in MySQL’s issue tracker system has produced a similar picture (raw list):

Search in DBT-{1,2,3,4} open source benchmark suites produced a radically different result though (raw data is here):

The samples are small but I think one can already conclude that the benchmarks and user cases have different kinds of subqueries. The cause of the discrepancy is not clear. Are people’s applications different from what the benchmarks simulate? If yes, what is the benchmark that simulates the apps our users are running?

A dumb search for subqueries in random open source applications using grep didn’t produce much. Again, no idea why, either I was looking at the wrong applications, or it could be that applications that use subqueries do not have SQL statements embedded in their code so one can’t find them with grep, or something else.

So far the outcome is that it’s nice that the new optimizations [will] capture a good fraction of real world cases, but there is no benchmark we could run to check or demonstrate this. Well, I’d still prefer this situation over the the opposite.

Evgen's CONNECT BY patch is in… PostgreSQL (oops, actually not. more details inside)

Evgen Potemkin is a developer in query optimizer team at MySQL, we’ve been working together for about 3 years now. So, it was a bit unexpected to read in the PostgreSQL news that (originally) his CONNECT BY patch has made it into PostgreSQL. No, we don’t have people coding for two DBMSes at the same time (yet?), the patch has been developed years before Evgen has joined MySQL.

One of the questions raised by the prospect of becoming a part of Sun was whether/how we will handle being a part of company that develops other DBMS as well. I suppose this patch news sets a good start.

It turns out the patch is not in the PostgreSQL tree after all. It is just people at (which is NOT just a different name for, and that was my error) merged the patch into a 8.3 version of PostgreSQL and made the result available to download).

Sorry for creating confusion, and thanks to those who pointed this out in the comments.

Subquery optimizations in MySQL 6.0

I’ve finally found time to put together a couple of wiki pages describing what work we’re doing for subquery optimizations in MySQL 6.0:

  • The Subquery_Works page has an overview of what we’ve done so far, we’re doing now and our future plans.
  • The 6.0 Subquery Optimization Cheatsheet has a short, easy-to-read description of what has been released in MySQL 6.0.3. That should be enough if you just want to get the new version and see if your subquery is/can be faster now or not. If you want to know more details, the nearest occasion is my MySQL University session on February, 28.
  • I’ve done a quick preliminary assessment of the impact of new optimizations. The raw results and first conclusions are here on the 6.0 Subquery Optimization Benchmarks page. We will be doing more analysis and I intend to post more observations based on the already collected data, stay tuned.

That’s everything I have on subqueries so far, but as I’ve already mentioned more is on the way.

Pluggable storage engine interface needs to support table name resolution hooks

I’ve started some attempts at coding ha_trace storage engine I’ve mentioned earlier. One of the first things that became apparent was that I needed a way to put a hook into table name resolution code so I can wrap tables into ha_trace objects.

The need to intercept table name resolution and do something other than looking at the .frm files is not unique to call trace engine:

  • Remote storage engines would benefit also:
    • NDB has a whole chunk of code that ships .frm files from one connected mysqld instance to another. It doesn’t hook into name resolution; it ships table definitions proactively, which could be nuisance if you use a new mysqld node to just connect and run a few queries
    • Federated relies on the user to issue CREATE TABLE statement that matches remote table’s DDL. There are no checks that table defintions match, you can do “interesting” things like omitting columns or having index definitions that don’t match the indexes on the remote table. I think this has little or no value, I’d more appreciate an approach where I provide remote server/database/table name and the server figures out the DDL of the remote table itself.
  • Storage engines that have their own data dictionary could relieve MySQL from the need to use .frm files. I remember a discussion about somebody who had lots (tens of thousands iirc) of tables and was running into issues because of the need to open all those .frm files.

I think the above justifies adding name resolution hooks into storage engine interface. I don’t have enough time/cause to implement it, my intent is to do tracing, not name resolution. If you think you’ll also benefit from the name resolution hook, please post a note here or at, perhaps we could collect enough voices to get it implemented.

Bugfix: DATE vs DATETIME comparisons are now sargable again

We (me coding, Igor and Evgen reviewing) have fixed BUG#32198. According to Support/Bugs team, the bug was causing pain to a number of people, so I thought it deserved a post. The problem was as follows:

  • Before MySQL 5.0.42, comparisons in form
      date_column CMP datetime_const

    were interpreted as comparisons of DATEs (CMP is one of =, <, > , <=, or >=). The time part of datetime_const was ignored.

  • In 5.0.42 we’ve fixed it to perform in a way that is closer to the SQL standard: the comparisons are now resolved by comparing values as DATETIMEs. Our fault was that “date_column CMP datetime_const” stopped being sargable. It turned out there quite a few people who had queries like
      SELECT ... WHERE date_column < NOW() ... 

    and those queries became awfully slow.

  • In 5.0.54, We’ve fixed BUG#32198 and made “date_column CMP datetime_const” sargable again. Apologies to everyone who was affected.

What keeps me concerned is that both old and new interpretation of DATE/DATETIME comparisons are not compatible with PostgreSQL. PostgreSQL does something strange: comparison of DATE column with DATETIME constant is interpreted as DATETIME comparison, while comparison with NOW() seems to be interpreted as comparison of DATEs:

MySQL PostgreSQL
mysql> select now();
| now()               |
| 2007-12-16 18:24:12 |
1 row in set (0.01 sec)
test=# select now();
 2007-12-16 18:23:47.197907+03
(1 row)


mysql> select a from tbl where date_col=now();
Empty set (0.02 sec)
test=# select a from tbl where date_col=now();
(0 rows)


mysql> select a from tbl
    -> where date_col='2007-12-16 18:24:12';
Empty set (0.00 sec)
test=# select a from tbl
test-# where date_col='2007-12-16 18:24:12';
(1 row)

bummer. Why is explicitly specified DATETIME value handled differently from NOW()?

I don’t know what is the logic behind this (or is this a bug in PostgreSQL?) If you do, please let me know.

In my SQL code, I think I’m going to play it safe and always explicitly cast to either DATE or DATETIME.

Slides and audio: How MySQL Handles ORDER BY, GROUP BY, and DISTINCT

I haven’t posted anything for a while, but that doesn’t mean that nothing was happening around MySQL Optimizer. In fact, exactly the reverse was true – so many thing were happening that I didn’t have time to blog.

I’ll try to make up for the lost coverage. Let’s start from smaller things. On Nov, 1 I’ve had a MySQL University session titled How MySQL Handles ORDER BY, GROUP BY, and DISTINCT.

The session page has slides and audio. If you’re ready to spend 1 hour on MySQL’s ways of doing ordering, you can listen to the audio. Otherwise, you can just scroll through the slides. They are reasonably self-contained and explain MySQL’s ORDER BY and GROUP BY handling in easy-to-digest colorful pictures.

How MySQL executes ORDER BY

In last couple of weeks there has been a tide of ORDER/GROUP BY-related optimization bugs, where I was the fixer or the reviewer. This wasn’t an easy job because there is no sane description of how GROUP BY/ORDER BY handling is supposed to work.

To figure it out, I had to write an explanation of how it works. The first part is about ORDER BY. Hopefully there will be subsequent parts that will show how GROUP BY is related to ORDER BY and how it works.

Available means to produce ordered sequences

MySQL has two methods to produce ordered streams.

The first is to use a “range“, “ref” or “index” access method over an ordered index. For versions up to 5.1, those access methods naturally return records in the index order, so we get ordering for free (the exception is NDB engine which needs to do merge-sort when it gets data from several storage nodes). In MySQL 5.2, MyISAM and InnoDB have MultiRangeRead optimization which breaks the ordering. We have a draft of how to make it preserve the ordering, but at the moment MRR is simply disabled whenever ordering is required.

The second, catch-all method, is to use the filesort algorithm. In a nutshell, filesort() does quicksort on chunks of data that fit into its memory and then uses mergesort approach to merge the chunks. The amount of memory available to filesort() is controlled by @@sort_buffer_size variable. if the sorted data doesn’t fit into memory (i.e. there is more than one chunk), filesort uses a temporary file to store the chunks.

Source data for filesort() always comes from one table. If there is a need to sort data from several tables, MySQL will first collect the data into a temporary table and then invoke filesort() for that temporary table. I don’t know the true reason for this. Codewise, filesort() wants to pull its source data using something like source.get_next_record() function, while join and union runtime produce their using result.put_next_record()-type calls, so maybe the temporary table is there only to resolve this push/pull mismatch and will go away once we get decent cursors.

filesort() has two modes of operation:

  1. Mode 1: the sorted elements contain all required columns of the source table. The result of the sorting is a linear sequence of output tuples, there is no need to access the source table after the sort is done.
  2. Mode 2: sort <sort_key, rowid> pairs and produce a sequence of rowids which one can use to get source table’s rows in the required order (but this will be essentially hit the table in random order and is not very fast)

Mode 1 is used whenever possible. Mode is used when mode1 is not applicable. This is the case when the sorted tuples have blobs or variable-length columns (TODO: check w/ evgen). Unfortunately, the EXPLAIN output provides no clue about which mode is used, so you’ll have to manually check for blobs in the output columns list.

Executing join to produce ordered stream

At the moment MySQL has three ways to run a join and produce ordered output:

Method EXPLAIN output
Use index-based access method that produces ordered output no mention of filesort
Use filesort() on 1st non-constant table “Using filesort” in the first row
Put join result into a temporary table and use filesort() on it “Using temporary; Using filesort” in the first row

Now I’ll cover those three methods in more detail. The first method can be used when the first non-const table in the join order has an index that matches the ORDER BY list. In this case we can use the index, and the ordering will “spread over” other tables onto the output. Here is a swimlane diagram of this process, where different columns represent different values of the ORDER BY expression:
This method is preferred over the other two as it requires no additional sorting steps.

The second method can be used when all ORDER BY elements refer to the first table in the join order. In this case, we can filesort() the first table and then proceed to execute the join:
Here filesort() may operate either in Mode 1 or in Mode 2. One may wonder why this is limited to doing filesort() after the first table. After all, we could do it after the second table as well – produce (tbl1, tbl2) record combinations, put them into temporary table, sort, etc. The expectation is perfectly reasonable but alas, MySQL will not even consider such query plans.

The last, the catch-all method is to write the entire join output into the temporary table and then invoke filesort:

I have an easier time recalling those three strategies when there are pictures, hopefully they’ll help you too. That’s all for today, in next posts I’ll cover the topics of how ordering affects join optimization and interoperates with GROUP BY.

An idea: create ha_trace tracing storage engine

Our exprience in solving query optimizer problems shows that a good chunk of optimization problems are incorrect choice of join order. The most frequent causes of the problems are

  1. Table engine returns index statistics or records-in-range numbers that are very far from reality;
  2. The WHERE clause has high-selectivity conditions for one table and low-selectivity conditions for the other. The optimizer is not aware of this and chooses a poor join order;
  3. Bugs or shortcomings in the optimizer or the table engine;

At the moment investigation of those kinds of problems is hard and time-consuming:

  • There is no easy way to find out what the storage engine has returned to the optimizer from records-in-range calls. One has to manually repeat the steps taken by equality propagation, constant table detection and other query rewrites, construct the table’s condition and run the corresponding single-table EXPLAIN;
  • You have to construct and run separate queries to check if the table/index statistics matches the actual distribution of the data examined by the query (far-off statistics values for skewed data distribution is a frequent problem);
  • There is no easy way to obtain information about the selectivity of the table’s condition. You either need a debugger or, in some cases, you can manage with doing non-trivial calculations over the values of Handler_read_XXX counters.

A completely different problem is that now some storage engines are provided by other parties than MySQL, and we’ll need to be able to find out who is at fault when something doesn’t work with a 3rd-party engine.

Having spent some time thinking about both of the above, I’ve got an idea that both can be resolved by a single solution – write a tracing storage engine.

The ha_trace engine

The engine will be used this way:

  1. You take the tables that are involved in the problematic query and “wrap” them in ha_trace tables:
    CREATE TABLE trace_tbl1 (... copy the DDL ...) ENGINE=TRACE UNDERLYING_TBL='tbl1' 

    The trace tables look and can be queried exactly like their underlying tables

  2. You run the problematic query againist the wrapper tables. The trace engine intercepts all storage engine calls and either just records them or accumulates some summary information.
  3. You obtain the trace results by querying some special $trace$results table.

The trace engine can do many practically useful things:

Statistics checker mode
This mode will check if the engine statistics numbers are any good. ha_trace will intercept statistics and record retrieval API calls and compile a report on how much were the differences between statistics and the observed table data. It could also build histogram of the observed data, which would give a clue about whether running ANALYZE TABLE would help.

Call trace mode
This is similar to what ltrace(1) does. The advantage over ltrace is that ha_trace knows the meaning of the calls and can produce more meaningful output. In particular, it can provide smarter call filtering and compression of repetitive call patterns.

Call trace logs can be used to check if the performed calls are compliant with the storage engine API and find out who is the violator, SQL Layer or the engine.

Query trace mode
This is similar to call trace but more high level. It will provide a view of how query was executed, giving information about how many rows were scanned in each table and so forth. We’ll need to use some kind of compression to reduce the size of the collected data. I’m thinking of building a regular expression/automaton with counters in the edges, so the collected data looks like this:

  t1.rnd_init(TRUE /*do sequential scan*/)= 0
  1000 * {
      t1.rnd_next()= 0
      t2.index_read(...)= {0 - 700 times, EOF=300 times}
      20 (min=1,max=200) * { t2.index_next_same() }

If you haven’t already figured out how the EXPLAIN of this query looks, here it is:

explain select * from t1, t2 where t2.key=t1.col;
| id | select_type | table | type | possible_keys | key  | key_len | ref     | rows | Extra |
|  1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL    | 1000 |       |
|  1 | SIMPLE      | t2    | ref  | a             | a    | 5       | t1.key1 |   22 |       |

The trace shows a lot of useful information:
* In 300 of 1000 cases, table t2 had no matching records
* When there were matches, there were on average 20 matches
* There was an off-the-scale case when there were 200 matching rows
And you can just see all this, without the doing arithmetic exercises over ‘Handler_read_%’ status variables.

Whether this is the right solution

Wrapper engine is the natural solution for troubleshooting problems between storage engines and the SQL layer (think of ltrace, strace, ODBC trace, etc etc). It is a reasonably good query trace solution too:

  • Per-table and per-query tracing is a fine level of granularity that is big step ahead of counters
  • ha_trace adds exactly zero overhead when it is not enabled
  • Works for all current and future storage engines
  • ha_trace is probably the only tracing solution that doesn’t require making changes all over the code.

Having written all this in one sitting (well, also some subway standing), I’ll now try to spend the remainder of this weekend in a less geeky way.

EXPLAIN: "Using join cache" renamed to "Using join buffer"

As soon as I’ve finished writing this post about “Using join cache”, it was apparent that “Using join cache” is poor wording. First, the corresponding server variable is called @@join_buffer_size, not join cache size, and second, there is really no cache involved.

We’ve had a discussion about how this should be called. Some sources use term Block nested-loops join but we’ve settled on “Using join buffer“. Another change is that we’ve decided to move the note one line down to the table that “does the buffering”. As a result, what was this

mysql> explain select * from t1, t2 where t1.col < 10 and t2.col < 'bar';
| id | select_type | table | type  |     | Extra                         |
|  1 | SIMPLE      | t1    | range |     | Using where; Using join cache |
|  1 | SIMPLE      | t2    | range |     | Using where                   |

became this:

mysql> explain select * from t1, t2 where t1.col < 10 and t2.col < 'bar';
| id | select_type | table | type  |     | Extra                          |
|  1 | SIMPLE      | t1    | range |     | Using where                    |
|  1 | SIMPLE      | t2    | range |     | Using where; Using join buffer |

The change was pushed into 5.1.19 tree. MySQL 5.1.18 is the only version that shows “Using join cache”. I’ve also updated the original explanation post to display the final variants of EXPLAIN output.