Correlated semi-join subqueries and PostgreSQL

The work on subquery optimizations goes on, and we’re already curious how the assortment of new 6.0 subquery optimizations compares to what is found in other DBMSes. MySQL’s new strategies are not DBMS theory breakthroughs; similar ideas have been implemented in other DBMSes. But when you take this list of strategies and put them into this kind of optimizer, the result has no direct equivalent in any other DBMS (or at least we believe so).

The first thing we did was to take a look at PostgreSQL, as it is easily available and seems to have at least decent subquery handling (or even better than decent; I have not heard many complaints). The first interesting difference was the handling of correlated subqueries. With the exception of materialization, MySQL’s new subquery strategies do not care whether the subquery is correlated or not.

For example, let’s look at table pullout. First, let’s try an uncorrelated subquery:

mysql> explain select * from tbl1 where col1 in (select primary_key from tbl2 where col2 < 20);
...
mysql> show warnings\G
...
Message: select `test2`.`tbl1`.`col1` AS `col1`,`test2`.`tbl1`.`col2` AS `col2` from (`test2`.`tbl2`) join `test2`.`tbl1` 
where ((`test2`.`tbl1`.`col1` = `test2`.`tbl2`.`primary_key`) and (`test2`.`tbl2`.`col2` < 20))

and we see that it has been converted into an inner join (visible in the rewritten statement above, where tbl2 is joined to tbl1 in the FROM clause). Now let’s make the subquery correlated:

mysql> explain select * from tbl1 where col1 in (select primary_key from tbl2 where
    -> col2 < 20 and col2=tbl1.col2);
...
mysql> show warnings\G
...
Message: select `test2`.`tbl1`.`col1` AS `col1`,`test2`.`tbl1`.`col2` AS `col2` from (`test2`.`tbl2`) join `test2`.`tbl1`
where ((`test2`.`tbl2`.`primary_key` = `test2`.`tbl1`.`col1`) and (`test2`.`tbl2`.`col2` = `test2`.`tbl1`.`col2`) and
 (`test2`.`tbl1`.`col2` < 20))

The query execution plans are different (because of the extra equality), but we can see that the subquery was converted into a join in both cases.
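
For the uncorrelated case, for instance, the server now executes something equivalent to this hand-written join (table pullout is valid here because tbl2.primary_key is unique, so the join cannot produce duplicate tbl1 rows):

  select tbl1.*
  from tbl1, tbl2
  where tbl1.col1 = tbl2.primary_key and tbl2.col2 < 20;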

Now let’s try PostgreSQL. First, the uncorrelated subquery:

psergey=# explain select * from tbl1 where col1 in (select primary_key from tbl2 where col2 < 20);
...
   ->  Index Scan using tbl1_col1 on tbl1  (cost=0.00..56.66 rows=600 width=8)
   ->  Index Scan using tbl2_pkey on tbl2  (cost=0.00..503.25 rows=12 width=4)
         Filter: (col2 < 20)

OK, a decent plan. Now let’s make the subquery correlated:

psergey=# explain select * from tbl1 where col1 in (select primary_key from tbl2 where
psergey(# col2 < 20 and col2=tbl1.col2);
...
     ->  Seq Scan on tbl2  (cost=0.00..200.00 rows=1 width=4)
           Filter: ((col2 < 20) AND (col2 = $0))

and we can see that PostgreSQL falls back to the “naive” approach, where the subquery predicate is evaluated in a direct, straightforward fashion and is not used for any optimizations.

So the first impression is that MySQL kinda wins this round. But wait: if you look at the user subquery case analysis here, you’ll see that the vast majority of semi-join subqueries are uncorrelated. I wonder if the PG folks at some point in the past made a conscious decision not to bother optimizing correlated subqueries.

For uncorrelated subqueries, it’s different. Our first impression is that PostgreSQL has a powerful tool, the hash join and its subquery variants, which it uses to get decent performance in a wide variety of cases. MySQL does not have hash join (Hi Timour? Kaamos?), but the new subquery optimizations can show kinda-similar behavior. It is not clear yet whether they’ll be a good substitute for a subquery hash join. I intend to post the analysis here once I figure it out. Stay tuned.

Slides: New subquery optimizations in MySQL 6.0

A bunch of (hopefully) self-explanatory slides about the new subquery optimizations in MySQL 6.0 is available here (UPDATE: here’s a working link). The slides are from this MySQL University session; the session had an audio stream, but there were some difficulties with it, so the audio is not available at the moment.

If you miss the audio, the next opportunity to hear about this topic is the New Subquery Optimizations in MySQL 6.0 session I’m giving at the MySQL User Conference on April 15th.

If you don’t miss the audio, please check out the mentioned session anyway: we already have more content than is in the slides, and I expect to have even more by the MySQL UC.

Observations about subquery use cases

As I wrote earlier, we (me and Ranger) have done an assessment of the impact of the new 6.0 subquery optimizations. First, we searched for performance issues in MySQL’s bug database. The raw list is here, and here is a summary chart:

A search for customer issues in MySQL’s issue tracker system produced a similar picture (raw list):

A search in the DBT-{1,2,3,4} open source benchmark suites produced a radically different result, though (raw data is here):

The samples are small, but I think one can already conclude that the benchmarks and the user cases have different kinds of subqueries. The cause of the discrepancy is not clear. Are people’s applications different from what the benchmarks simulate? If yes, what is the benchmark that simulates the apps our users are running?

A dumb search for subqueries in random open source applications using grep didn’t produce much. Again, I have no idea why. Either I was looking at the wrong applications, or applications that use subqueries don’t have SQL statements embedded in their code so one can’t find them with grep, or something else.

So far the outcome is that it’s nice that the new optimizations [will] capture a good fraction of real-world cases, but there is no benchmark we could run to check or demonstrate this. Well, I’d still prefer this situation over the opposite.

Evgen's CONNECT BY patch is in… PostgreSQL (oops, actually not. more details inside)

Evgen Potemkin is a developer in the query optimizer team at MySQL; we’ve been working together for about 3 years now. So it was a bit unexpected to read in the PostgreSQL news that (as it originally seemed) his CONNECT BY patch had made it into PostgreSQL. No, we don’t have people coding for two DBMSes at the same time (yet?); the patch was developed years before Evgen joined MySQL.

One of the questions raised by the prospect of becoming a part of Sun was whether/how we would handle being part of a company that develops another DBMS as well. I suppose this patch news makes a good start.

UPDATE
It turns out the patch is not in the PostgreSQL tree after all. It is just that people at http://www.postgresql.at (which is NOT just a different name for http://www.postgresql.org; that was my error) merged the patch into a version of PostgreSQL 8.3 and made the result available for download.

Sorry for creating confusion, and thanks to those who pointed this out in the comments.

Subquery optimizations in MySQL 6.0

I’ve finally found time to put together a couple of wiki pages describing what work we’re doing for subquery optimizations in MySQL 6.0:

  • The Subquery_Works page has an overview of what we’ve done so far, what we’re doing now, and our future plans.
  • The 6.0 Subquery Optimization Cheatsheet has a short, easy-to-read description of what has been released in MySQL 6.0.3. That should be enough if you just want to get the new version and see whether your subquery is (or can be) faster now. If you want to know more details, the nearest occasion is my MySQL University session on February 28.
  • I’ve done a quick preliminary assessment of the impact of the new optimizations. The raw results and first conclusions are here, on the 6.0 Subquery Optimization Benchmarks page. We will be doing more analysis, and I intend to post more observations based on the already collected data, so stay tuned.

That’s everything I have on subqueries so far, but as I’ve already mentioned more is on the way.

Pluggable storage engine interface needs to support table name resolution hooks

I’ve started some attempts at coding the ha_trace storage engine I mentioned earlier. One of the first things that became apparent was that I needed a way to put a hook into the table name resolution code so I can wrap tables into ha_trace objects.

The need to intercept table name resolution and do something other than looking at the .frm files is not unique to the call trace engine:

  • Remote storage engines would benefit also:
    • NDB has a whole chunk of code that ships .frm files from one connected mysqld instance to another. It doesn’t hook into name resolution; it ships table definitions proactively, which can be a nuisance if you use a new mysqld node just to connect and run a few queries.
    • Federated relies on the user to issue a CREATE TABLE statement that matches the remote table’s DDL. There are no checks that the table definitions match; you can do “interesting” things like omitting columns or having index definitions that don’t match the indexes on the remote table (see the sketch after this list). I think this has little or no value; I would prefer an approach where I provide the remote server/database/table name and the server figures out the DDL of the remote table itself.
  • Storage engines that have their own data dictionary could relieve MySQL of the need to use .frm files. I remember a discussion about somebody who had lots (tens of thousands, iirc) of tables and was running into issues because of the need to open all those .frm files.
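
For illustration, here is roughly what using Federated looks like today. This is a sketch with made-up table names and connection details; the point is that the column and index definitions are entirely on the user’s conscience:

  -- Nothing verifies that these definitions actually match the remote table
  CREATE TABLE remote_tbl (
    id INT NOT NULL,
    col1 VARCHAR(32),
    PRIMARY KEY (id)
  ) ENGINE=FEDERATED
    CONNECTION='mysql://user:pass@remote_host:3306/db1/tbl1';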

I think the above justifies adding name resolution hooks to the storage engine interface. I don’t have enough time/cause to implement this myself; my intent is to do tracing, not name resolution. If you think you’d also benefit from a name resolution hook, please post a note here or at internals@mysql.com; perhaps we can collect enough voices to get it implemented.

Bugfix: DATE vs DATETIME comparisons are now sargable again

We (me coding, Igor and Evgen reviewing) have fixed BUG#32198. According to the Support/Bugs team, the bug was causing pain for a number of people, so I thought it deserved a post. The problem was as follows:

  • Before MySQL 5.0.42, comparisons of the form
      date_column CMP datetime_const

    were interpreted as comparisons of DATEs (CMP is one of =, <, >, <=, or >=). The time part of datetime_const was ignored.

  • In 5.0.42 we fixed it to behave in a way that is closer to the SQL standard: the comparisons are now resolved by comparing the values as DATETIMEs. Our fault was that “date_column CMP datetime_const” stopped being sargable. It turned out there were quite a few people who had queries like
      SELECT ... WHERE date_column < NOW() ... 

    and those queries became awfully slow.

  • In 5.0.54, we fixed BUG#32198 and made “date_column CMP datetime_const” sargable again. Apologies to everyone who was affected.

What keeps me concerned is that neither the old nor the new interpretation of DATE/DATETIME comparisons is compatible with PostgreSQL. PostgreSQL does something strange: comparison of a DATE column with an explicitly specified DATETIME constant is interpreted as a comparison of DATEs, while comparison with NOW() seems to be interpreted as a DATETIME comparison:

Below are the results, MySQL first and then PostgreSQL for each query:
mysql> select now();
+---------------------+
| now()               |
+---------------------+
| 2007-12-16 18:24:12 |
+---------------------+
1 row in set (0.01 sec)
test=# select now();
              now
-------------------------------
 2007-12-16 18:23:47.197907+03
(1 row)

OK

mysql> select a from tbl where date_col=now();
Empty set (0.02 sec)
test=# select a from tbl where date_col=now();
 a
---
(0 rows)

OK

mysql> select a from tbl
    -> where date_col='2007-12-16 18:24:12';
Empty set (0.00 sec)
test=# select a from tbl
test-# where date_col='2007-12-16 18:24:12';
     a
------------
 2007-12-16
(1 row)

Bummer. Why is an explicitly specified DATETIME value handled differently from NOW()?

I don’t know what the logic behind this is (or is this a bug in PostgreSQL?). If you do, please let me know.

In my SQL code, I think I’m going to play it safe and always explicitly cast to either DATE or DATETIME.
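
Something along these lines (a sketch, reusing the hypothetical tbl/date_col from the examples above):

  -- Compare as DATEs: the time part of NOW() is explicitly thrown away
  SELECT a FROM tbl WHERE date_col < CAST(NOW() AS DATE);

  -- Compare as DATETIMEs: the intent is spelled out on the constant side
  SELECT a FROM tbl WHERE date_col < CAST('2007-12-16 18:24:12' AS DATETIME);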

Slides and audio: How MySQL Handles ORDER BY, GROUP BY, and DISTINCT

I haven’t posted anything for a while, but that doesn’t mean that nothing was happening around the MySQL optimizer. In fact, exactly the reverse was true: so many things were happening that I didn’t have time to blog.

I’ll try to make up for the lost coverage. Let’s start with the smaller things. On Nov 1, I had a MySQL University session titled How MySQL Handles ORDER BY, GROUP BY, and DISTINCT.

The session page has slides and audio. If you’re ready to spend 1 hour on MySQL’s ways of doing ordering, you can listen to the audio. Otherwise, you can just scroll through the slides. They are reasonably self-contained and explain MySQL’s ORDER BY and GROUP BY handling in easy-to-digest colorful pictures.

How MySQL executes ORDER BY

In the last couple of weeks there has been a tide of ORDER BY/GROUP BY-related optimization bugs, where I was either the fixer or the reviewer. This wasn’t an easy job because there is no sane description of how GROUP BY/ORDER BY handling is supposed to work.

To figure it out, I had to write an explanation of how it works. This first part is about ORDER BY. Hopefully there will be subsequent parts showing how GROUP BY is related to ORDER BY and how it works.

Available means to produce ordered sequences

MySQL has two methods to produce ordered streams.

The first is to use a “range”, “ref” or “index” access method over an ordered index. For versions up to 5.1, those access methods naturally return records in index order, so we get the ordering for free (the exception is the NDB engine, which needs to do a merge-sort when it gets data from several storage nodes). In MySQL 5.2, MyISAM and InnoDB have a MultiRangeRead optimization that breaks the ordering. We have a draft of how to make it preserve the ordering, but at the moment MRR is simply disabled whenever ordering is required.

The second, catch-all method is to use the filesort algorithm. In a nutshell, filesort() runs quicksort on chunks of data that fit into its memory and then uses a mergesort approach to merge the chunks. The amount of memory available to filesort() is controlled by the @@sort_buffer_size variable. If the sorted data doesn’t fit into memory (i.e. there is more than one chunk), filesort uses a temporary file to store the chunks.
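
If you want to check whether filesort had to spill to disk for a particular query, the Sort_merge_passes status variable can tell you (a sketch; tbl and its columns are hypothetical):

  SET SESSION sort_buffer_size = 64*1024;  -- keep the in-memory chunks small
  FLUSH STATUS;
  SELECT col1 FROM tbl ORDER BY col2;
  -- Sort_merge_passes > 0 means the sorted data did not fit into
  -- sort_buffer_size and filesort() had to merge on-disk chunks
  SHOW SESSION STATUS LIKE 'Sort%';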

Source data for filesort() always comes from a single table. If there is a need to sort data from several tables, MySQL will first collect the data into a temporary table and then invoke filesort() for that temporary table. I don’t know the true reason for this. Codewise, filesort() wants to pull its source data using something like a source.get_next_record() function, while the join and union runtime produce their output using result.put_next_record()-type calls, so maybe the temporary table is there only to resolve this push/pull mismatch and will go away once we get decent cursors.

filesort() has two modes of operation:

  1. Mode 1: the sorted elements contain all the required columns of the source table. The result of the sorting is a linear sequence of output tuples; there is no need to access the source table after the sort is done.
  2. Mode 2: sort <sort_key, rowid> pairs and produce a sequence of rowids, which one can use to get the source table’s rows in the required order (but this will essentially hit the table in random order and is not very fast)

Mode 1 is used whenever possible. Mode 2 is used when Mode 1 is not applicable. This is the case when the sorted tuples have blobs or variable-length columns (TODO: check w/ evgen). Unfortunately, the EXPLAIN output provides no clue about which mode is used, so you’ll have to manually check for blobs in the output columns list. An example follows.
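
For example (a sketch; the table is hypothetical, and per the above you won’t see the mode in EXPLAIN):

  CREATE TABLE articles (
    id INT NOT NULL PRIMARY KEY,
    author_id INT NOT NULL,
    body TEXT
  );

  -- Only fixed-size columns in the output: complete tuples can be sorted (Mode 1)
  SELECT id, author_id FROM articles ORDER BY author_id;

  -- The TEXT column in the output list should force <sort_key, rowid>
  -- sorting (Mode 2), with a second, random-order pass over the table
  SELECT id, body FROM articles ORDER BY author_id;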

Executing join to produce ordered stream

At the moment MySQL has three ways to run a join and produce ordered output:

Method and its EXPLAIN signature:

  • Use an index-based access method that produces ordered output: no mention of filesort in EXPLAIN
  • Use filesort() on the 1st non-constant table: “Using filesort” in the first row
  • Put the join result into a temporary table and use filesort() on it: “Using temporary; Using filesort” in the first row

Now I’ll cover those three methods in more detail. The first method can be used when the first non-const table in the join order has an index that matches the ORDER BY list. In this case we can use the index, and the ordering will “spread over” the other tables onto the output. Here is a swimlane diagram of this process, where different columns represent different values of the ORDER BY expression:
diagram
This method is preferred over the other two as it requires no additional sorting steps.
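
Here is a sketch of the kind of query where this can happen (hypothetical tables; t1 has an index on col1 and is first in the join order):

  CREATE TABLE t1 (col1 INT, col2 INT, KEY(col1));
  CREATE TABLE t2 (col3 INT, col4 INT, KEY(col3));

  -- If the optimizer reads t1 through the index on col1, the rows already
  -- come out in ORDER BY order and EXPLAIN mentions no filesort
  EXPLAIN SELECT * FROM t1, t2
  WHERE t1.col1 BETWEEN 1 AND 10 AND t2.col3 = t1.col2
  ORDER BY t1.col1;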

The second method can be used when all ORDER BY elements refer to the first table in the join order. In this case, we can filesort() the first table and then proceed to execute the join:
diagram
Here filesort() may operate either in Mode 1 or in Mode 2. One may wonder why this is limited to doing filesort() after the first table. After all, we could do it after the second table as well: produce (tbl1, tbl2) record combinations, put them into a temporary table, sort, etc. The expectation is perfectly reasonable, but alas, MySQL will not even consider such query plans.

The last, catch-all method is to write the entire join output into a temporary table and then invoke filesort:
diagram
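
A sketch of a query that typically ends up here: the ORDER BY refers to the second table in the join order, so neither of the first two methods applies (same hypothetical tables as in the earlier sketch):

  -- Ordering by a column of the second table: expect EXPLAIN to show
  -- "Using temporary; Using filesort" in the first row
  EXPLAIN SELECT * FROM t1, t2 WHERE t2.col3 = t1.col2 ORDER BY t2.col4;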

I have an easier time recalling those three strategies when there are pictures; hopefully they’ll help you too. That’s all for today. In the next posts I’ll cover how ordering affects join optimization and how it interoperates with GROUP BY.

An idea: create ha_trace tracing storage engine

Our experience in solving query optimizer problems shows that a good chunk of optimization problems is an incorrect choice of join order. The most frequent causes of such problems are:

  1. The table engine returns index statistics or records-in-range numbers that are very far from reality;
  2. The WHERE clause has high-selectivity conditions for one table and low-selectivity conditions for the other. The optimizer is not aware of this and chooses a poor join order;
  3. Bugs or shortcomings in the optimizer or the table engine.

At the moment, investigating these kinds of problems is hard and time-consuming:

  • There is no easy way to find out what the storage engine returned to the optimizer from records-in-range calls. One has to manually repeat the steps taken by equality propagation, constant table detection and other query rewrites, construct the table’s condition, and run the corresponding single-table EXPLAIN;
  • You have to construct and run separate queries to check whether the table/index statistics match the actual distribution of the data examined by the query (far-off statistics values for skewed data distributions are a frequent problem);
  • There is no easy way to obtain information about the selectivity of the table’s condition. You either need a debugger or, in some cases, you can get by with doing non-trivial calculations over the values of the Handler_read_XXX counters; a sketch of that manual method follows this list.
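
For reference, the counter arithmetic mentioned above looks roughly like this (a sketch; the query and tables are hypothetical):

  FLUSH STATUS;
  SELECT count(*) FROM t1, t2 WHERE t2.key1 = t1.col1 AND t2.col2 < 10;
  -- Handler_read_key is (roughly) the number of index lookups and
  -- Handler_read_next the number of rows read from them; selectivities
  -- have to be inferred from ratios of these counters
  SHOW SESSION STATUS LIKE 'Handler_read%';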

A completely different problem is that some storage engines are now provided by parties other than MySQL, and we’ll need to be able to find out who is at fault when something doesn’t work with a 3rd-party engine.

Having spent some time thinking about both of the above, I’ve got an idea that both can be resolved by a single solution – write a tracing storage engine.

The ha_trace engine

The engine will be used this way:

  1. You take the tables that are involved in the problematic query and “wrap” them in ha_trace tables:
    CREATE TABLE trace_tbl1 (... copy the DDL ...) ENGINE=TRACE UNDERLYING_TBL='tbl1' 

    The trace tables look, and can be queried, exactly like their underlying tables.

  2. You run the problematic query against the wrapper tables. The trace engine intercepts all storage engine calls and either just records them or accumulates some summary information.
  3. You obtain the trace results by querying some special $trace$results table.

The trace engine can do many practically useful things:

Statistics checker mode
This mode will check whether the engine’s statistics numbers are any good. ha_trace will intercept statistics and record-retrieval API calls and compile a report on how large the differences between the statistics and the observed table data were. It could also build a histogram of the observed data, which would give a clue about whether running ANALYZE TABLE would help.
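
Today you would do this check by hand, comparing what the optimizer believes with what the data actually looks like (a sketch; tbl2 and col2 are hypothetical):

  -- What the optimizer believes: the Cardinality column
  SHOW INDEX FROM tbl2;
  -- What the data actually says
  SELECT count(distinct col2), count(*) FROM tbl2;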

Call trace mode
This is similar to what ltrace(1) does. The advantage over ltrace is that ha_trace knows the meaning of the calls and can produce more meaningful output. In particular, it can provide smarter call filtering and compression of repetitive call patterns.

Call trace logs can be used to check whether the performed calls comply with the storage engine API, and to find out who the violator is: the SQL layer or the engine.

Query trace mode
This is similar to the call trace but more high-level. It will provide a view of how the query was executed, giving information about how many rows were scanned in each table and so forth. We’ll need to use some kind of compression to reduce the size of the collected data. I’m thinking of building a regular expression/automaton with counters on the edges, so that the collected data looks like this:

  t1.rnd_init(TRUE /*do sequential scan*/)= 0
  t2.index_init(key1);
  1000 * {
      t1.rnd_next()= 0
      t2.index_read(...)= {0 - 700 times, EOF=300 times}
      20 (min=1,max=200) * { t2.index_next_same() }
   }
   t2.index_end()=0;
   t1.index_end()=0;

If you haven’t already figured out how the EXPLAIN of this query looks, here it is:

explain select * from t1, t2 where t2.key=t1.col;
+----+-------------+-------+------+---------------+------+---------+---------+------+-------+
| id | select_type | table | type | possible_keys | key  | key_len | ref     | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+---------+------+-------+
|  1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL    | 1000 |       |
|  1 | SIMPLE      | t2    | ref  | a             | a    | 5       | t1.key1 |   22 |       |
+----+-------------+-------+------+---------------+------+---------+---------+------+-------+

The trace shows a lot of useful information:
  • In 300 of 1000 cases, table t2 had no matching records
  • When there were matches, there were 20 matches on average
  • There was an off-the-scale case where there were 200 matching rows

And you can just see all this, without doing arithmetic exercises over the ‘Handler_read_%’ status variables.

Whether this is the right solution

A wrapper engine is the natural solution for troubleshooting problems between storage engines and the SQL layer (think of ltrace, strace, ODBC trace, etc.). It is a reasonably good query trace solution too:

  • Per-table and per-query tracing is a fine level of granularity that is a big step ahead of counters
  • ha_trace adds exactly zero overhead when it is not enabled
  • Works for all current and future storage engines
  • ha_trace is probably the only tracing solution that doesn’t require making changes all over the code.

.
.
.
Having written all this in one sitting (well, partly also standing in the subway), I’ll now try to spend the remainder of this weekend in a less geeky way.