Slides: New subquery optimizations in MySQL 6.0 (new revision)

I’m now at MySQL Conference and have just finished my New Subquery Optimizations in MySQL 6.0 talk. If you’re eager to get the slides, here is the link to the pdf file and a mirror.

The talk has the same name as this MySQL University session but the slides have been thoroughly re-worked, and also there is new content:

  • More detailed explanation of subquery handling in 4.1/5.x
  • More reliable/justified benchmark numbers
  • Observations about subquery demographics
  • An attempt at comparative analysis of how MySQL’s subquery strategies compare to PostgreSQL’s
    • And a slide #28 that explains why we’re looking at PostgreSQL all the time

Have a nice viewing. And, if you get subqueries that should be fast but aren’t, we’re now actively looking at every subquery performance bug we get and will appreciate any input.

On possibility to have more than 64 indexes

I was asked on Sun Tech Days whether it is possible to use more than 64 indexes in MySQL. I’ve asked my colleague, Ingo Strüwing, and found out the following: yes, it is possible to have more than 64 indexes, and no additional patches are required. You can get 5.0 source tarball, ./configure it with --with-max-indexes=128 and you’ll get a server that can use up to 128 indexes. One needs to be cautious though:

  • --with-max-indexes=N option looks like you could configure server with arbitrary maximum number of indexes. That’s not true, if you use N>128 then the server will compile but it will not even be able to start.
  • 128-indexes build of server is not widely used, so potentially it can be less stable than the regular build. My experience: I compiled current 5.0 source with --with-max-indexes=128 and it has passed all bundled tests (good) but I don’t see any tests that would use more than 64 indexes (bad). I’d recommend to do some testing before you roll out anything of value on 128-indexes build.
  • Work on 128-indexes version is documented here: BUG#10932.

Hope this information was helpful.

Sun Tech Days St. Petersburg

I’m now at Sun Tech Days St Petersburg. MySQL got involved in it on a very short notice so there are no MySQL sessions or tutorials, and we only have a small booth. It’s decently manned though, there are six MySQLers here in total and we have no less than three people on the booth at nearly all times.

We organized some ad hoc sessions, so far, Alexey (aka Kaamos)’s talk about MySQL and dtrace attracted the most listeners, and Kostja was the second with his talk about new MySQL 5.1 features.

We get lots of questions on wide range of topics, the most popular are

  • Is Sun going to change MySQL license (close the source)? People are pleased to learn that MySQL will remain GPL.
  • How is MySQL development going to co-exist with PostgreSQL development inside Sun?
  • What, exactly, are MySQL licensing terms? People are not sure whether they will be able to use GPL or will need an OEM license, and what are the OEM licensing conditions and terms. It would be nice if MySQL website had more details about OEM agreements terms.

We get a fair number of technical questions as well. Here’s a quick dump in no particular order:

  • Does MySQL have special optimizations for Sun Niagara processors? What are the settings one should use when running on Niagara?
  • A complaint about poor subquery performance for a query generated by Hibernate. We figured that the issue is addressed in MySQL 6.0 by WL#3740 table pullout optimization.
  • Quite a few people are interested in in-memory transactional databases. Not all of them know about MySQL Cluster.
  • Most people are not aware of existence of MySQL Forge or MySQL University. People were interested to see slides, talks and other similar media, we’re handing out pieces of paper with MySQL Forge urls.
  • How does one create a load-balanced and/or fault-tolerant multi-machine MySQL installation? There is MySQL Proxy and there are 3rd-party solutions, there are also clustered JDBC drivers, but people would like to see a packaged or officially supported configurations. It would also be nice if there was an overview of available solutions and their relative properties.
  • MySQL’s command line client has poor tab completion in comparison to PostgreSQL’s. Indeed, hitting <TAB> in /usr/bin/mysql client always completes the list of tables and columns while in /usr/bin/psql completion is context-sensitive, e.g. if you type ALTER and hit TAB you’ll get a list of continuations available ALTER commands.
  • Is there a way to have more than 64 indexes?
  • Does MySQL’s GIS implementation allow to efficiently select primitives that are contained within (or have non-empty intersection with) a given axis-aligned rectangle? WL#3126 GIS: Precise spatial operations source tree needs to have some user-level documentation.
  • mysqldump and charsets. Quite a few people complained about having difficulties when doing export and import of databases in cp1251.
  • There needs to be some tool to assist upgrades. Several attendees said they were unable to upgrade their applications: the application stops working when moved to a newer MySQL version and they can’t figure out what went wrong. Ideally, they would like to have a tool that would just take their source code and update it with a newer MySQL, but that seems to be an “AI-complete” task. As a realistic option, we might use Proxy to run all queries against both old and new server, and report the differences.
  • MySQL cannot use index_merge over both fulltext and non-fulltext indexes, that is, it will use only one index (actually the fulltext one) for query like
    SELECT count(*) FROM tbl
    WHERE match(ft_key) AGAINST ('pizza' IN BOOLEAN MODE) AND non_ft_key='bar'

The conference is definitely worth to have presence on, also it’s nice to have enough people so we don’t have a strict booth watch schedule and there is always some rested force on the booth. Day-wise, the first day was the busiest, starting from the very opening hours. The second day had a slower start and even now (13:00) the crowd is not as thick. The good side of that is that it’s easier to have longer and uninterrupted discussions with interested attendees.

Correlated semi-join subqueries and PostgreSQL

The work on subquery optimizations goes on, and we’re already curious how the assortment of new 6.0 subquery optimizations compares to what is found in other DBMSes. MySQL’s new strategies are not DBMS theory breakthroughs, similar ideas have been implemented in other DBMSes, but when you take this list of strategies and put them into this kind of optimizer, the result has no direct equivalents in any other DBMS (or at least we believe so).

The first thing we did was to take a look at PostgreSQL as it is easily available and seems to have at least decent subquery handling (or even better than decent, I have not heard much complaints). And the first interesting difference was handling of correlated subqueries. With exception of materialization, MySQL’s new subquery strategies do not care if the subquery is correlated or not.

For example, let’s look at table pullout: First let’s try an uncorrelated subquery:

mysql> explain select * from tbl1 where col1 in (select primary_key from tbl2 where col2  show warnings\G
...
Message: select `test2`.`tbl1`.`col1` AS `col1`,`test2`.`tbl1`.`col2` AS `col2` from (`test2`.`tbl2`) join `test2`.`tbl1` 
where ((`test2`.`tbl1`.`col1` = `test2`.`tbl2`.`primary_key`) and (`test2`.`tbl2`.`col2` < 20))

and we see that it has been converted into an inner join (indicated by the part marked in red). Now let’s make the subquery correlated:

mysql> explain select * from tbl1 where col1 in (select primary_key from tbl2 where
    -> col2  show warnings\G
...
Message: select `test2`.`tbl1`.`col1` AS `col1`,`test2`.`tbl1`.`col2` AS `col2` from (`test2`.`tbl2`) join `test2`.`tbl1`
where ((`test2`.`tbl2`.`primary_key` = `test2`.`tbl1`.`col1`) and (`test2`.`tbl2`.`col2` = `test2`.`tbl1`.`col2`) and
 (`test2`.`tbl1`.`col2` < 20))

The query execution plans are different (because of the extra equality) but we see that subquery was converted into join in both cases.

Now let’s try PostgreSQL: first, uncorrelated subquery:

psergey=# explain select * from tbl1 where col1 in (select primary_key from tbl2 where col2   Index Scan using tbl1_col1 on tbl1  (cost=0.00..56.66 rows=600 width=8)
   ->  Index Scan using tbl2_pkey on tbl2  (cost=0.00..503.25 rows=12 width=4)
         Filter: (col2 < 20)

Ok, a decent plan. Now, let’s make the subquery correlated:

psergey=# explain select * from tbl1 where col1 in (select primary_key from tbl2 where
psergey(# col2   Seq Scan on tbl2  (cost=0.00..200.00 rows=1 width=4)
           Filter: ((col2 < 20) AND (col2 = $0))

and we can see that PostgreSQL starts using the “naive” approach, where the subquery predicate is evaluated using a direct, straightforward fashion and is not used for any optimizations.

So the first impression is that MySQL kinda wins this round. But wait, if you look at user subquery case analysis here, you’ll see that the vast majority of semi-join subqueries are uncorrelated. I wonder if PG folks at some point in the past have made a conscious decision not to bother optimizing correlated subqueries.

For uncorrelated subqueries, it’s different. Our first impression is that PostgreSQL has a powerful tool – hash join and its subquery variants, which it uses to get decent performance in wide variety of cases. MySQL does not have hash join (Hi Timour? Kaamos?), but the new subquery optimizations can have kinda-similar behavior. It is not clear yet whether they’ll be a good subquery hash join substitute. I intend to post the analysis here once I figure it out. Stay tuned.

Slides: New subquery optimizations in MySQL 6.0

A bunch of (hopefully) self-explanatory slides about new subquery optimizations in MySQL 6.0 is available here (UPDATE: here’s a working link). The slides are from this MySQL University session, so there was an audio stream but there were some difficulties with it and it is not available now.

If you miss the audio, the next opportunity to hear about this topic is the New Subquery Optimizations in MySQL 6.0 session I’m giving at the MySQL User Conference on April, 15th.

If you don’t miss the audio, please check out the mentioned session anyway – we already have more content than there are in the slides, and I expect to get even more till the MySQL UC.

 

Observations about subquery use cases

As I wrote earlier, we (me and Ranger) have done an assessment of the impact of new 6.0 subquery optimizations. First, we’ve done a search for performance issues in MySQL’s bug database. The raw list is here, and the here is a summary chart:

Search for customer issues in MySQL’s issue tracker system has produced a similar picture (raw list):

Search in DBT-{1,2,3,4} open source benchmark suites produced a radically different result though (raw data is here):

The samples are small but I think one can already conclude that the benchmarks and user cases have different kinds of subqueries. The cause of the discrepancy is not clear. Are people’s applications different from what the benchmarks simulate? If yes, what is the benchmark that simulates the apps our users are running?

A dumb search for subqueries in random open source applications using grep didn’t produce much. Again, no idea why, either I was looking at the wrong applications, or it could be that applications that use subqueries do not have SQL statements embedded in their code so one can’t find them with grep, or something else.

So far the outcome is that it’s nice that the new optimizations [will] capture a good fraction of real world cases, but there is no benchmark we could run to check or demonstrate this. Well, I’d still prefer this situation over the the opposite.

Evgen's CONNECT BY patch is in… PostgreSQL (oops, actually not. more details inside)

Evgen Potemkin is a developer in query optimizer team at MySQL, we’ve been working together for about 3 years now. So, it was a bit unexpected to read in the PostgreSQL news that (originally) his CONNECT BY patch has made it into PostgreSQL. No, we don’t have people coding for two DBMSes at the same time (yet?), the patch has been developed years before Evgen has joined MySQL.

One of the questions raised by the prospect of becoming a part of Sun was whether/how we will handle being a part of company that develops other DBMS as well. I suppose this patch news sets a good start.

UPDATE
It turns out the patch is not in the PostgreSQL tree after all. It is just people at http://www.postgresql.at (which is NOT just a different name for http://www.postgresql.org, and that was my error) merged the patch into a 8.3 version of PostgreSQL and made the result available to download).

Sorry for creating confusion, and thanks to those who pointed this out in the comments.

Subquery optimizations in MySQL 6.0

I’ve finally found time to put together a couple of wiki pages describing what work we’re doing for subquery optimizations in MySQL 6.0:

  • The Subquery_Works page has an overview of what we’ve done so far, we’re doing now and our future plans.
  • The 6.0 Subquery Optimization Cheatsheet has a short, easy-to-read description of what has been released in MySQL 6.0.3. That should be enough if you just want to get the new version and see if your subquery is/can be faster now or not. If you want to know more details, the nearest occasion is my MySQL University session on February, 28.
  • I’ve done a quick preliminary assessment of the impact of new optimizations. The raw results and first conclusions are here on the 6.0 Subquery Optimization Benchmarks page. We will be doing more analysis and I intend to post more observations based on the already collected data, stay tuned.

That’s everything I have on subqueries so far, but as I’ve already mentioned more is on the way.

Pluggable storage engine interface needs to support table name resolution hooks

I’ve started some attempts at coding ha_trace storage engine I’ve mentioned earlier. One of the first things that became apparent was that I needed a way to put a hook into table name resolution code so I can wrap tables into ha_trace objects.

The need to intercept table name resolution and do something other than looking at the .frm files is not unique to call trace engine:

  • Remote storage engines would benefit also:
    • NDB has a whole chunk of code that ships .frm files from one connected mysqld instance to another. It doesn’t hook into name resolution; it ships table definitions proactively, which could be nuisance if you use a new mysqld node to just connect and run a few queries
    • Federated relies on the user to issue CREATE TABLE statement that matches remote table’s DDL. There are no checks that table defintions match, you can do “interesting” things like omitting columns or having index definitions that don’t match the indexes on the remote table. I think this has little or no value, I’d more appreciate an approach where I provide remote server/database/table name and the server figures out the DDL of the remote table itself.
  • Storage engines that have their own data dictionary could relieve MySQL from the need to use .frm files. I remember a discussion about somebody who had lots (tens of thousands iirc) of tables and was running into issues because of the need to open all those .frm files.

I think the above justifies adding name resolution hooks into storage engine interface. I don’t have enough time/cause to implement it, my intent is to do tracing, not name resolution. If you think you’ll also benefit from the name resolution hook, please post a note here or at internals@mysql.com, perhaps we could collect enough voices to get it implemented.

Bugfix: DATE vs DATETIME comparisons are now sargable again

We (me coding, Igor and Evgen reviewing) have fixed BUG#32198. According to Support/Bugs team, the bug was causing pain to a number of people, so I thought it deserved a post. The problem was as follows:

  • Before MySQL 5.0.42, comparisons in form
      date_column CMP datetime_const

    were interpreted as comparisons of DATEs (CMP is one of =, <, > , <=, or >=). The time part of datetime_const was ignored.

  • In 5.0.42 we’ve fixed it to perform in a way that is closer to the SQL standard: the comparisons are now resolved by comparing values as DATETIMEs. Our fault was that “date_column CMP datetime_const” stopped being sargable. It turned out there quite a few people who had queries like
      SELECT ... WHERE date_column < NOW() ... 

    and those queries became awfully slow.

  • In 5.0.54, We’ve fixed BUG#32198 and made “date_column CMP datetime_const” sargable again. Apologies to everyone who was affected.

What keeps me concerned is that both old and new interpretation of DATE/DATETIME comparisons are not compatible with PostgreSQL. PostgreSQL does something strange: comparison of DATE column with DATETIME constant is interpreted as DATETIME comparison, while comparison with NOW() seems to be interpreted as comparison of DATEs:

MySQL PostgreSQL
mysql> select now();
+---------------------+
| now()               |
+---------------------+
| 2007-12-16 18:24:12 |
+---------------------+
1 row in set (0.01 sec)
test=# select now();
              now
-------------------------------
 2007-12-16 18:23:47.197907+03
(1 row)

OK

mysql> select a from tbl where date_col=now();
Empty set (0.02 sec)
test=# select a from tbl where date_col=now();
 a
---
(0 rows)

OK

mysql> select a from tbl
    -> where date_col='2007-12-16 18:24:12';
Empty set (0.00 sec)
test=# select a from tbl
test-# where date_col='2007-12-16 18:24:12';
     a
------------
 2007-12-16
(1 row)

bummer. Why is explicitly specified DATETIME value handled differently from NOW()?

I don’t know what is the logic behind this (or is this a bug in PostgreSQL?) If you do, please let me know.

In my SQL code, I think I’m going to play it safe and always explicitly cast to either DATE or DATETIME.