Ongoing MariaDB development: filtering and rewrites in mysqlbinlog

The main focus of MariaDB staff has been the MariaDB 5.1.38 release in recent weeks, but this doesn’t mean that we’ve abandoned everything else for it. There are several non-release projects going on, one of which is adding binlog filtering and markup capabilities.

In order to see how the new features fit in, let’s first look at what binlog filtering options are already present in MySQL (and so, MariaDB):

| kind                     | Master         | Slave                             | mysqlbinlog |
| DB-level filtering       | --binlog-do-db | --replicate-do-db                 | --database  |
| Table-level filtering    |                | --replicate-do-table=db.tbl       |             |
| Database name rewrite    |                | --replicate-rewrite-db="from->to" |             |
| Statement-verb filtering |                |                                   |             |

While MySQL had only statement-based replication, one could work around the blanks in the mysqlbinlog column by processing mysqlbinlog output with perl/awk/etc scripts. With Row-Based Replication, mysqlbinlog's output contains events that look roughly like this (a sketch; the base64 payload below is a stand-in for the actual encoded row data):
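
BINLOG '
<base64-encoded table map and row data>
';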


which practically prevents one from doing any processing on it with perl/awk or similar tools. We got a request to fix this and set out to add the following filtering capabilities:

| kind                     | Master         | Slave                             | mysqlbinlog |
| DB-level filtering       | --binlog-do-db | --replicate-do-db                 | --database  |
| Table-level filtering    |                | --replicate-do-table=db.tbl       | MWL#40      |
| Database name rewrite    |                | --replicate-rewrite-db="from->to" | MWL#36      |
| Statement-verb filtering |                |                                   | MWL#41      |

At this moment, MWL#36 has already been coded and pushed into the mariadb-5.2 tree. The rest of the tasks will hopefully follow.
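
For instance, once MWL#36 is in place, applying a binlog under a different database name would look something like this (a sketch; the option name and the "from->to" syntax are taken from the table above):

shell> mysqlbinlog --replicate-rewrite-db="dbA->dbB" mysql-bin.000001 | mysql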

Some implementation notes

MWL#36's --replicate-rewrite-db has the same limitations as the slave's --replicate-rewrite-db: cross-database updates and CREATE/DROP/ALTER DATABASE statements are not rewritten. We were lucky that the slave's replicate-rewrite-db had these limitations, so we could simply follow them. The point is that since the slave parses the queries, it is relatively easy for it to walk the parse tree and rewrite the database name wherever necessary (and thus handle all kinds of statements). Had that been implemented on the slave, it would have been very difficult to match in mysqlbinlog: mysqlbinlog has no SQL parser and so cannot reliably find references to the database name in SQL statement text.

It seems we won't be able to dodge this problem in MWL#40: Table-level filtering, though. According to the manual, replicate-ignore-table "tells the slave thread to not replicate any statement that updates the specified table, even if any other tables might be updated by the same statement" (source). In order to copy this behaviour, mysqlbinlog will need to be able to tell which tables are affected by each of the statements it processes. As written in the worklog entry, so far we see three reliable ways to do that:

  • Include MySQL’s SQL parser into mysqlbinlog
  • Have the master annotate the statements with easily-parseable information about which tables are updated in the statement.
  • Do not try to solve the problem on the mysqlbinlog side at all; delay it until the point where SQL is understood, i.e. the server. Let the server support an @@ignore_tables session variable and have mysqlbinlog print SET ignore_tables=... as the first statement in its output (see the sketch below).
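
For illustration, under the third option the output might start like this (purely hypothetical; no @@ignore_tables variable exists yet, and the name and syntax here are made up):

SET SESSION ignore_tables='db1.ignored_tbl,db2.other_tbl';
# ... all binlog events follow, unfiltered; the server skips the listed tables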

All three approaches have certain drawbacks. The first seems like overkill and will never be able to work for VIEWs. The second will increase the size of binary logs and won't work for un-annotated binary logs produced by legacy servers. The third approach requires the @@ignore_tables hack in the server and doesn't really do any filtering, which might be a nuisance when one does additional processing on mysqlbinlog's output. I'm still undecided.

MariaDB 5.1 feature: Table Elimination

MariaDB 5.1 beta is almost out, so it’s time to cover some of its features. The biggest optimizer feature is MWL#17 Table Elimination.

The basic idea behind table elimination is that sometimes it is possible to resolve a query without even accessing some of the tables it refers to. One can invent many kinds of such cases, but with Table Elimination we targeted a certain class of SQL constructs that one ends up writing when querying highly-normalized data.

The sample queries were drawn from “Anchor Modeling”, a database modeling technique which takes normalization to the extreme. The slides on the Anchor Modeling website have an in-depth explanation of Anchor Modeling and its merits, but the part that's important for table elimination can be shown with an example.

Suppose the database stores information about actors, together with their names, birthdates, and ratings, where ratings can change over time.

According to anchor modeling, each attribute should go into its own table:

-- the 'anchor' table which has only synthetic primary key
create table  ac_anchor(AC_ID int primary key);

-- a table for 'name' attribute:
create table ac_name(AC_ID int, ACNAM_name char(N),
                     primary key(AC_ID));

-- a table for 'birthdate' attribute:
create table ac_dob(AC_ID int,
                    ACDOB_birthdate date,
                    primary key(AC_ID));

-- a table for 'rating' attribute, which is historized:
create table ac_rating(AC_ID int,
                       ACRAT_rating int,
                       ACRAT_fromdate date,
                       primary key(AC_ID, ACRAT_fromdate));
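
To make the layout concrete, here is how a single actor's data could be spread across these tables (a hypothetical example; the ratings are made up, and N in char(N) above stands for some suitable name length):

insert into ac_anchor values (1);
insert into ac_name values (1, 'Gary Oldman');
insert into ac_dob values (1, '1958-03-21');
-- the 'rating' attribute is historized, so it may have several rows per actor:
insert into ac_rating values (1, 8, '2008-01-01');
insert into ac_rating values (1, 9, '2009-06-01');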

With this approach it becomes easy to add/change/remove attributes, but this comes at the cost of added complexity in querying the data: in order to answer the simplest, select-star question of displaying actors with their current ratings, one has to write outer joins:

-- Display actors, with their names and current ratings
select
  ac_anchor.AC_ID, ACNAM_Name, ACDOB_birthdate, ACRAT_rating
from ac_anchor
  left join ac_name on ac_anchor.AC_ID=ac_name.AC_ID
  left join ac_dob on ac_anchor.AC_ID=ac_dob.AC_ID
  left join ac_rating on (ac_anchor.AC_ID=ac_rating.AC_ID and
                          ac_rating.ACRAT_fromdate =
                            (select max(sub.ACRAT_fromdate)
                             from ac_rating sub where sub.AC_ID = ac_rating.AC_ID))

Apparently one won't want to write such a join every time they need to access an actor's properties, so they'll create a view:

create view actors as
  select  ac_anchor.AC_ID, ACNAM_Name,  ACDOB_birthdate, ACRAT_rating
  from <see the select above>

which will allow one to access the data as if it were stored in a regular way:

select ACRAT_rating from actors where ACNAM_name='Gary Oldman'

And this is where table elimination is needed.

Table elimination

The first thing the optimizer will do is merge the VIEW definition into the query, obtaining:

select ACRAT_rating
from ac_anchor
  left join ac_name on ac_anchor.AC_ID=ac_name.AC_ID
  left join ac_dob on ac_anchor.AC_ID=ac_dob.AC_ID
  left join ac_rating on (ac_anchor.AC_ID=ac_rating.AC_ID and
                          ac_rating.ACRAT_fromdate =
                            (select max(sub.ACRAT_fromdate)
                             from ac_rating sub where sub.AC_ID = ac_rating.AC_ID))
where ACNAM_name='Gary Oldman'

Now, it's important to realize that the obtained query has a useless part: the left join ac_dob line. Indeed,

  • left join ac_dob on ac_dob.AC_ID=... will produce exactly one matching record:
    • primary key(ac_dob.AC_ID) guarantees that there will be at most one match for any value of ac_anchor.AC_ID,
    • and if there is no match, LEFT JOIN will generate a NULL-complemented “row”,
  • and we don't care what the matching record is, as table ac_dob is not used anywhere else in the query.

This means that the "... left join ac_dob on ..." part can be removed from the query, and this is what the Table Elimination module does. The detection logic is rather smart: for example, it would also be able to remove the "... left join ac_rating on ..." part, together with the subquery (in the above example it isn't removed because ac_rating is used in the select list of the query). The Table Elimination module can also handle nested outer joins and multi-table outer joins.
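
Stripped of the Anchor Modeling specifics, the pattern the module looks for is the following (a minimal sketch with made-up tables t1 and t2):

select t1.*
from t1 left join t2 on t2.pk = t1.col;
-- t2 can be eliminated: t2.pk is unique, so every row of t1 has at most one
-- match (or gets a NULL-complemented row), and no column of t2 is used
-- anywhere else in the query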

User interface

One can check that table elimination is working by looking at the output of EXPLAIN [EXTENDED] and verifying that the eliminated tables are not listed there:

MySQL [test]> explain select ACRAT_rating from actors where ACNAM_name='Gary Oldman';
| id | select_type        | table     | type   | possible_keys | key     | key_len | ref                  | rows | Extra       |
|  1 | PRIMARY            | ac_anchor | index  | PRIMARY       | PRIMARY | 4       | NULL                 |    2 | Using index |
|  1 | PRIMARY            | ac_name   | eq_ref | PRIMARY       | PRIMARY | 4       | test.ac_anchor.AC_ID |    1 | Using where |
|  1 | PRIMARY            | ac_rating | ref    | PRIMARY       | PRIMARY | 4       | test.ac_anchor.AC_ID |    1 |             |
|  3 | DEPENDENT SUBQUERY | sub       | ref    | PRIMARY       | PRIMARY | 4       | test.ac_rating.AC_ID |    1 | Using index |
4 rows in set (0.01 sec)

Note that the ac_dob table is not in the output. Now let's try getting the birthdate instead:

MySQL [test]> explain select ACDOB_birthdate from actors where ACNAM_name='Gary Oldman';
| id | select_type | table     | type   | possible_keys | key     | key_len | ref                  | rows | Extra       |
|  1 | PRIMARY     | ac_anchor | index  | PRIMARY       | PRIMARY | 4       | NULL                 |    2 | Using index |
|  1 | PRIMARY     | ac_name   | eq_ref | PRIMARY       | PRIMARY | 4       | test.ac_anchor.AC_ID |    1 | Using where |
|  1 | PRIMARY     | ac_dob    | eq_ref | PRIMARY       | PRIMARY | 4       | test.ac_anchor.AC_ID |    1 |             |
3 rows in set (0.01 sec)

The ac_dob table is there, while ac_rating and the subquery are gone. Now, if we just want to check for the presence of an actor with the given name:

MySQL [test]> explain select count(*) from actors where ACNAM_name='Gary Oldman';
| id | select_type | table     | type   | possible_keys | key     | key_len | ref                  | rows | Extra       |
|  1 | PRIMARY     | ac_anchor | index  | PRIMARY       | PRIMARY | 4       | NULL                 |    2 | Using index |
|  1 | PRIMARY     | ac_name   | eq_ref | PRIMARY       | PRIMARY | 4       | test.ac_anchor.AC_ID |    1 | Using where |
2 rows in set (0.01 sec)

then both the ac_dob and ac_rating tables are eliminated.

Removing tables from a query is not expected to make the query slower, and it does not cut off any optimization opportunities, so we've made table elimination unconditional and don't plan to add any kind of query hints for it.

I wanted to add an @@optimizer_switch flag anyway, just in case, but Monty was against it, and eventually we agreed that @@optimizer_switch will have a table_elimination=on|off switch only in debug builds of the server.
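
In a debug build, one would then toggle it like any other optimizer switch (a sketch based on the agreement above):

SET optimizer_switch='table_elimination=off';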

Added a copy of mainline mysql-5.1 into our Buildbot

And it has failed the test suite on every single build slave. I've filed BUG#45605, BUG#45630 (together with a patch), BUG#45631, and BUG#45632. There is also an rpl.rpl_innodb_bug28430 test failure which I didn't report, as I don't yet have enough details about the build slave.

At the moment our setup works as follows: there is the lp:~maria-captains/maria/mysql-5.1-testing branch, which is our copy of lp:mysql-server. We periodically pull from the main tree into our copy (a manual process). Buildbot watches for pushes to our copy and runs builds/tests after every push. The results are publicly available here.

UPDATE: Sun people say their pushbuild is nearly green (= all tests pass, on nearly all platforms). This is very odd, as our build slaves are nothing special – about half of them are recent Ubuntu installs on the most popular architectures.

Test failures in 5.1, different people making different fixes for the same problem

This is what's happening at the moment: the 5.1 tree doesn't pass the tests, both Sun and Monty Program fix that, and the fixes are different. Here is a Valgrind warning which was fixed twice: my fix made the involved mysys function do what its name implies, while Alfranio's fix changed the replication code to not call that function anymore.

Besides the Valgrind warning, we observe failures of rpl_trigger.test and query_cache_28249.test (if you follow the links, you'll have to grep for the test name; Buildbot has some room for improvement). I get these failures in the maria-5.1-table-elimination tree. The problem is that when a failure is random (the query cache one is), or when I get it after merging from the main tree, I cannot easily tell whether

  • the problem is in my new code
  • the problem is in MariaDB
  • the problem is in the original MySQL
    • and they are not aware, or
    • they are aware and there is no fix yet
    • they are aware and have the fix in some team tree (merges from team trees to main can happen as rarely as once a month)

I think we'll have to take the main branch and have our Buildbot run tests for it, too. We'd like to add all publicly available trees (when analyzing random test failures, the more runs the better), but our small population of build slaves (volunteers welcome!) will not manage that many test runs.

Changed jobs, now at Monty Program AB

This isn't news anymore – it has been over a month – but it would be odd not to mention it at all, so here goes: at the start of May I left Sun Microsystems and joined Monty's company.

The setup at Monty Program AB is quite similar – we have an IRC channel (#maria on FreeNode), a mailing list, bazaar trees on launchpad, Worklog, and a Buildbot installation. It's actually more open than at Sun/MySQL: at Sun, everyone is on internal IRC, the outside public can only see a subset of Worklog (the biggest problem with which is that it's not possible to subscribe to changes), and their Buildbot-like system (it's called PushBuild and looks like this) is not visible to the outside public at all.

There are actually very good reasons why an external person might want to look at pushbuild. Everyone doing development (if you count Summer of Code students and engine developers, that's a lot of people) or just trying to use the newest features will at some point want to get the latest source from the bazaar repository. And the problem here is that the trees get broken every once in a while, and when you are pulling the sources from launchpad it is nice to know what you're pulling. You can run the tests yourself, sure, but that takes time. And if you do take the time to run the tests and see a failure, you won't know whether Sun/MySQL is already aware of the problem, whether it is repeatable on any computer, or whether you need to report your OS/compiler/configure flags/test run parameters, and so forth.

Getting back to things at Monty Program AB: my first tasks here are Table Elimination and index_merge improvements. I intend to cover them in more detail in a separate post.

Notes from Feature Request Bonanza session at Percona Performance conference

I was taking notes during the “Open Q&A: Feature Request Bonanza” session at the Percona Performance conference. The session started at 9 pm on the last day of the conference, so the room wasn't as full as it was for other sessions, but there was still an interesting discussion. I've missed several requests, but more than 90% of the stuff is there.

DISCLAIMER: People are mentioned when I could both identify them (I was in the first row, which rules out those in the back) and had time to note it, so the names below are contact points, not an indication of who was[n't] there. I was also somewhat tired, so please re-check the statements with their authors if you're going to draw any far-reaching conclusions from the below:

  • The first request was for partial index support: create and use indexes that only have records that match a certain condition.
  • Pre-allocate space after table creation. Monty: CREATE TABLE statement has MIN_ROWS parameter already, and it’s honored by MyISAM.
  • PeterZ: Besides allocation of space at table creation, it would be nice if it was possible to allocate table’s space in extents. Monty: This is possible in Maria.
  • Somebody requests the ability to specify the fill factor for InnoDB pages. Domas: Technically it's there; it can be changed in gdb.
  • X: I want to specify the parameter for each table and index. Monty: MariaDB has support for name=value table parameters. They are passed to the engine.
  • Monty: The engine should mark which options it recognized, and warnings should be issued for unrecognized options.
  • Jeremy Cole requests online DDL. Monty: Maria has lazy add/remove column.
  • PeterZ asks about online ALTER TABLE. There’s a discussion about instant vs. online/background ALTER TABLE and what kind of operations can be performed instantly or online.
  • Domas would like to see online OPTIMIZE. It should be a background process which one can start/stop or set to work at some limited rate, so that it can run without much impact on other activity; it should also be possible to make it behave more aggressively.
  • PeterZ asks what OPTIMIZE should do for SSD drives. Somebody answers that SSD drives have high cost per megabyte of storage, so OPTIMIZE should reclaim wasted space.
  • Antony Curtis requests that columns be able to have default values which are functions, without the use of triggers. Monty: this will require changes in the .frm file format. They want MariaDB to be compatible.
  • A call for new datatypes anyone would like to see added:
    • Ryan: column encryption
    • Plug-in abstract datatypes
    • Microsecond timestamp (Monty wants to add this)
    • PeterZ: blob/text data compression

    Then there's a discussion on where compression should be handled: inside the storage engine, at the connector level, somewhere inside the SQL runtime, etc. There are different opinions.

  • Somebody asks for more comprehensive diagnostics in general and a comprehensive set of DTrace probes in particular. There is a counterargument that it's not possible to have static DTrace probes for every possible case and that one should use dynamic tracing, though that requires knowledge of the source code.
    Monty requests a list of missing probes.
  • A request from Baron Schwartz: there are cases when MySQL does ref access using a non-unique index (because it covers all the needed columns) even though a unique index over the same columns exists; MySQL should use eq_ref access in such cases. He says he has run some benchmarks, and there's a 20% speed difference between ref and eq_ref; the engine doesn't matter.
  • Susanne asks for DROP CASCADE. Domas doesn’t want anybody to run DROP CASCADE on his servers.
  • Ryan requests that the index_merge optimizer be extended to allow a sort-intersect strategy. At the moment we have just ‘intersect’, which produces an intersection of rowid-ordered index scans, which means that it handles equality predicates:
    SELECT ... FROM tbl WHERE t.key1='foo' AND t.key2=123
    but not range predicates:
    SELECT ... FROM tbl WHERE t.key1 LIKE 'foo%' AND t.key2 BETWEEN 123 AND 134

  • Alexey Rybak requests Bitmap indexes.
  • He also requests a fully asynchronous client library, one that would allow a client app to run many queries on many servers concurrently. It seems Drizzle has a new client library that does that, though it doesn't support the binary protocol. Monty intends to wait until Drizzle's client library has stabilized, then add binary protocol support to it, and then see if it could be used instead of the standard library. Someone states that the new PHP connector already supports asynchronous operation.
  • Somebody asks if there is any way to scale down the RAM footprint of embedded MySQL; he has severe RAM (but not disk) constraints. I express doubt whether a database would work when the OS has no space left for the disk cache, but Monty says MyISAM is capable of operating reasonably well in such settings.
  • Ryan requests InnoDB to have instant (auto-maintained) table checksums.
  • Domas requests “fuzzy replication”. Here I can't make sense of my notes: it's something about losing some of the latest transactions but recovering to some consistent state, though I can't remember how all that relates to replication.
  • Ryan says it's annoying that InnoDB takes everything offline when it detects corruption. He suggests that InnoDB should take offline only the corrupted table (which is feasible when one is using the innodb_file_per_table option – sergeyp).
  • Monty says that Maria will shut down only the corrupted table and will automatically attempt to repair it. The audience wants the same to happen for partitions.
  • Somebody asks for something related to INSERT DELAYED. There's a reply that it can be achieved in the application with SQL statement queuing.
  • Domas tells a story about Wikipedia having a number of tables with various counters, like the number of pages in categories. They update the counters at the end of the transaction, and in a number of cases the counter update causes a deadlock which otherwise would not occur. All changes to counters are commutative/additive/reversible actions, so it would be nice if the engine [or its extension] understood this and used that knowledge to avoid deadlocks. Antony and Ryan mention they work around this problem by storing counters in non-transactional MyISAM tables.
  • Somebody requests settable limits on how much memory a client can use. Domas mentions that the day before he had demonstrated how a half-megabyte query can consume gigabytes of RAM without using any buffers. BUG#42946 and BUG#27863 are mentioned as other examples of how one could cause excessive memory consumption with seemingly innocent statements. Besides that, queries consume engines' internal buffer/cache resources, which is very difficult to account for. Monty says that the reality is that an experienced DBA is able to bring down the server.

That’s it. It will be interesting to get back to the list after one year and see if anything of it got implemented 🙂

Optimizer news: @@optimizer_switch syntax changes and backport

In short, the news is:

  • @@optimizer_switch support was backported into MySQL 5.1
  • The switch syntax was changed from 'no_optimization_name' to 'optimization_name=on|off|default'.
  • Added switches for index_merge, index_merge_intersection, index_merge_union, and index_merge_sort_union optimizations.

The changes will be available in the next releases, that is, MySQL 5.1.34 and 6.0.11.
Now with more details:

New switch names

Until now, the syntax mimicked an enum column or the @@sql_mode variable. One could set the value of @@optimizer_switch to a set of keywords, e.g.

SET @@optimizer_switch='no_semijoin,no_materialization';

Presence of a no_xxx keyword meant that its optimization was disabled; its absence meant it was enabled.

As of the next MySQL 5.1/6.0, @@optimizer_switch value is a set of on/off flags:

mysql> SELECT @@optimizer_switch;
| @@optimizer_switch                                                                        |
| index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on |

The SET statement accepts a list of commands:
SET [GLOBAL|SESSION] optimizer_switch='command,command,...'

where each command is one of:

| default                   | reset all optimization settings to default |
| optimization_name=off     | disable the optimization                   |
| optimization_name=on      | enable the optimization                    |
| optimization_name=default | set the optimization to its default state  |

The order of the commands does not matter ('default' is executed first if present); setting the same flag twice within one SET command is not allowed. Flags that are not mentioned keep their current values:

mysql> SELECT @@optimizer_switch;
| @@optimizer_switch                                                                        |
| index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on |
1 row in set (0.00 sec)

mysql> SET optimizer_switch='index_merge_union=off,index_merge_sort_union=off';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT @@optimizer_switch;
| @@optimizer_switch                                                                          |
| index_merge=on,index_merge_union=off,index_merge_sort_union=off,index_merge_intersection=on |
1 row in set (0.00 sec)

The advantages of the new way over the old one are that

  1. It is now possible to turn a certain optimization on or off with a single statement (SET optimizer_switch='malfunctioning_optimization=off') that does not depend on what other optimizer flags exist and what their values are.
  2. One can easily see what optimizer switches are available in the current server.
  3. In contrast to the grand solution of WL#4046, I could code and push this within a reasonable amount of time.

In addition, the mysqld binary got an --optimizer-switch parameter which allows setting optimizer_switch at server startup or in the my.cnf file.
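
For example, to disable one of the index_merge algorithms server-wide (a sketch; any flag from the tables below works the same way):

shell> mysqld --optimizer-switch='index_merge_intersection=off'

or, in my.cnf:

[mysqld]
optimizer_switch='index_merge_intersection=off'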

The backport

This is actually the reason for making all these changes. It turns out that the optimizer can make a wrong choice when considering whether to use the index_merge optimization. This can happen for both valid (unknown data correlations) and not-so-valid (mismatch between the cost model and reality) reasons. A fix for either of these problems would be too intrusive to put into the GA version (betas and major releases were invented for a reason), and we also just do not have one yet. So, we've decided to provide at least some resolution for those for whom index_merge made things worse, and introduced the following switches:

@@optimizer_switch flags in MySQL 5.1
| index_merge                                                         | turns on/off all index_merge optimizations                                        |
| index_merge_union, index_merge_sort_union, index_merge_intersection | turn on/off the individual index_merge algorithms (names as in the documentation) |

MySQL 6.0 has the above switches and also subquery optimization switches:

@@optimizer_switch flags in MySQL 6.0
| semijoin        | turns on/off all semi-join optimizations                                                      |
| materialization | turns on/off materialization (including semi-join materialization)                           |
| loosescan       | turns on/off the semi-join LooseScan strategy (not to be confused with GROUP BY's LooseScan) |
| firstmatch      | turns on/off the semi-join FirstMatch strategy                                               |
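
So if, say, a materialization-based plan misbehaves on some 6.0 query, one can do the following (a sketch using the syntax described above):

SET optimizer_switch='materialization=off';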

All future optimizations will be switchable as well. We’ve learned the lesson.

Sun Tech Days St. Petersburg 2009

Sun Tech Days St. Petersburg took place on Wednesday and Thursday last week, and we had a MySQL booth there. Unlike last year, we're fully part of Sun now, so we managed to get a decent-sized booth, arrange for leaflets, and Kostja gave an overall MySQL talk.

Questions at the booth (in no particular order):

  • When will Connector.NET support LINQ Entity framework? (According to Reggie Burnett: it is currently supported in Connector 6.0/Beta, which is expected to be GA soon)
  • Can Connector.Net be used with Mono? (Yes)
  • Is Workbench available for Linux? (Yes)
  • When will MySQL support stored procedures? This question seems to have replaced the infamous “When will MySQL support transactions” question.
  • What is the impact of different transaction isolation levels on performance of InnoDB? (no idea. If you ran some experiments please drop a comment)
  • When will MySQL support the LIMIT clause inside subqueries, in particular the
    ... WHERE col IN (SELECT ... ORDER BY ... LIMIT n) form? (we would like to add support for this, but there are no plans ATM. Request taken.)
  • Does MySQL have any limitations on table/database size or number of records in the table? (I gather people are used to having limitations in free versions of SQL Server or Oracle and expect something like that in MySQL)
  • When will fulltext search support searching for different word forms? (That's a big deal for searching Russian texts, as words get different suffixes depending on their grammatical case.) No plans ATM. Perhaps somebody has developed a fulltext parser plugin somewhere?
  • When will InnoDB get efficient support for COUNT(*)?
  • Are there any plans to make MySQL more efficient when handling big blob columns?

We’ve got several complaints that look like bugs:

  • A multi-table DELETE with LEFT JOIN fails to delete records when foreign keys are used. We've got a test case, so I've filed this as BUG#44207.
  • There's something wrong with Connector/Java and timezones. I'm not sure we managed to repeat the problem on our laptops; we were promised a bug report.
  • Another person complained about a sharp slowdown in join performance when the table size(s) exceed 1M rows. According to the reporter, all buffers are adequately sized and EXPLAIN shows that the query plan stays the same. No idea what this could be then, as the SQL layer doesn't have any hard-coded buffer sizes.

Also, there was this Java Duke guy, and we figured that it would be nice to get a MySQL dolphin next year, and have the developer with the biggest number of bugs in his code wear it :-).

More MySQL 6.0 news: next subquery optimizations WL pushed

Three days ago I finally managed to push the code for WL#3985 “Subquery optimization: smart choice between semi-join and materialization” into MySQL 6.0. I missed the clone-off date, so it won't be in the upcoming MySQL 6.0.9 release; the only way to get it before the 6.0.10 release is from the lp:mysql-server/6.0 bazaar repository.

What’s new in the push

Before WL#3985, 6.0’s subquery optimization had these three deficiencies:

  1. For semi-join subqueries (see the cheatsheet for a definition), you had to make a choice between having the optimizer use materialization or all the other strategies. The default behavior was not to use materialization; you could only get it by setting a server variable that disabled all the other strategies.
  2. The choice among the other strategies (FirstMatch, DuplicateWeedout, LooseScan) wasn't very intelligent: roughly speaking, the optimizer would first pick a join order as if there were only inner joins, then remember that some of them are actually semi-joins, and try to find out how it could resolve the semi-joins with the join order it had picked.
  3. Materialization only worked in the outer-to-inner fashion. That is, for a query like
    select * from people where name in (select owner from aircraft)
    it would still scan people and make lookups into a temporary table of aircraft owners. It was not possible to make it scan the temptable of aircraft owners and make lookups into people (see the sketch below).
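
Conceptually, the inner-to-outer variant corresponds to a rewrite like this (a sketch, not actual optimizer output):

-- materialize the distinct set of aircraft owners first, then use it
-- to drive index lookups into people:
select people.*
from (select distinct owner from aircraft) as owners, people
where people.name = owners.owner;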

WL#3985 fully addresses #1 and #2, and partially addresses #3. That is, now

  • Semi-join subqueries can use Materialization in an inner-to-outer fashion
  • The join optimizer is aware of the existence of semi-joins and makes a fully automatic, cost-based choice between FirstMatch, DuplicateWeedout, LooseScan, and the inner-to-outer and outer-to-inner variants of Materialization.

This is expected to be a considerable improvement. The most common class of subqueries,
SELECT ... WHERE expr IN (SELECT ... w/o GROUPing/UNIONs/etc) AND ...
is now covered by a reasonably complete set of execution strategies, and the optimizer is expected to be able to choose a good strategy in every case.

Possible gotchas, and we’re looking for input

I can't state that the subquery optimizer does have the capability to pick a good plan, because we haven't done many experiments with the subquery cost model yet. We intend to do some benchmarking, but we will also very much appreciate any input on how the subquery optimizer behaves for real-world queries. The code should be reasonably stable now; there are only three known problems, all of them edge cases:

  • LEFT JOINs. You may get wrong query results when the subquery or parent subquery use left joins.
  • “Grandparent” correlation. A query with a semi-join child subquery which has a semi-join grandchild subquery which refers to a column in the top-level select may produce wrong query plans/results under certain circumstances.
  • Different datatypes. You may get wrong results from queries that have col1 IN (SELECT col2 ...) where col1 and col2 are of different types (which should not happen too often in practice).

If you have subqueries with LEFT JOINs, please let us know as well: so far, all the LEFT JOIN + subquery cases we have were generated by the random query generator. Certain properties of the MySQL codebase make it difficult to make outer joins work with semi-joins, and if we don't get any real-world LEFT JOIN examples, chances are we will disable the subquery optimizations when there is a LEFT JOIN in the parent select, in the subquery, or in either.

MySQL 6.0 news: Batched Key Access is in

OK, this isn't very timely reporting, but about two weeks ago the Batched Key Access feature was pushed into MySQL 6.0. You can get it from the bazaar repo now (bzr branch lp:mysql-server/6.0), or wait several more weeks until MySQL 6.0.9 is released and get it from there.

Batched Key Access in a nutshell

BKA is about accessing tables in batches when running nested-loop joins. The benefits of batching table accesses are that:

  • “Remote” engines save on the number of roundtrips
  • Disk-based engines do reads in disk order instead of randomly probing the table, which is easier on the disk cache and takes advantage of prefetching

Batched Key Access only works if the used storage engine supports it. At the moment there is support for the MyISAM, InnoDB, Maria, and Falcon engines (these are disk-based) and for NDB (this one is remote).
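
As a rough illustration, BKA pays off for joins that do many index probes into a big table (a sketch with made-up tables; whether BKA is actually used also depends on the server's join-buffer settings):

select *
from orders, customers
where customers.id = orders.customer_id;
-- without BKA: one index probe into customers per orders row, in random order;
-- with BKA: key values from a batch of orders rows are collected first, and
-- customers is then accessed in (roughly) disk order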


At the moment there's no manual chapter yet. There is a short introduction on the Batched_Key_Access page on the forge, and there are MySQL Conference 2008 session slides. The slides cover some benchmarking and give an idea of what kind of queries and datasets you need to get speedups with MyISAM/InnoDB. We've seen great speedups with NDB as well but haven't published anything so far.


With Batched Key Access and condition pushdown, it is now feasible to create a remote table engine with decent performance. We do have a remote engine, ha_federated, but it supports neither BKA nor condition pushdown, and it is death by latency for any query that does not match the

SELECT * FROM table WHERE primary_key=const

pattern. I have a strong temptation to code a performance version of ha_federated myself, but I have to resist it, as there is subquery optimization work to be finished and optimizer “bugs” to be addressed.

This is now a rather low-hanging fruit, any takers?