Optimizations for MIN/MAX with implicit and explicit grouping

Varun Gupta has pushed a fix for MDEV-24353. I was the reviewer. While investigating and fixing the issue, we had to revisit optimizations for MIN/MAX and different kinds of grouping.

The SQL language has two kinds of grouping. “Explicit grouping” is done when the select has a GROUP BY clause. “Implicit grouping” happens when the query has aggregate function(s) but doesn’t have a GROUP BY clause, like

select count(*) from tbl

Here, there is only one group that contains the entire select output.

MIN/MAX Optimizations

Let’s start with implicit grouping and a basic query:

select min(col) from tbl

If there is an (ordered) index over col, one can resolve the query with a single index lookup: look into the index to find the first (smallest) value (if the column col is NULLable, find the first non-NULL value). If the WHERE condition has a range:

select min(col) from tbl where col >= 10 and col <20

one can make a lookup to find the first value of col that is greater than or equal to 10. The range can be on a non-first key part, as long as all key parts before it are bound by equalities:

select min(keypart2) from tbl where keypart1='foo' and keypart2>3

But that’s all that is allowed. If the WHERE clause has anything else, the optimizer will need to scan the index until it finds a value that matches the WHERE, as in the sketch below. It is also apparent that this approach works for MIN and MAX and won’t work for any other aggregate, like COUNT or SUM.
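A hypothetical counter-example (other_col is assumed to be a column that is not in the index): the last condition cannot be checked from the index lookup alone, so a single lookup no longer suffices:

select min(keypart2) from tbl where keypart1='foo' and keypart2>3 and other_col=5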

This optimization doesn’t have a name. It is used whenever it is applicable (that is, it is not cost-based) because making one index lookup is as cheap as one can get.
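One can see the optimization being used in EXPLAIN output. A minimal sketch (the table definition here is made up); the Extra column typically reads “Select tables optimized away”:

create table tbl (col int, key(col));
explain select min(col) from tbl;
-- Extra: Select tables optimized away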

MIN/MAX and explicit grouping

With explicit grouping, one can use the same approach. The difference is that there are potentially many groups. This means one needs the groups to be separate (not interleaved) and a way to jump from one group to the next. Both are achieved by having the GROUP BY columns at the start of the index:

select min(keypart2) from tbl where keypart2>10 group by keypart1

Each group is processed by two lookups:

  1. An index navigation command to “find the next different value of keypart1”
  2. An index navigation command to “find the first value with keypart2>10”

However, these “jumps” only make sense if the groups are large enough, so the choice of whether to use this optimization is cost-based. The jumps gave the optimization its name: it is called Loose Index Scan.
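When Loose Index Scan is used, EXPLAIN indicates it as well. A sketch, assuming tbl has an index over (keypart1, keypart2); the Extra column typically includes “Using index for group-by”:

explain select min(keypart2) from tbl where keypart2>10 group by keypart1
-- type: range, Extra: Using where; Using index for group-by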

The problem in MDEV-24353

The testcase had this:

-- The table has PRIMARY KEY (p,a);

-- Query with implicit grouping (fast)
select MIN(a) from t1 where p = 2;

-- Query with explicit grouping (slow)
select MIN(a) from t1 where p = 2 group by p;

The query with implicit grouping used the optimization and was instant. The query with explicit grouping wasn’t using Loose Scan and was slow. That is, the optimization for implicit grouping worked, while the one for explicit grouping didn’t.

Eventually, we figured out the cause: there was an old heuristic rule, “ref access is better than range access on the same index using the same number of key parts”. This is true in general, but not for Loose Scan (which is range-like access internally). The fix was to adjust the heuristic.

A side note: surprisingly, PostgreSQL exhibits similar behavior. It used an efficient query plan for the query with explicit grouping and an inefficient one for the query with implicit grouping (paste). Does it have an optimization similar to MySQL/MariaDB’s Loose Scan?

Replication at cluster edge

I had to do some work fixing replication and wanted to share the experiences.

Short:

  • MariaDB is getting new storage engines that share data between MariaDB instances.
  • There is a demand to run these “shared-disk” engines alongside regular engines and replication.
  • Replication breaks when changes to some tables from the primary (aka master) “automagically” appear on the replica (aka slave).
  • MariaDB is getting support for running “shared disk” engines alongside replication. The user experience is not entirely seamless yet, but it’s getting better.

Longer:

MariaDB is getting new storage engines:

The first one is the S3 Storage Engine, available in MariaDB 10.5. It’s basically a read-only Aria table that is stored in Amazon S3 or a compatible service. Its target use case is archive data that doesn’t get read much, in cloud environments where local storage is pay-per-gigabyte. In this setting, it pays off to put the archive into an S3 bucket and then share it between all the MariaDB instances you have. Note that only the archive is shared; for non-archive data you’ll still want replication.
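For illustration, a sketch of how a table ends up in S3 (the table name is made up; ALTER TABLE … ENGINE=S3 is the documented way to convert a table):

-- convert an existing archive table into a shared, read-only S3 table
alter table old_orders engine=S3;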

The second new engine is the Xpand Storage Engine, which is only available in MariaDB Enterprise. It connects to an Xpand cluster (formerly known as Clustrix). Xpand is a clustered database targeting mostly OLTP workloads. Think NDB Cluster, but one that can also speak SQL.

If you’re just using Xpand, you don’t need replication:

  • All MariaDB nodes in the cluster immediately see all data
  • As for table definitions, the Table Discovery feature allows one MariaDB instance to learn about the tables created through other MariaDB instances.

But it turns out there’s a demand to still have replication enabled. It’s used for hybrid Xpand-and-InnoDB setups, for propagating MariaDB-level entities like VIEWs, and so on.

The problem

The base problem: when the replica (slave) starts to apply changes to the shared table, the data is already there. Moreover, if the replica has a delay, then the primary (master) might have already made further updates to it. Applying older updates from binlog will undo them and/or cause errors.

Also, we cannot binlog any statements that read the shared tables, as the data might no longer be there when the replica applies the statement. This includes statements like “ALTER TABLE xpand_table ENGINE=InnoDB”.

The solution

I think the ideal would be a primary-only solution where the binlog is not allowed to have any references to the shared tables. But this wasn’t the choice that was made. As far as I understand, Monty wanted the binlog to also be usable as a self-contained backup. As a consequence, one needs special behavior both on the primary and on the replica.

Settings on the primary (master):

  • For S3: s3_replicate_alter_as_create_select=1
  • For Xpand: xpand_replicate_alter_as_create_select=1 and xpand_master_skip_dml_binlog=1

Settings for the replica (slave):

  • For S3: s3_slave_ignore_updates=1
  • For Xpand: xpand_slave_ignore_ddl=1

Unfortunately, the solution is not complete. One needs to be careful when creating a table in one engine and then changing it to another, as sketched below. This can confuse a delayed replica: MDEV-24351, MDEV-24367.
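A hypothetical example of the risky pattern:

create table t1 (a int) engine=InnoDB;
-- ... the table is written to, binlogged, replicated ...
alter table t1 engine=S3;  -- a delayed replica may still be applying InnoDB-era events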

But if one is careful, things seem to work. As it usually is with replication :-).

On JSON Path in MariaDB

Short version:

  • JSON Path in MariaDB is now documented: https://mariadb.com/kb/en/jsonpath-expressions/
  • It supports a subset of JSON Path from SQL:2016. The subset is close to what MySQL supports.
  • It does have an extension to do recursive searches.
  • The search for matches is done “on the fly”, without constructing the document tree.

Longer version:

I got interested in JSON Path for two reasons: first, I was using it to extract portions of Optimizer Trace. Second, I am reviewing the patch for JSON_TABLE.

Unfortunately there wasn’t a description of MariaDB’s JSON Path dialect anywhere. Did everybody think paths are too obvious and something like foo.bar.baz is too trivial to document?

Well, it is not that simple. The definition of JSON Path in the SQL:2016 standard takes 23 pages. Path expressions can include filters, functions and arithmetic expressions (MariaDB doesn’t support any of those). One thing that the standard does NOT include is the capability to do recursive searches. You can ask for all members of an object or an array:

$.foo.*

and you can find bar in the grand-children of foo:

$.foo.*.bar

but there’s no way to search for bar everywhere in the tree. MariaDB, MySQL and PostgreSQL have extensions for recursive searches:

$**.bar

will search for bar everywhere in the document. (In PostgreSQL, the syntax is slightly different: $.**.bar). This is very useful for querying the Optimizer Trace:

select
  json_extract(trace, '$**.analyzing_range_alternatives')
from
  information_schema.optimizer_trace
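A smaller, self-contained illustration (the JSON value here is made up): the recursive path finds matches at any depth, and json_extract wraps multiple matches in an array:

select json_extract('{"foo": {"x": {"bar": 1}}, "bar": 2}', '$**.bar');
-- returns something like [2, 1]; the order of matches may vary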

Some notes about the implementation

MariaDB doesn’t support filter conditions or [last] as an array index. This means we can walk through the JSON document and, as soon as we see a node, tell whether it is a match or not. There is no need to build a parse tree or perform any kind of look-ahead.

As far as I understand, MySQL may or may not build the parse (DOM) tree, depending on the properties of the search path.

Slides and video: ANALYZE for statements: MariaDB’s hidden gem

My second talk at the MariaDB Fest was ANALYZE for statements: MariaDB’s hidden gem (slides, video). It also includes a comparison with MySQL and its EXPLAIN ANALYZE feature. MariaDB’s ANALYZE supports more kinds of statements, and a basic test has shown that it has a lower CPU overhead for workloads with tight loops. I believe this is caused by MariaDB making fewer gettimeofday() calls. It might show fewer details about the query plan, though.

My experience with ANALYZE FORMAT=JSON was that it is immensely useful for troubleshooting query optimizer support issues.
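A minimal usage sketch (the table and query are made up): prefixing a statement with ANALYZE executes it and annotates the plan with actual row counts and timings:

analyze format=json select * from t1 where col1 between 10 and 20;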

Slides and video: Optimizer Trace walkthrough

It’s time to resume blogging, starting with a bit of stale news: yours truly had two talks at the MariaDB Fest. The first one was the Optimizer Trace Walkthrough (video, slides). The talk covers Optimizer Trace in MariaDB and includes a comparison with MySQL’s Optimizer Trace.
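For reference, a minimal sketch of obtaining a trace in MariaDB (the traced query itself is made up):

set optimizer_trace='enabled=on';
select * from t1 where a<10;
select trace from information_schema.optimizer_trace;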

One interesting thing I learned while presenting was that Øystein Grovlen has made a tool to summarize information from the trace: https://github.com/ogrovlen/opttrace. Its main use case is to analyze the join optimizer’s actions: it prints a condensed summary of what join order prefixes were considered and why they were discarded.

The tool doesn’t work with MariaDB, because MariaDB’s trace is slightly different. If I get a support escalation about a join optimization problem, I might try to get opttrace to work with MariaDB’s traces.

My proposals for Percona Live: Window Functions and ANALYZE for statements

I’ve made two session proposals for the Percona Live conference: one about Window Functions and one about ANALYZE for statements.

If you feel these talks are worth it, please vote!

MariaDB 10.1: Better query optimization for ORDER BY … LIMIT

For some reason, we’ve been getting a lot of issues with ORDER BY optimization recently. The fixes have passed Elena Stepanova’s scrutiny and I’ve pushed them to MariaDB 10.1. Now, MariaDB’s ORDER BY ... LIMIT optimizer:

  • Doesn’t make stupid choices when several multi-part keys and potential range accesses are present (MDEV-6402)
  • Always uses “range” access (and not a full “index” scan) when it switches to an index to satisfy ORDER BY … LIMIT (MDEV-6657)
  • Tries hard to be smart and use cost/number of records estimates from other parts of the optimizer (MDEV-6384, MDEV-465, MySQL Bug#36817)
  • Takes full advantage of InnoDB’s Extended Keys feature when checking if filesort() can be skipped (MDEV-6796)

so, if you are struggling with poor optimization of ORDER BY ... LIMIT queries, help might be underway.

I have to admit that optimizations for ORDER BY … LIMIT are still far from perfect (both in MySQL and in MariaDB). The code is old and ugly, and having spent some time debugging it, I can easily believe that it still does a poor job in some cases. The good part is that now we have some fresh knowledge of the ORDER BY … LIMIT optimization code, as well as RQG scripts for testing ORDER BY … LIMIT queries. If you have any issues with ORDER BY … LIMIT optimization, we would appreciate bugs filed for them.

A discovery – Index Condition Pushdown can cause a slowdown after all

MariaDB 5.5 and then MySQL 5.6 got the Index Condition Pushdown (ICP) optimization (initially coded by yours truly). The idea of ICP is simple: after reading the index record, check the part of the WHERE condition that can be computed using index columns, and only then read the table record. That way, we avoid reading table rows that don’t satisfy the index condition.
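A sketch of a query shape where ICP helps (the table is made up): the condition on b can be checked from the index entry alone, so rows that fail it are never read:

create table t1 (a int, b varchar(100), c text, key(a,b));
explain select * from t1 where a=5 and b like '%foo%';
-- Extra typically shows: Using index condition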

It seems apparent that ICP can never make things slower. The WHERE clause has to be checked anyway, and not reading certain records can only make things faster.

That was what I thought, too, until recently Joffrey Michaie observed the contrary “in the wild”: we’ve got a real-world case where using Index Condition Pushdown was slower than not using it: MDEV-6713. The slowdown was about 20%, both on MariaDB and MySQL.

From what I could investigate so far, the slowdown is caused by these three factors working together:

  • a VARCHAR(255) column in the index. MySQL’s in-memory data representation for VARCHARs is not space efficient. If a column is defined as VARCHAR(255), any value will occupy the entire 255 bytes.
  • InnoDB’s row prefetch cache. When InnoDB detects that one is reading a lot of rows from an index, it pre-fetches index records and stores them in an internal cache. The cache uses the inefficient in-memory data representation.
  • Design of MySQL’s Item classes. One can’t evaluate an expression over a table record that is in InnoDB’s prefetch cache. Expression evaluation functions expect to find column values in the table’s “primary location”, internally known as table->record[0]. In order for ICP to check the condition, index columns first have to be copied to table->record[0].

I hope we will be able to investigate this problem and post more about it soon. For now, the news is that ICP can cause a slowdown, when the index has big VARCHAR columns.

Priority queue optimization for filesort is now visible in MariaDB 10.0

TL;DR: Priority queue optimization for filesort with small LIMIT is now visible in MariaDB: there is a status variable and you can also see it in the slow query log (KB page link).

A longer variant:
One of the new optimizations in MySQL 5.6 is the ability to use a priority queue instead of sorting for ORDER BY … LIMIT queries. The optimization was ported into MariaDB 10.0 long ago, but we still get questions about if/when it will be ported. I guess the reason for this is that, besides the query speed, you can’t see this optimization: neither EXPLAIN, EXPLAIN FORMAT=JSON, PERFORMANCE_SCHEMA, nor the status variables give any indication of whether filesort used a priority queue or the regular quicksort+merge algorithm.

MySQL 5.6 has only one way to check whether filesort used a priority queue: you need to enable the optimizer trace (SET optimizer_trace='enabled=on'), and then run the query (not EXPLAIN, but the query itself). Then, you can look into the optimizer trace and find something like this:

...
          "filesort_priority_queue_optimization": {
            "limit": 10,
            "rows_estimate": 198717,
            "row_size": 215,
            "memory_available": 262144,
            "chosen": true
          },
...

MariaDB doesn’t support optimizer_trace at the moment. Even if it did, I think it would be wrong to require one to look into the optimizer trace to find out which query plan was picked.

The natural place to show the optimization would be the EXPLAIN output. We could show something like “Using filesort (priority queue)”. This was my initial intent, but after looking into the source code, it turned out to be difficult to do. The logic that makes the choice between using quicksort+merge and using a priority queue is buried deep inside the query execution code. (As if the mess caused by late optimizations of ORDER BY and UNIONs didn’t teach anybody in the MySQL team anything.)

As for query execution, there are two facilities where one could record execution-time details about the query plan. They are the status variables and the slow query log.

Status variables

We’ve added the Sort_priority_queue_sorts status variable. Now, the list of sort-related status variables is:

MariaDB [test]> show status like 'Sort%';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| Sort_merge_passes         | 0     |
| Sort_priority_queue_sorts | 1     |
| Sort_range                | 0     |
| Sort_rows                 | 11    |
| Sort_scan                 | 1     |
+---------------------------+-------+

(Sort_range + Sort_scan) gives the total number of sorts. Sort_priority_queue_sorts gives the number of sorts that were done using a priority queue.
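A usage sketch (the table and query are made up): run a small-LIMIT sort, then check whether the counter grew:

select * from t1 order by col2 limit 10;
show status like 'Sort_priority_queue_sorts';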

Slow query log

Percona’s Extended statistics in the slow query log shows Filesort/Filesort_on_disk fields. We thought that adding information about priority queue use would be appropriate. Now, slow query log entries look like this:

# Time: 140714 18:30:39
# User@Host: root[root] @ localhost []
# Thread_id: 3  Schema: test  QC_hit: No
# Query_time: 0.053857  Lock_time: 0.000188  Rows_sent: 11  Rows_examined: 100011
# Full_scan: Yes  Full_join: No  Tmp_table: No  Tmp_table_on_disk: No
# Filesort: Yes  Filesort_on_disk: No  Merge_passes: 0  Priority_queue: Yes
SET timestamp=1405348239;
select * from t1 where col1 between 10 and 20 order by col2 limit 100;

pt-query-digest is able to parse slow query logs with the new field.

What about PERFORMANCE_SCHEMA

What about PERFORMANCE_SCHEMA? After all, it is the most powerful tool for tracking query execution. It has “absorbed” some status variables into events_statements_history table. For sorting, it has these columns:

| SORT_MERGE_PASSES       | bigint(20) unsigned              | NO   |     | NULL    |       |
| SORT_RANGE              | bigint(20) unsigned              | NO   |     | NULL    |       |
| SORT_ROWS               | bigint(20) unsigned              | NO   |     | NULL    |       |
| SORT_SCAN               | bigint(20) unsigned              | NO   |     | NULL    |       |

Should we add a SORT_PRIORITY_QUEUE_SORTS column there? We didn’t add it into 10.0 right away because of compatibility concerns: some tools may rely on the structure of PERFORMANCE_SCHEMA tables. Also, PERFORMANCE_SCHEMA table definitions are stored on disk, and one would have to run mysql_fix_privilege_tables after a minor upgrade, which is not good.