I’ve spent some time working on the join optimizer recently. Here are the slides covering MySQL/MariaDB join optimizer internals and the improvements that we have made or are making to it in MariaDB:
MariaDB 10.9: SHOW ANALYZE and EXPLAIN FOR CONNECTION
We’re trying to get two features into MariaDB 10.9: SHOW ANALYZE and EXPLAIN FOR CONNECTION. I’ve made a post about them in MariaDB Foundation’s blog:
Slides: improving selectivity estimates in MariaDB
Slides from my talk at MariaDB Server Fest 2021, “Improving MariaDB’s query optimizer with better selectivity estimates”, are here:
Direct link: pdf.
JSON_TABLE: implementer notes
MariaDB 10.6 is getting stable, and one of its features is JSON_TABLE. I was the code reviewer for it. Here are slides that cover JSON_TABLE from the user point of view. In this post, I wanted to share some experiences from implementing JSON_TABLE.
Contents:
- It’s all about JSON Path engine
- Supporting Table Functions in the query optimizer
- Name resolution rules for JSON_TABLE’s first argument
- JSON_TABLE is non-deterministic
It’s all about JSON Path engine
JSON_TABLE relies heavily on the JSON Path engine. JSON_TABLE's first argument is a JSON Path expression which points to the JSON elements that will produce the JSON_TABLE's "rows". Then, there are JSON Path expressions which point to the values of the columns.
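For illustration, here is a made-up example (the document and the paths are mine, not from the patch): '$.items[*]' is the "row" path, and the per-column paths are evaluated relative to each row element:

select jt.*
from json_table(
  '{"items": [{"name": "pen", "price": 2}, {"name": "ink", "price": 5}]}',
  '$.items[*]'                         -- the "row" path: one row per array element
  columns (
    name  varchar(10) path '$.name',   -- column paths, relative to the row element
    price int path '$.price'
  )
) as jt;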
MariaDB had a JSON parser and a JSON Path evaluator before we started work on JSON_TABLE. They are integrated: there's a single API to parse a JSON document and locate the nodes pointed to by a JSON Path expression. JSON_TABLE has a "primary" parser/path evaluator object to enumerate the JSON elements for its "rows". For each row, the primary parser/path evaluator is forked, and the fork(s) are used to compute the values of the column(s). This means there is double parsing of JSON structures. On the other hand, no in-memory structures are built, so the memory footprint is low.
Supporting Table Functions in the query optimizer
JSON_TABLE's argument can refer to columns of other tables. That is, the contents of JSON_TABLE's table change depending on the current rows of other tables. This puts some limitations on where JSON_TABLE can be placed in the query plan. This is similar to LATERAL derived tables (which MariaDB doesn't support yet).
Fortunately for the optimizer, there are some syntactic restrictions about JSON_TABLE’s argument that prevent one from creating complex and/or invalid dependencies with use of JSON_TABLE.
Name resolution rules for JSON_TABLE’s argument
Basically, JSON_TABLE’s argument can only contain references to tables that “precede” the JSON_TABLE in the FROM clause:
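For example (the table and column names below are made up for illustration):

-- OK: t1 precedes the JSON_TABLE call in the FROM clause
select * from t1, json_table(t1.json_col, '$[*]' columns (a int path '$.a')) as jt;

-- Not allowed: the JSON_TABLE call refers to a table that follows it
select * from json_table(t1.json_col, '$[*]' columns (a int path '$.a')) as jt, t1;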

This rules out things like circular dependencies between multiple JSON_TABLE(…) invocations or JSON_TABLE’s dependencies being in conflict with LEFT JOIN dependencies.
RIGHT JOIN is trickier. MariaDB follows MySQL's approach and declares that for RIGHT JOIN, we assume that its outer side "precedes" the inner side. This way, one can't write a query with a dependency loop. Oracle 18c accepts queries with conflicting dependencies like "t1 RIGHT JOIN JSON_TABLE(t1.col)" (example in fiddle), but I don't know what semantics it has for them.
JSON_TABLE is non-deterministic
This was discovered fairly late in development by Elena Stepanova. Let's define a non-deterministic function as: a function that may return non-equivalent values for equivalent inputs.
Then, let’s consider two JSON documents. As far as JSON data model goes, they are identical:
set @doc1= '{"color": "black", "price": 100 }';
set @doc2= '{"price": 100, "color": "black"}';
and consider this JSON_TABLE call (with @doc set to one of the documents above):
select * from
json_table(@doc, '$.*' columns (id for ordinality,
val varchar(10) path '$')) as JT;
In MariaDB, this produces 1-black, 2-100 for @doc1 and 1-100, 2-black for @doc2. This is because MariaDB computes the JSON Path in a streaming fashion, and '$.*' returns object members in the order they appear in the JSON document.
MySQL, on the other hand, will return 1-100,2-black for both documents. This is because it parses the JSON and orders members by name before computing JSON Path.
Both seem to be valid, but then JSON_TABLE is not a "function" in the mathematical sense.
PostgreSQL and ON expressions
I’ve stumbled upon this shortcoming of PostgreSQL optimizer: the restrictions in the LEFT JOIN’s ON expression are not used to access the outer table. The testcase is fairly basic:
test=# explain
test-# select *
test-# from t1 left join t2 on (t1.key1<100 AND t2.key1=t1.col);
                               QUERY PLAN
------------------------------------------------------------------------
 Hash Left Join  (cost=30832.00..73228.00 rows=1000000 width=16)
   Hash Cond: (t1.col = t2.key1)
   Join Filter: (t1.key1 < 100)
   ->  Seq Scan on t1  (cost=0.00..14425.00 rows=1000000 width=8)
   ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
         ->  Seq Scan on t2  (cost=0.00..14425.00 rows=1000000 width=8)
(6 rows)
The optimizer prefers to use a Seq Scan to read 1M rows from table t1, despite the condition "t1.key1<100", which matches only 100 rows and has a suitable index. If I change LEFT JOIN to INNER JOIN, the optimizer will use the index:
test=# explain
test-# select *
test-# from t1 inner join t2 on (t1.key1<100 and t2.key1=t1.col);
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Nested Loop  (cost=5.59..1136.23 rows=95 width=16)
   ->  Bitmap Heap Scan on t1  (cost=5.16..341.25 rows=95 width=8)
         Recheck Cond: (key1 < 100)
         ->  Bitmap Index Scan on t1_key1  (cost=0.00..5.14 rows=95 width=0)
               Index Cond: (key1 < 100)
   ->  Index Scan using t2_key1 on t2  (cost=0.42..8.36 rows=1 width=8)
         Index Cond: (key1 = t1.col)
The script to reproduce is here.
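For reference, a rough sketch of the schema behind these plans (the exact data distribution is my guess, not the original script):

create table t1(key1 int, col int);
create table t2(key1 int);
create index t1_key1 on t1(key1);
create index t2_key1 on t2(key1);
-- ~1M rows in each table, then refresh the planner statistics
insert into t1 select i, i from generate_series(1, 1000000) i;
insert into t2 select i from generate_series(1, 1000000) i;
analyze t1;
analyze t2;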
An interesting case with key_column=rare_value
I’ve got a complaint about a very basic query that works poorly and I’m not sure how to handle it. The case is MDEV-25707. The query is very simple:
create table t (
  key_col int,
  ...
  INDEX idx1(key_col)
);

select *
from t
where
  key_col=const1 or
  key_col=const2 or
  -- and so forth, about 5K constants
In MariaDB 10.3, this query used a range scan. The EXPLAIN looked like so:
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| id   | select_type | table | type  | possible_keys | key  | key_len | ref  | rows   | Extra                 |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
|    1 | SIMPLE      | t     | range | idx1          | idx1 | 3       | NULL | 101358 | Using index condition |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
In MariaDB 10.5, it uses a full table scan, which is a surprise:
+------+-------------+-------+------+---------------+------+---------+------+---------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+---------+-------------+
|    1 | SIMPLE      | t     | ALL  | idx1          | NULL | NULL    | NULL | 1761063 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+---------+-------------+
and the query is slow. Why?
A bit of background: normally, the MySQL/MariaDB optimizer uses index dives to estimate how many rows will match each key_col=const range. This produces fairly precise estimates, but it requires index dives (in this case, 5K of them), which can be relatively expensive.
To combat this, MySQL and then MariaDB introduced the @@eq_range_index_dive_limit feature (MySQL docs, MariaDB docs). The idea is: when the query has more equality ranges than this limit, the optimizer will not do index dives but will use index cardinality statistics instead.
In this dataset, index statistics say there are about 70 rows with each key value. This gives an estimate of 70*5K = 350K rows in total. However, the query uses constants that either occur rarely or are not present in the dataset at all, so the query produces only 120 rows. Because of that, a range scan is actually much cheaper than a full table scan.
The obvious suggestion is to set eq_range_index_dive_limit higher (or to 0, which means "never use index statistics"), but users would like the optimizer to make the right choice automatically, without any tuning.
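For completeness, the manual workaround looks like this (shown per-session; it can also be set globally or in the config file):

-- Always do index dives, never fall back to index statistics:
set eq_range_index_dive_limit = 0;
-- ... or raise the limit above the number of equality ranges in the query:
set eq_range_index_dive_limit = 10000;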
Certain kinds of histograms, e.g. Most-Common-Value lists, allow one to detect that the constants used by the query are very common. But is there any kind of statistics to detect the opposite: that the values used are uncommon?
Slides: JSON Support: news, non-news, and the bigger picture
The slides from my FOSDEM'2021 talk are here.
Speaking at FOSDEM about JSON in MariaDB
My talk titled “JSON Support: news, non-news, and the bigger picture” was accepted to MariaDB devroom at FOSDEM’21.
This means that on February 6th there will be a video stream and a Q&A.
Optimizations for MIN/MAX with implicit and explicit grouping
Varun Gupta has pushed a fix for MDEV-24353. I was the reviewer. While investigating and fixing the issue, we had to revisit optimizations for MIN/MAX and different kinds of grouping.
The SQL language has two kinds of grouping. "Explicit grouping" is done when the select has a GROUP BY clause. "Implicit grouping" happens when the query has aggregate function(s) but doesn't have a GROUP BY clause, like:
select count(*) from tbl
Here, there is only one group that contains the entire select output.
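For contrast, a query with explicit grouping produces one group per distinct value of the GROUP BY expression:

select col, count(*) from tbl group by col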
MIN/MAX Optimizations
Let's start with implicit grouping and a basic query:
select min(col) from tbl
If there is an (ordered) index over col, one can resolve the query by making one index lookup: look into the index to find the first (smallest) value (if column col is NULLable, find the first non-NULL value). If the WHERE condition has a range:
select min(col) from tbl where col >= 10 and col <20
one can make a lookup to find the first value of col that's greater than or equal to 10. The range can be on a non-first key part, as long as all key parts before it have equalities:
select min(keypart2) from tbl where keypart1='foo' and keypart2>3
But that's all that is allowed. If the WHERE clause has anything else, the optimizer will need to scan the index until it finds a value that matches the WHERE. It is also apparent that this works for MIN and MAX but won't work for any other aggregate like COUNT or SUM.
This optimization doesn’t have a name. It is used whenever it is applicable (that is, it is not cost-based) because making one index lookup is as cheap as one can get.
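A quick way to check whether it was applied is EXPLAIN; when the optimization is used, the Extra column shows "Select tables optimized away" (a sketch; exact output varies between versions):

explain select min(col) from tbl;
-- Extra: "Select tables optimized away" means the MIN value was fetched
-- with a single index lookup, without scanning the table.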
MIN/MAX and explicit grouping
With explicit grouping, one can use the same approach. The difference is that there are potentially many groups. This means one needs the groups to be separate (not interleaved) and a way to jump from one group to the next. Both are achieved by having the GROUP BY columns at the start of the index:
select min(keypart2) from tbl where keypart2>10 group by keypart1
Each group is processed by two lookups:
- An index navigation command to “find the next different value of keypart1”
- An index navigation command to “find the first value with keypart2>10”
However, these "jumps" only make sense if the groups are large enough. So, the choice of whether to use this optimization is cost-based. The jumps gave the optimization its name: it is called Loose Index Scan.
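One can see whether Loose Index Scan was chosen from EXPLAIN: it shows up as a range access with "Using index for group-by" in the Extra column (a sketch, assuming an index on (keypart1, keypart2)):

explain select min(keypart2) from tbl where keypart2>10 group by keypart1;
-- type: range, Extra: "Using index for group-by" indicates Loose Index Scan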
The problem in MDEV-24353
The testcase had this:
-- The table has PRIMARY KEY (p,a);
-- Query with implicit grouping (fast)
select MIN(a) from t1 where p = 2;
-- Query with explicit grouping (slow)
select MIN(a) from t1 where p = 2 group by p;
The query with implicit grouping used the optimization and was instant. The query with explicit grouping wasn't using Loose Scan and was slow. That is, the optimization for implicit grouping worked, while the one for explicit grouping didn't.
Eventually, we figured out the cause: there was an old heuristic rule, "ref access is better than range access on the same index using the same number of key parts". This is true in general, but not for Loose Scan (which is a range-like access internally). The fix was to adjust the heuristic.
A side note: surprisingly, PostgreSQL exhibits similar behavior: it used an efficient query plan for the query with explicit grouping and an inefficient query plan for the query with implicit grouping: paste. Does it have an optimization that's similar to MySQL/MariaDB's Loose Scan?
Replication at cluster edge
I had to do some work fixing replication and wanted to share the experiences.
Short:
- MariaDB is getting new storage engines that share data between MariaDB instances.
- There is a demand to run these "shared-disk" engines alongside regular engines and replication. Like so:

- Replication breaks when, for some tables, changes from the primary (aka master) "automagically" appear on the replica (aka slave).
- MariaDB is getting support for running “shared disk” engines alongside with replication. The user experience is not entirely seamless yet, but it’s getting better.
Longer:
MariaDB is getting new storage engines:
The first one is the S3 Storage Engine, available in MariaDB 10.5. It's basically a read-only Aria table that is stored in Amazon S3 or a compatible service. Its target use case is archive data that doesn't get read much, in cloud environments where local storage is pay-per-gigabyte. In this setting, it pays off to put the archive into an S3 bucket and then share it between all the MariaDB instances you have. Note that only the archive is shared; for non-archive data you'll still want replication.
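The typical workflow is to move a table to S3 once it stops being updated, something like this (the table name is just an example):

alter table sales_2019 engine=S3;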
The second new engine is the Xpand Storage Engine. It is only available in MariaDB Enterprise. It connects to an Xpand cluster (formerly known as Clustrix). Xpand is a clustered database targeting mostly OLTP workloads. Think of an NDB Cluster that can also speak SQL.
If you’re just using Xpand, you don’t need replication:
- All MariaDB nodes in the cluster immediately see all data
- As for table definitions, the Table Discovery feature allows one MariaDB instance to learn about tables created through other MariaDB instances.
But it turns out there's a demand to still have replication enabled. It's used for hybrid Xpand-and-InnoDB setups, for propagating MariaDB-level entities like VIEWs, etc.
The problem
The base problem: when the replica (slave) starts to apply changes to a shared table, the data is already there. Moreover, if the replica has a delay, then the primary (master) might have already made further updates to it. Applying older updates from the binlog will undo them and/or cause errors.
Also, we cannot binlog any statements that read the shared tables, as the data might no longer be there when the replica applies the statement. This includes e.g. statements like "ALTER TABLE xpand_table ENGINE=InnoDB".
The solution
I think the ideal would be to have a primary-only solution where the binlog is not allowed to have any references to the shared tables. But that wasn't the choice that was made. As far as I understand, Monty wanted the binlog to also be usable as a self-contained backup. As a consequence, one needs special behavior both on the primary and on the replica.
Settings on the primary (master):
- S3: s3_replicate_alter_as_create_select=1
- Xpand: xpand_replicate_alter_as_create_select=1, xpand_master_skip_dml_binlog=1

Settings on the replica (slave):
- S3: s3_slave_ignore_updates=1
- Xpand: xpand_slave_ignore_ddl=1
Unfortunately, the solution is not complete. One needs to be careful when creating a table in one engine and then changing it to another. This can confuse a delayed replica: MDEV-24351, MDEV-24367.
But if one is careful, things seem to work. As it usually is with replication :-).