I’ve got a complaint about a very basic query that performs poorly, and I’m not sure how to handle it. The case is MDEV-25707. The query is very simple:
create table t (
  key_col int,
  ...
  INDEX idx1(key_col)
);
select *
from t
where
  key_col=const1 or
  key_col=const2 or
  -- and so forth, about 5K constants
In MariaDB 10.3, this query used a range scan. The EXPLAIN looked like so:
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| id   | select_type | table | type  | possible_keys | key  | key_len | ref  | rows   | Extra                 |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
|    1 | SIMPLE      | t     | range | idx1          | idx1 | 3       | NULL | 101358 | Using index condition |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
In MariaDB 10.5, it uses a full table scan, which is a surprise:
+------+-------------+-------+------+---------------+------+---------+------+---------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+---------+-------------+
|    1 | SIMPLE      | t     | ALL  | idx1          | NULL | NULL    | NULL | 1761063 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+---------+-------------+
and the query is slow. Why?
A bit of background: normally, the MySQL/MariaDB optimizer uses index dives to estimate how many rows will match each key_col=const range. This produces fairly precise estimates, but each range needs its own dive (in this case, 5K of them), which can be relatively expensive.
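To illustrate, here is a minimal sketch (const1 and const2 stand in for two of the query's constants): with only a couple of equality ranges, the optimizer still does index dives, so the estimate stays accurate:

explain select * from t where key_col=const1 or key_col=const2;
-- with few equality ranges the optimizer dives into the index
-- (records_in_range calls) for each range, so the "rows" estimate
-- reflects the actual number of matching rows, even for rare values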
To combat this, MySQL, and later MariaDB, introduced the @@eq_range_index_dive_limit feature (MySQL docs, MariaDB docs). The idea is that when the query has more equality ranges than this limit, the optimizer does not do index dives and uses index cardinality statistics instead.
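A quick way to see the threshold (the default differs between versions; in recent MariaDB releases it is 200):

select @@eq_range_index_dive_limit;
-- the query above has about 5000 equality ranges, far more than
-- the limit, so the optimizer skips the dives and estimates the
-- whole OR-list from index cardinality statistics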
In this dataset, the index statistics say there are about 70 rows for each key value. This gives an estimate of 70 * 5K = 350K rows in total. However, the query uses constants that either occur rarely or are not present in the dataset at all, so it produces only about 120 rows. Because of that, a range scan is much cheaper than a full table scan.
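The 70-rows-per-value figure comes from the index cardinality: roughly, the number of rows in the table divided by the number of distinct key values in idx1. A rough way to inspect it:

show index from t;
-- table rows / Cardinality of idx1 ≈ 70 rows per value, hence the
-- 70 * 5000 ≈ 350K estimate for the whole OR-list, while the query
-- actually returns only about 120 rows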
The obvious suggestion is to set eq_range_index_dive_limit higher (or to 0, which means “never use index statistics”), but the users would like the optimizer to make the right choice automatically, without any tuning.
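For reference, that manual workaround is a one-liner, shown here at session scope (it can also be set globally or in the configuration file):

set session eq_range_index_dive_limit=0;      -- 0: never use index statistics, always dive
-- or raise it above the number of equality ranges in the query:
set session eq_range_index_dive_limit=10000;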
Certain kinds of histograms, e.g. Most-Common-Value (MCV) lists, allow one to detect that the constants used by the query are very common. But is there any kind of statistics that can detect the opposite – that the values used are uncommon?