Making “tbl.utf8mb3_key_column=utf8mb4_expr” sargable

Short: Currently, a condition like tbl.utf8mb3_key_column=utf8mb4_expr is not sargable, that is, the optimizer can’t use it to construct index lookups with utf8mb3_key_column. But it’s possible to make it sargable, the patch for MDEV-32113 is now in code review.

Long: First, let’s discuss why this is relevant:

The issue

MySQL and MariaDB support two variants of UTF-8: utf8mb3 is actually a subset of UTF-8 which only includes characters from the Basic Multilingual Plane. utf8mb4 covers the whole UTF-8 (all planes) and can store any Unicode character. One can also just specify “utf8“:

create table t1 (col varchar(100) character set utf8);

Before MariaDB 10.6, utf8 meant utf8mb3. Starting from MariaDB 10.6, its meaning is controlled by the @@old_mode variable. By default, old_mode=UTF8_IS_UTF8MB3, and one gets the old behavior where utf8 means utf8mb3. If one requests the new behavior by setting old_mode='', utf8 means utf8mb4.

The general trend is to switch to using utf8mb4 for everything. But due to the above, an old application or database can end up with a mix of utf8mb3 and utf8mb4 columns. And have queries that do joins on equalities between mb3 and mb4 columns:

-- tbl1 has an index
create index IDX1 on tbl1(mb3_key_col);

-- join on an indexed column
select * from tbl1, tbl2 where tbl1.mb3_key_col = tbl2.mb4_col

Current MariaDB will handle the cross-charset comparison by injecting a CONVERT call to bring mb3 column to mb4. The WHERE will be rewritten into:

... where convert(tbl1.mb3_key_col using utf8mb4) = tbl2.mb4_col

But then, the optimizer won’t consider using index IDX1(mb3_key_col) to perform the join operation.

Fixing it

Can a string comparison operation over utf8mb4 be translated into an index lookup for utf8mb3? It depends on which collation is used. The default and most popular collation for utf8 is utf8mb{3,4}_general_ci, and in that collation the translation is possible.

The Basic Multilingual Plane (and so, utf8mb3) has a special character, the Unicode Replacement Character. According to sorting rules of utf8mb4_general_ci, any character that is outside of the Basic Multilingual Plane is compared as equal to the Unicode Replacement Character.

One can check this from SQL prompt using the WEIGHT_STRING function. Let’s check the weights of the Replacement Character and some characters outside the Basic Multilingual plane (I picked a smiley and letters from Old Turkish):

MariaDB> set names utf8mb4;
Query OK, 0 rows affected (0.000 sec)

MariaDB> create table test_weights(col varchar(32) collate utf8mb4_general_ci);
Query OK, 0 rows affected (0.003 sec)

MariaDB> insert into test_weights values ...;

MariaDB> select col, hex(weight_string(col)) from test_weights;
+------+-------------------------+
| col  | hex(weight_string(col)) |
+------+-------------------------+
| �    | FFFD                    |
| 😊     | FFFD                    |
| 𐰌     | FFFD                    |
| 𐰒     | FFFD                    |
...
+------+-------------------------+

They all compare the same. That is, for the purpose of comparison, we can take an utf8mb4 value and replace all non-BMP (4-byte) characters with the Replacement Character. The obtained string is an utf8mb3 and so can be used for index lookups in the index that uses utf8mb3. This is what the patch for MDEV-32113 does.

3 thoughts on “Making “tbl.utf8mb3_key_column=utf8mb4_expr” sargable

Leave a comment