Using Linux perf: do we need to pass identifying info as arguments to important functions?

TL;DR:

  • One can use Linux perf to collect information about where a statement spends its time.
  • But you only get function names. You can’t tell which SP/table/etc. it was.
  • perf can record variables, but for some reason it can’t record this->member_var.
  • We could work around this by passing an “identifying” value like this->name as an argument to “important” functions like sp_head::execute() (which runs a stored routine). Should we?

Longer:

Sometimes one has to analyze statement execution in finer detail than ANALYZE FORMAT=JSON provides. I was involved in such a case recently: an UPDATE statement invoked a trigger which ran multiple SQL statements and invoked two stored functions. ANALYZE FORMAT=JSON showed that the top-level UPDATE statement didn’t have any issues. The issue was inside the trigger, but where exactly?

We used the Linux perf tool. In MariaDB (and MySQL), a stored routine is executed by sp_head::execute(). One can track it like so:

# Add the probe:
perf probe -x `which mariadbd` --add _ZN7sp_head7executeEP3THDb
perf probe -x `which mariadbd` --add _ZN7sp_head7executeEP3THDb%return

# Collect a list of probes 
PROBES=`perf probe -l 'probe_mariadb*' | awk '{ printf " -e %s", $1 } '`;

# Now, PROBES has " -e probe_mariadbd:_ZN7sp_head7executeEP3THDb \
#    -e probe_mariadbd:_ZN7sp_head7executeEP3THDb__return" 

Then you can note your session’s thread id (TODO: does this work when using a thread pool?):
select tid from information_schema.processlist where id=connection_id();
Then have perf record the profile while you run your query:

perf record $PROBES -t $THREAD_ID
^C

perf script
mariadbd 1625339 [005] 874721.854399: probe_mariadbd:_ZN7sp_head7executeEP3THDb: (55942da55c3a)
mariadbd 1625339 [005] 874722.855064: probe_mariadbd:_ZN7sp_head7executeEP3THDb__return: (55942da55c3a <- 55942da586eb)
mariadbd 1625339 [005] 874722.855102: probe_mariadbd:_ZN7sp_head7executeEP3THDb: (55942da55c3a)
mariadbd 1625339 [005] 874724.855253: probe_mariadbd:_ZN7sp_head7executeEP3THDb__return: (55942da55c3a <- 55942da586eb)

Column #4 is the timestamp in seconds. This is nice, but it’s not possible to tell which SP is which.
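Since each probe fires at function entry and (with %return) at exit, one can compute per-call durations by pairing consecutive entry/return events. A minimal sketch in Python (assuming the perf script output format shown above; not robust against nested or interleaved calls):

```python
import re

def call_durations(perf_script_output):
    """Pair probe entry/return events and yield call durations in seconds."""
    entry_ts = None
    for line in perf_script_output.splitlines():
        # perf script format: comm pid [cpu] timestamp: event: args
        m = re.search(r'\s(\d+\.\d+):\s+(\S+):', line)
        if not m:
            continue
        ts, event = float(m.group(1)), m.group(2)
        if event.endswith('__return'):
            if entry_ts is not None:
                yield ts - entry_ts
                entry_ts = None
        else:
            entry_ts = ts

sample = """\
mariadbd 1625339 [005] 874721.854399: probe_mariadbd:_ZN7sp_head7executeEP3THDb: (55942da55c3a)
mariadbd 1625339 [005] 874722.855064: probe_mariadbd:_ZN7sp_head7executeEP3THDb__return: (55942da55c3a <- 55942da586eb)
mariadbd 1625339 [005] 874722.855102: probe_mariadbd:_ZN7sp_head7executeEP3THDb: (55942da55c3a)
mariadbd 1625339 [005] 874724.855253: probe_mariadbd:_ZN7sp_head7executeEP3THDb__return: (55942da55c3a <- 55942da586eb)
"""
print([round(d, 6) for d in call_durations(sample)])  # → [1.000665, 2.000151]
```

Applied to the trace above, this shows the two sp_head::execute() calls took about 1 and 2 seconds.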

perf allows access to local variables:

perf probe -x `which mariadbd` -V  _ZN7sp_head7executeEP3THDb%return
Available variables at _ZN7sp_head7executeEP3THDb%return
        @<execute+0>
                CSET_STRING     old_query
                Diagnostics_area*       da
                ...
                sp_head*        this
                ...

But when I tried to record this->m_name.str, it failed:

perf probe -x `pwd`/mariadbd  --add '_ZN7sp_head7executeEP3THDb this->m_name.str'
Probe on address 0x880c3a to force probing at the function entry.

this is not a data structure nor a union.
  Error: Failed to add events.

What if the name of the stored function was passed as function argument? I edited MariaDB’s source code and added it (full diff):

--- a/sql/sp_head.cc
+++ b/sql/sp_head.cc
@@ -1192,7 +1192,7 @@
 */
 
 bool
-sp_head::execute(THD *thd, bool merge_da_on_success)
+sp_head::execute(THD *thd, bool merge_da_on_success, const char *sp_name)
 {
   DBUG_ENTER("sp_head::execute");
   char saved_cur_db_name_buf[SAFE_NAME_LEN+1];

After compiling and restarting the server, I was able to add a perf probe that records the name:

perf probe -x `pwd`/mariadbd \
   --add '_ZN7sp_head7executeEP3THDbPKc sp_name:string'
perf record
^C
perf script 

This produced (line breaks added by me):

    mariadbd 1627629 [003] 877069.642164: probe_mariadbd:_ZN7sp_head7executeEP3THDbPKc: (564e917f1c67) 
      sp_name_string="test.func1"
    mariadbd 1627629 [003] 877070.642395: probe_mariadbd:_ZN7sp_head7executeEP3THDbPKc: (564e917f1c67) 
      sp_name_string="test.func2"

sp_name_string shows which stored function was invoked.

The question: should we now go and add such function arguments:

  • For SP execution, show which SP is executed (like it was done in this post)
  • For an SQL statement, show its text
  • For subquery execution, add its select_number
  • … and so forth?

This would also help with crashing bugs: the stack trace would be informative. Currently, crash reports already benefit from dispatch_command() taking the query text as an argument; a random example from MDEV-22262:

#14 0x0000562e90545488 in dispatch_command ( ...
  packet=packet@entry=0x7efdf0007a19 "UPDATE t1 PARTITION (p1) SET a=3 WHERE a=8" ... )

But if the crash happened while running a Prepared Statement, one is out of luck.

Notable optimizer fixes released in February, 2024

MariaDB has made new releases of the stable versions in February. As with the previous release batch, I want to list the notable changes in the query optimizer:

MDEV-32958: Unusable key notes do not get reported for some operations

This is a follow-up to MDEV-32203, which I covered for the previous release: MariaDB now emits a warning for conditions of the form indexed_column CMP_OP const that are unusable by the optimizer. The most common unusable case is varchar_column=INTEGER_CONSTANT, but there are less obvious ones as well, like mismatched collations.

The original patch failed to produce the warning in some cases. Now this is fixed.

Writing varchar_col=INTEGER_CONSTANT looks like a newbie mistake, but it is not: I’ve encountered several such cases in the last couple of months alone, in fairly complex and well-written queries. MDEV-32203 was a good idea.

MDEV-33118: Add optimizer_adjust_secondary_key_costs variable

This is added to address poor join query plans. Consider a query plan using ref access:

select * from t1, t2 where t2.key1=t1.col1 and t2.key2='foo'
+------+-------------+-------+------+---------------+------+---------+---------+------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref     | rows | Extra       |
+------+-------------+-------+------+---------------+------+---------+---------+------+-------------+
|    1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL    | 1000 | Using where |
|    1 | SIMPLE      | t2    | ref  | key1          | key1 | 5       | t1.col1 | 200  | Using where |
+------+-------------+-------+------+---------------+------+---------+---------+------+-------------+

When computing the cost of reading table t2 by doing index lookups on t2.key1=t1.col1, MariaDB tried to take into account that some of the reads would hit the cache. Basically, the total cost of all lookups was capped by the “worst_seeks” value, which was a function of how much we would read from table t2 if we read its matching rows “independently” of table t1.
However, this cap didn’t apply to all possible ref accesses. Ref accesses that have a constant key part (like t2.key2='foo' in this example) “borrowed” their #rows and cost estimates from the range optimizer, and those numbers were not capped.
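Schematically, the capping can be written as follows (a simplification for illustration, not the optimizer’s exact cost formula):

```latex
\mathrm{cost}(\mathrm{ref\ on\ } t_2) \;\approx\; \min\bigl(\,\mathrm{rows}(t_1)\times \mathrm{cost}_{\mathrm{lookup}}(t_2),\ \ \mathrm{worst\_seeks}(t_2)\,\bigr)
```

where worst_seeks(t2) bounds the total cost as if t2’s matching rows were read independently of t1. For the ref accesses with a constant key part, the min() was effectively absent.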

This resulted in very poor query plan choices in some scenarios. The visible effect was that the optimizer picked an obviously bad ref access plan when a better option was clearly present.

Another related issue was the relative cost of clustered index scans versus secondary index scans: the cost of secondary index scans was too low.

Both of these issues are fixed by The Big Cost Model Rewrite in MariaDB 11.0. But if one can’t upgrade to 11.0 yet, one can get these fixes in MariaDB 10.6+ by setting optimizer_adjust_secondary_key_costs accordingly.

MDEV-32148: Inefficient WHERE timestamp_column = datetime_const_expr

MariaDB (and MySQL) has two datatypes for storing points in time: TIMESTAMP and DATETIME.

DATETIME is a “YYYY-MM-DD HH:MM:SS” value that does not specify which time zone it is in.

TIMESTAMP is a point in time. It is “the number of [micro]seconds since midnight January 1st, 1970 GMT”. When you read a TIMESTAMP column, it is converted to ‘YYYY-MM-DD …’ datetime in your local @@session.time_zone.

Consider Query-1 which compares a timestamp column with a datetime literal:

SELECT ... FROM tbl WHERE timestamp_column <= 'YYYY-MM-DD ...'

Here, for each row considered, MariaDB would convert the value of timestamp_column into a DATETIME structure consisting of Year, Month, Day, …, and then compare it with the DATETIME structure representing ‘YYYY-MM-DD …’. But if TIMESTAMPs are just integers, why not compare them as TIMESTAMPs instead?

This is surprisingly complex. First, DATETIME values have a wider range: they span from year 0000 to 9999, while TIMESTAMP covers only 1970 to 2038. Second, DST changes mean that some DATETIME values map to two possible TIMESTAMP points in time: one before the clock is moved backwards, and one after. And when the clock is moved forward, there are DATETIME values that do not map to any TIMESTAMP.
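The range limit is easy to check: assuming the classic 32-bit signed TIMESTAMP representation, the last representable point in time is 2^31−1 seconds after the epoch:

```python
from datetime import datetime, timezone

# TIMESTAMP stores seconds since 1970-01-01 00:00:00 UTC in a 32-bit
# signed counter, so the last representable second is 2^31 - 1.
upper = datetime.fromtimestamp(2**31 - 1, tz=timezone.utc)
print(upper)  # → 2038-01-19 03:14:07+00:00
```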

The patch for MDEV-32148 carefully takes all these limitations into account and makes queries like Query-1 use TIMESTAMP comparisons whenever it’s safe.

Optimizer-related fixes in current batch of stable releases

(EDIT: added also MDEV-32203)

MariaDB has made a batch of releases in the stable series. I wanted to list the notable query optimizer improvements there, together with some comments.

MDEV-29693: ANALYZE TABLE still flushes table definition cache when engine-independent statistics is used

If one has a long-running query and, in parallel with it, runs ANALYZE TABLE to collect EITS statistics (think histograms), this would cause all new queries to wait until the long-running query finishes.

Why was it so? The in-memory representation of column histograms (and other EITS statistics) was shared between all threads. It didn’t look safe to replace the statistics while some other thread could be using them. So the solution was: whenever ANALYZE TABLE had collected new statistics, it forced the old statistics out. This meant blocking any new queries and waiting for the currently running queries to finish. Then the new statistics were loaded and the new queries were unblocked.

The new, more performant solution is to allow different threads to use different “versions” of the statistics objects. Looking back, I can’t help wondering why MariaDB provides MVCC to its users but sometimes has to roll its own baby MVCC internally.

MDEV-31303: Key not used when IN clause has both signed and unsigned values

The optimizer couldn’t handle conditions like

unsigned_key_value IN ( small_constant, big_constant, ...)

where big_constant has the upper bit set (i.e. is greater than 2^(datatype_bits-1)) and small_constant doesn’t. The cause was that semantic analysis treated small_constant as a signed integer and big_constant as unsigned. The non-uniform treatment made the predicate non-sargable. This limitation is now lifted.
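To see why the two constants got different types: a 64-bit literal with the top bit set has no signed 64-bit representation, so it must be treated as unsigned, while small literals default to signed. A bit-level illustration (hypothetical, just to show the reinterpretation):

```python
import struct

big_constant = 2**64 - 1   # top bit set: representable only as unsigned 64-bit

# Reinterpreting the same 64 bits as a signed integer changes the value:
(as_signed,) = struct.unpack('<q', struct.pack('<Q', big_constant))
print(as_signed)  # → -1: compared as signed, the "big" value looks negative
```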

MDEV-32113: utf8mb3_key_col=utf8mb4_value cannot be used for ref access

I’ve covered this in my previous post: Making “tbl.utf8mb3_key_column=utf8mb4_expr” sargable.

MDEV-32351: Significant slowdown for query with many outer joins

This was quite a complicated bug involving LEFT JOINs, the join buffer, and ON expressions that depend only on the left join’s outer table. Apparently a query pattern like this:

t1 LEFT JOIN t2 ON ( condition_thats_rarely_true(t1.columns) AND ... )

is used by real-world queries (Magento), and MariaDB’s Block Nested-Loop join even had an optimization for such cases. That optimization was accidentally disabled by another bugfix a few releases ago.
Alas, there was no test coverage for it, so the regression wasn’t detected. Now we do have test coverage.

MDEV-32475: Skip sorting if we will read one row

This came from a customer case. Consider a subquery of the form:

  SELECT 
    (SELECT value
     FROM t1
     WHERE 
       t1.pk1=outer_table.col1 AND t1.pk2=outer_table.col2
     ORDER BY t1.col LIMIT 1
    ) 
  FROM
    outer_table ...

Table t1 has PRIMARY KEY(pk1, pk2). The subquery’s WHERE condition covers the whole primary key, so we know we will need to read just one row. However, the optimizer failed to detect that and tried to find the best way to handle the ORDER BY … LIMIT. Here, it would manage to make a mistake and pick some index other than the PRIMARY KEY. Reading that index produced records in ORDER BY clause order, but took much longer.

It’s risky to touch the ORDER BY ... LIMIT optimization code in stable versions, so we didn’t do that. What we could do was make the optimizer detect that there is a pk_part_n=outer_ref equality for every part of the primary key, and skip the ORDER BY ... LIMIT handling altogether. The outer_ref part is important here, as pk_part_n=constant was already handled. An outer_ref is not a constant, but it still “binds” the primary key column and makes sorting unnecessary. Btw, it turned out MySQL has the same kludge. Does this mean their ORDER BY … LIMIT optimization code has the same kind of gap?

MDEV-32203: Raise notes when an index cannot be used on data type mismatch

It’s surprising that this is only being done now, in 2023. The issue is ages old. One can use an incorrect column datatype and/or collation, for example storing integers in a VARCHAR column:

create table t1 (
  varchar_col varchar(100),
  ...
);
...
select * from t1 where t1.varchar_col=123;

This looks like a rookie mistake, but we saw it happen by accident at paying customers, too. Such queries produce the intended results most of the time (or even all the time, depending on your data). But under the hood, the comparison is done using the double datatype. If one has added an INDEX(varchar_col), the optimizer won’t be able to use it, as it would have to search for all possible ways one could write the number 123: ‘123.0’, ‘123.000’, ‘+123’, and so forth. EXPLAIN will show that the index isn’t used:

MariaDB [test]> explain select * from t1 where t1.varchar_col=123;
+------+-------------+-------+------+-----------------+------+---------+------+------+-------------+
| id   | select_type | table | type | possible_keys   | key  | key_len | ref  | rows | Extra       |
+------+-------------+-------+------+-----------------+------+---------+------+------+-------------+
|    1 | SIMPLE      | t1    | ALL  | varchar_col_idx | NULL | NULL    | NULL | 1000 | Using where |
+------+-------------+-------+------+-----------------+------+---------+------+------+-------------+
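To see why the index is unusable: the comparison is done in the double domain, where many distinct strings are equal to 123, so an index ordered on the string values cannot be searched. Python floats follow the same IEEE 754 double semantics:

```python
# Distinct VARCHAR values that all convert to the same double value 123.0.
# An index ordered on the strings themselves can't enumerate them all.
variants = ['123', '123.0', '123.000', '+123', '1.23e2', '  123  ']
print(all(float(s) == 123 for s in variants))  # → True
```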

After the fix for MDEV-32203, the EXPLAIN will produce a note (shown by SHOW WARNINGS):

MariaDB [test]> show warnings\G
*************************** 1. row ***************************
  Level: Note
   Code: 1105
Message: Cannot use key `varchar_col_idx` part[0] for lookup: `test`.`t1`.`varchar_col` of type `varchar` = "123" of type `int`

Non-EXPLAIN statements won’t produce warnings by default (compatibility!), but one can make them produce warnings as well using the new note_verbosity variable. Also, it’s possible to get these warnings in the Slow Query Log by adjusting the log_slow_verbosity system variable.

The fix covers equalities between columns as well: t1.varchar_col=t2.int_col will produce a similar warning. For some reason, the fix doesn’t cover IN clauses: varchar_col IN (1,2,3) is not handled. I’m going to check whether this was intentional.

Making “tbl.utf8mb3_key_column=utf8mb4_expr” sargable

Short: Currently, a condition like tbl.utf8mb3_key_column=utf8mb4_expr is not sargable; that is, the optimizer can’t use it to construct index lookups on utf8mb3_key_column. But it’s possible to make it sargable; the patch for MDEV-32113 is now in code review.

Long: First, let’s discuss why this is relevant:

The issue

MySQL and MariaDB support two variants of UTF-8: utf8mb3 is a subset of UTF-8 that only includes characters from the Basic Multilingual Plane, while utf8mb4 covers the whole of UTF-8 (all planes) and can store any Unicode character. One can also just specify “utf8”:

create table t1 (col varchar(100) character set utf8);

Before MariaDB 10.6, utf8 meant utf8mb3. Starting from MariaDB 10.6, its meaning is controlled by the @@old_mode variable. By default, old_mode=UTF8_IS_UTF8MB3, and one gets the old behavior where utf8 means utf8mb3. If one requests the new behavior by setting old_mode='', utf8 means utf8mb4.
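The mb3/mb4 distinction is about encoded length: BMP characters take at most 3 bytes in UTF-8, while characters outside the BMP take 4 (hence “mb3” and “mb4”). A quick check:

```python
# utf8mb3 can store characters that encode to <= 3 bytes (the BMP);
# anything outside the BMP needs 4 bytes and thus utf8mb4.
for ch in ['A', 'é', '€', '😊']:
    print(f'U+{ord(ch):04X}', len(ch.encode('utf-8')), 'bytes')
```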

The general trend is to switch to utf8mb4 for everything. But due to the above, an old application or database can end up with a mix of utf8mb3 and utf8mb4 columns, and with queries that join on equalities between mb3 and mb4 columns:

-- tbl1 has an index
create index IDX1 on tbl1(mb3_key_col);

-- join on an indexed column
select * from tbl1, tbl2 where tbl1.mb3_key_col = tbl2.mb4_col

Current MariaDB will handle the cross-charset comparison by injecting a CONVERT call to bring mb3 column to mb4. The WHERE will be rewritten into:

... where convert(tbl1.mb3_key_col using utf8mb4) = tbl2.mb4_col

But then, the optimizer won’t consider using index IDX1(mb3_key_col) to perform the join operation.

Fixing it

Can a string comparison operation over utf8mb4 be translated into an index lookup for utf8mb3? It depends on which collation is used. The default and most popular collation for utf8 is utf8mb{3,4}_general_ci, and in that collation the translation is possible.

The Basic Multilingual Plane (and so, utf8mb3) has a special character, the Unicode Replacement Character. According to the sorting rules of utf8mb4_general_ci, any character outside of the Basic Multilingual Plane compares as equal to the Unicode Replacement Character.

One can check this from the SQL prompt using the WEIGHT_STRING function. Let’s check the weights of the Replacement Character and of some characters outside the Basic Multilingual Plane (I picked a smiley and letters from Old Turkic):

MariaDB> set names utf8mb4;
Query OK, 0 rows affected (0.000 sec)

MariaDB> create table test_weights(col varchar(32) collate utf8mb4_general_ci);
Query OK, 0 rows affected (0.003 sec)

MariaDB> insert into test_weights values ...;

MariaDB> select col, hex(weight_string(col)) from test_weights;
+------+-------------------------+
| col  | hex(weight_string(col)) |
+------+-------------------------+
| �    | FFFD                    |
| 😊     | FFFD                    |
| 𐰌     | FFFD                    |
| 𐰒     | FFFD                    |
...
+------+-------------------------+

They all compare as equal. That is, for the purpose of comparison, we can take a utf8mb4 value and replace all non-BMP (4-byte) characters with the Replacement Character. The resulting string is valid utf8mb3 and so can be used for lookups in an index that uses utf8mb3. This is what the patch for MDEV-32113 does.
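In effect, the index lookup key is built by folding every non-BMP character of the utf8mb4 value to U+FFFD; a sketch of that transformation in Python (not MariaDB’s actual implementation):

```python
def fold_to_utf8mb3(s: str) -> str:
    """Replace characters outside the Basic Multilingual Plane with the
    Unicode Replacement Character, mirroring utf8mb4_general_ci weights."""
    return ''.join(ch if ord(ch) <= 0xFFFF else '\ufffd' for ch in s)

print(fold_to_utf8mb3('abc😊'))  # folds the smiley to U+FFFD: 'abc\ufffd'
```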

Combinatorial blowups can occur in Name Resolution, too.

Short: There are many places where (any) query optimizer has to take care to avoid combinatorial blowups of memory and/or processing time. But we had a real world query that had a combinatorial blowup at Name Resolution (aka “binding” or “semantic analysis”) phase.

Long: The trick is to use nested derived tables (or views) and to reference each column multiple times.

Here is an example using views. Start with a base table:

create table t1 (
  f1 int,
  f2 int,
  ...
);

Then create a view with columns that refer to the base table multiple times:

create view v1 as
select 
  case 
    when f1 > f2 then f2
    when f1 > f3 then f3
    when f1 > f4 then f4
  end AS f1,
  ...
  case ...  as f2,
  ...
from 
  t1;

Here, v1.f1 has 9 references to columns of t1. The other columns have similar definitions.
Then create view v2 that uses the columns of v1 in a similar way:

create view v2 as
select 
  case 
    when f1 > f2 then f2
    when f1 > f3 then f3
    when f1 > f4 then f4
  end AS f1,
  ...
  case ...  as f2,
  ...
from 
  v1;

v2.f1 has 9 references to columns of v1, and each of those has 9 references to columns of t1, which gives 9*9=81 references to columns of t1.

Repeat this until you reach depth level 8, and now the query takes about 1 minute in MariaDB and MySQL. In PostgreSQL, it is killed due to running out of memory. All the details can be found in MDEV-31130.
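The growth is exponential in the nesting depth: with 9 column references per level, depth 8 already yields tens of millions of leaf references per column:

```python
refs_per_level = 9  # column references in each CASE expression
for depth in (1, 2, 4, 8):
    print(depth, refs_per_level ** depth)
# depth 8: 9**8 = 43046721 references to columns of t1, per column
```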

In MariaDB (and MySQL), multiple references to the same column are shared, so query processing doesn’t require a combinatorial amount of memory (as it seems to in PostgreSQL). But condition processing ignores the sharing and so does require a combinatorial amount of time.

One can imagine fixing this by extending all the code that processes scalar expressions to handle common sub-expressions. But I imagine that would add a lot of complexity, all just to handle a fairly exotic case, right?

ANALYZE FORMAT=JSON now shows InnoDB buffer pool reads

Short: Starting from the latest MariaDB 10.6.15 and latest releases of other versions (yes, even stable ones), ANALYZE FORMAT=JSON prints counts of InnoDB page accesses and reads. Look for r_engine_stats. This is similar to PostgreSQL’s EXPLAIN (ANALYZE, BUFFERS).

Long: This was needed for diagnosing performance issues at a customer. The information is printed into the slow query log (MDEV-31558) in a similar (but not identical) way to how it’s done in Percona Server. The log shows the totals for the whole query; the ANALYZE FORMAT=JSON (MDEV-31577) output shows the counters for each query plan node. It looks like this on a cold buffer pool:

MariaDB> analyze format=json select min(filler1), count(kp2)
      -> from t1 where kp1 between 1 and 1000\G
*************************** 1. row ***************************
ANALYZE: {
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 10.832,
    "table": {
      "table_name": "t1",
      "access_type": "range",
      "possible_keys": ["kp1"],
      "key": "kp1",
      "key_length": "5",
      "used_key_parts": ["kp1"],
      "r_loops": 1,
      "rows": 981,
      "r_rows": 981,
      "r_table_time_ms": 10.488,
      "r_other_time_ms": 0.328,
      "r_engine_stats": {
        "pages_accessed": 2947,
        "pages_read_count": 1240,
        "pages_read_time_ms": 6.314
      },
      "filtered": 100,
      "r_filtered": 100,
      "index_condition": "t1.kp1 between 1 and 1000"
    }
  }
}

One can see that reading the table accessed 2947 pages, which resulted in 1240 page reads that took 6.314 milliseconds.

If I re-run the query, it still accesses "pages_accessed": 2947 pages, but the pages are now in the buffer pool, so pages_read_count and pages_read_time_ms are zero and are not printed:

MariaDB> analyze format=json select min(filler1), count(kp2)
      -> from t1 where kp1 between 1 and 1000\G
*************************** 1. row ***************************
ANALYZE: {
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 6.512,
    "table": {
      "table_name": "t1",
      "access_type": "range",
      "possible_keys": ["kp1"],
      "key": "kp1",
      "key_length": "5",
      "used_key_parts": ["kp1"],
      "r_loops": 1,
      "rows": 981,
      "r_rows": 981,
      "r_table_time_ms": 5.905,
      "r_other_time_ms": 0.59,
      "r_engine_stats": {
        "pages_accessed": 2947
      },
      "filtered": 100,
      "r_filtered": 100,
      "index_condition": "t1.kp1 between 1 and 1000"
    }
  }
}

UCASE() is now sargable in MariaDB. LCASE() can’t be.

Short: UCASE(key_col)=… is now sargable in MariaDB: the optimizer can ignore the UCASE() call here, but only for a few collations. One might expect this to work for any _ci (case-insensitive) collation, but it doesn’t. Surprisingly, LCASE() cannot be ignored even in utf8mb3_general_ci.

Long: Some months ago, we got a set of customer queries that had comparisons like

UCASE(tbl1.key_col) = UCASE(tbl2.key_col)

The optimizer couldn’t process this condition at all, so it was computing a cross join. It also wasn’t able to estimate the condition’s selectivity, so the rest of the query plan was bad, too.

But tbl1.key_col used a case-insensitive collation, utf8mb3_general_ci, which gave us an idea: perhaps

UPPER(tbl1.key_col)=something is equivalent to tbl1.key_col=something

After an investigation with Alexander Barkov, we discovered that this holds for the utf8mb3_general_ci collation. But other case-insensitive collations have exceptions to this. For example, in the Czech collation utf8mb3_czech_ci, 'CH' compares as if it were a single letter between ‘H’ and ‘I’. But 'cH' is not treated this way, which means UCASE('cH') will compare differently from 'cH'. MDEV-31494 has more details and more examples.

One would think LCASE() would behave the same way. It does not. Even in utf8mb3_general_ci, there are characters that compare differently from their LCASE(). The first example is the Angstrom Sign, which compares as different from the letter A:

MariaDB [test]> select 'A'='Å' ;
+-----------+
| 'A'='Å'   |
+-----------+
|         0 |
+-----------+

The LCASE() of the Angstrom Sign is “Latin Small Letter A with Ring Above”, which compares as equal to the letter a:

MariaDB [test]> select LCASE('A'), LCASE('Å'), LCASE('A')=LCASE('Å') ;
+------------+--------------+-------------------------+
| LCASE('A') | LCASE('Å')   | LCASE('A')=LCASE('Å')   |
+------------+--------------+-------------------------+
| a          | å            |                       1 |
+------------+--------------+-------------------------+

There are other similar examples. MDEV-31955 has the details.

Note that here LCASE(x)=... is true while x=... is not. If we do searches using x=..., we will miss the rows for which only LCASE(x)=... holds, so the rewrite cannot be done.