An idea: create ha_trace tracing storage engine

Our experience in solving query optimizer problems shows that a good chunk of them come down to an incorrect choice of join order. The most frequent causes of the problems are:

  1. Table engine returns index statistics or records-in-range numbers that are very far from reality;
  2. The WHERE clause has high-selectivity conditions for one table and low-selectivity conditions for the other. The optimizer is not aware of this and chooses a poor join order;
  3. Bugs or shortcomings in the optimizer or the table engine.

At the moment, investigating these kinds of problems is hard and time-consuming:

  • There is no easy way to find out what the storage engine has returned to the optimizer from records-in-range calls. One has to manually repeat the steps taken by equality propagation, constant table detection and other query rewrites, construct the table’s condition and run the corresponding single-table EXPLAIN;
  • You have to construct and run separate queries to check whether the table/index statistics match the actual distribution of the data examined by the query (far-off statistics for skewed data distributions are a frequent problem);
  • There is no easy way to obtain information about the selectivity of the table’s condition. You either need a debugger or, in some cases, you can get by with non-trivial arithmetic over the values of the Handler_read_XXX counters.

A completely different problem is that some storage engines are now provided by parties other than MySQL, and we’ll need to be able to find out who is at fault when something doesn’t work with a 3rd-party engine.

Having spent some time thinking about both of the above, I’ve got an idea: both can be resolved by a single solution – a tracing storage engine.

The ha_trace engine

The engine will be used this way:

  1. You take the tables that are involved in the problematic query and “wrap” them in ha_trace tables:
    CREATE TABLE trace_tbl1 (... copy the DDL ...) ENGINE=TRACE UNDERLYING_TBL='tbl1' 

    The trace tables look, and can be queried, exactly like their underlying tables.

  2. You run the problematic query against the wrapper tables. The trace engine intercepts all storage engine calls and either just records them or accumulates some summary information (the sketch after these steps shows the interception pattern).
  3. You obtain the trace results by querying some special $trace$results table.
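
To make the interception idea concrete, here is a minimal C++ sketch of the wrapper pattern. The method names (rnd_init, rnd_next) mirror MySQL’s handler API, but the classes below are simplified stand-ins made up for illustration, not the real server interfaces:

  #include <cstdio>
  #include <string>
  #include <vector>

  struct handler {                          // simplified stand-in for the real handler class
    virtual int rnd_init(bool scan) = 0;
    virtual int rnd_next() = 0;
    virtual ~handler() {}
  };

  struct dummy_engine : handler {           // toy "underlying" engine with three rows
    int rows_left = 3;
    int rnd_init(bool) override { rows_left = 3; return 0; }
    int rnd_next() override { return rows_left-- > 0 ? 0 : 137; }  // 137 = HA_ERR_END_OF_FILE
  };

  class ha_trace : public handler {
    handler *underlying;                    // the wrapped table's handler
    std::vector<std::string> *trace;        // where intercepted calls are recorded
  public:
    ha_trace(handler *u, std::vector<std::string> *t) : underlying(u), trace(t) {}

    int rnd_init(bool scan) override {
      int res = underlying->rnd_init(scan); // delegate to the real engine...
      trace->push_back("rnd_init(" + std::string(scan ? "TRUE" : "FALSE") +
                       ")= " + std::to_string(res));
      return res;                           // ...and hand its result back unchanged
    }
    int rnd_next() override {
      int res = underlying->rnd_next();
      trace->push_back("rnd_next()= " + std::to_string(res));
      return res;
    }
  };

  int main() {
    std::vector<std::string> log;
    dummy_engine real;
    ha_trace traced(&real, &log);           // "wrap" the table, as ENGINE=TRACE would
    traced.rnd_init(true);
    while (traced.rnd_next() == 0) {}
    for (size_t i = 0; i < log.size(); i++)
      std::printf("%s\n", log[i].c_str());
  }

The real engine would proxy index_read, records_in_range and the rest of the handler methods in exactly the same way.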

The trace engine can do many practically useful things:

Statistics checker mode
This mode will check whether the engine’s statistics numbers are any good. ha_trace will intercept statistics and record-retrieval API calls and compile a report on how far the statistics were from the observed table data. It could also build a histogram of the observed data, which would give a clue about whether running ANALYZE TABLE would help.
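
As a rough illustration of the bookkeeping this mode needs, here is a sketch that compares records_in_range estimates with the row counts actually observed; everything in it except the records_in_range concept is hypothetical:

  #include <cstdio>
  #include <vector>

  struct range_check {
    unsigned long long estimated;   // what records_in_range() returned
    unsigned long long actual;      // rows the scan actually produced for that range
  };

  // Report ranges where the estimate was off by more than `factor` times.
  void report_bad_estimates(const std::vector<range_check> &checks, double factor) {
    for (size_t i = 0; i < checks.size(); i++) {
      double est = (double)checks[i].estimated;
      double act = (double)checks[i].actual;
      if (est == 0 || act == 0) continue;          // avoid division by zero in the sketch
      double err = est > act ? est / act : act / est;
      if (err > factor)
        std::printf("range #%zu: estimated %llu rows, observed %llu (%.0fx off)\n",
                    i, checks[i].estimated, checks[i].actual, err);
    }
  }

  int main() {
    // e.g. skewed data: the engine estimated ~20 rows per range, one range had 700
    std::vector<range_check> checks = {{20, 18}, {20, 700}, {20, 25}};
    report_bad_estimates(checks, 10.0);
  }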

Call trace mode
This is similar to what ltrace(1) does. The advantage over ltrace is that ha_trace knows the meaning of the calls and can produce more meaningful output. In particular, it can provide smarter call filtering and compression of repetitive call patterns.

Call trace logs can be used to check whether the performed calls are compliant with the storage engine API and to find out who the violator is: the SQL layer or the engine.
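
To illustrate, compliance checking can be a small state machine run over the recorded call sequence. The rule below (scan calls must be bracketed by rnd_init/rnd_end) is just a made-up fragment of the real API contract:

  #include <cstdio>
  #include <string>
  #include <vector>

  enum scan_state { CLOSED, SCANNING };

  // Walk a recorded call trace and flag calls that are illegal in the current state.
  void check_call_order(const std::vector<std::string> &calls) {
    scan_state state = CLOSED;
    for (size_t i = 0; i < calls.size(); i++) {
      const std::string &c = calls[i];
      if (c == "rnd_init") {
        if (state == SCANNING)
          std::printf("call %zu: rnd_init() while a scan is already open\n", i);
        state = SCANNING;
      } else if (c == "rnd_next") {
        if (state != SCANNING)
          std::printf("call %zu: rnd_next() without a preceding rnd_init()\n", i);
      } else if (c == "rnd_end") {
        if (state != SCANNING)
          std::printf("call %zu: rnd_end() without an open scan\n", i);
        state = CLOSED;
      }
    }
    if (state == SCANNING)
      std::printf("trace ended with an unclosed scan (missing rnd_end())\n");
  }

  int main() {
    // the first rnd_next() is a violation: no scan has been opened yet
    check_call_order({"rnd_next", "rnd_init", "rnd_next", "rnd_next"});
  }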

Query trace mode
This is similar to call trace but more high-level. It will provide a view of how the query was executed, giving information about how many rows were scanned in each table and so forth. We’ll need to use some kind of compression to reduce the size of the collected data. I’m thinking of building a regular expression/automaton with counters on the edges (a minimal sketch of such compression appears after the example below), so the collected data looks like this:

  t1.rnd_init(TRUE /*do sequential scan*/)= 0
  t2.index_init(a)= 0
  1000 * {
      t1.rnd_next()= 0
      t2.index_read(...)= {0 - 700 times, EOF - 300 times}
      20 (min=1,max=200) * { t2.index_next_same() }
  }
  t2.index_end()= 0
  t1.rnd_end()= 0

If you haven’t already figured out how the EXPLAIN of this query looks, here it is:

explain select * from t1, t2 where t2.a=t1.key1;
+----+-------------+-------+------+---------------+------+---------+---------+------+-------+
| id | select_type | table | type | possible_keys | key  | key_len | ref     | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+---------+------+-------+
|  1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL    | 1000 |       |
|  1 | SIMPLE      | t2    | ref  | a             | a    | 5       | t1.key1 |   22 |       |
+----+-------------+-------+------+---------------+------+---------+---------+------+-------+

The trace shows a lot of useful information:
  • In 300 of 1000 cases, table t2 had no matching records;
  • When there were matches, there were 20 matches on average;
  • There was an off-the-scale case with 200 matching rows.
And you can see all of this directly, without doing arithmetic exercises over the ‘Handler_read_%’ status variables.
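
As for the compression itself, the simplest building block is run-length encoding of consecutive identical calls. Here is a hypothetical sketch; compressing repeated groups of calls and keeping the min/max counters shown in the example above would be the next step:

  #include <cstdio>
  #include <string>
  #include <vector>

  // Collapse consecutive repetitions of the same call into one line with a
  // counter, the way "20 * { t2.index_next_same() }" does in the trace above.
  void print_compressed(const std::vector<std::string> &calls) {
    for (size_t i = 0; i < calls.size();) {
      size_t j = i;
      while (j < calls.size() && calls[j] == calls[i]) j++;  // measure the run
      size_t run = j - i;
      if (run == 1)
        std::printf("%s\n", calls[i].c_str());
      else
        std::printf("%zu * { %s }\n", run, calls[i].c_str());
      i = j;
    }
  }

  int main() {
    print_compressed({"t1.rnd_next()= 0", "t2.index_read(...)= 0",
                      "t2.index_next_same()", "t2.index_next_same()",
                      "t2.index_next_same()", "t1.rnd_next()= 0"});
  }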

Whether this is the right solution

A wrapper engine is a natural solution for troubleshooting problems between storage engines and the SQL layer (think of ltrace, strace, ODBC trace, and so on). It is a reasonably good query tracing solution too:

  • Per-table and per-query tracing is a fine level of granularity that is a big step ahead of counters;
  • ha_trace adds exactly zero overhead when it is not enabled;
  • It works for all current and future storage engines;
  • ha_trace is probably the only tracing solution that doesn’t require making changes all over the code.

.
.
.
Having written all this in one sitting (well, also some subway standing), I’ll now try to spend the remainder of this weekend in a less geeky way.

4 Comments

  1. Brian Aker says:

    Hi!

    Go for it 🙂

    It sounds like an excellent idea.

    Cheers,
    -Brian


  2. I don’t know if I am going to start a flamestorm, but:

    As an Oracle and MySQL administrator, I think there are things to be learned from a database engine like Oracle.

    For example: the wait interface.
    The Oracle RDBMS keeps track of time spent by recording instrumented portions of code as ‘events’. These are events like:
    -db file sequential read
    (table fetch by rowid, index scanning)
    -db file scattered read
    (multiblock reads of tables (aka full table scans) and indexes (a non-ordered index full scan))
    -direct path write
    (writing to the temporary area for sorts, hashing or LOBs which exceed the in-memory area)
    etc. (there are 878 of these events in Oracle 10)

    These are recorded at both the global and the session level.

    Does this impact performance? Probably… but it’s better to know what is happening than not to. (You cannot solve something you don’t know about.)


  3. Jeremy Cole says:

    Hi Sergey,

    Sounds like a wonderful idea, but I’d rather see it implemented completely internal to MySQL rather than as another hacky storage engine. I think the code would look much the same, just the code to enable/disable it would be a lot more user friendly. Perhaps:

    SET TRACE STORAGE ENGINE InnoDB ON;

    As well as some solution for tracing individual tables always, maybe:

    SET TRACE TABLE db.table ON;

    And some way to trace an individual table for a single session:

    SET SESSION TRACE TABLE db.table AS tracedtable ON;

    Just some ideas. 🙂

    Regards,

    Jeremy

