MySQL GROUP BY is slower when using an index

Author: Internet

I'm running on an AWS m4.large (2 vCPUs, 8 GB of memory) and seeing somewhat surprising behavior with MySQL and GROUP BY. I have this test database:

CREATE TABLE demo (
  time INT,
  word VARCHAR(30),
  count INT
);
CREATE INDEX timeword_idx ON demo(time, word);

I inserted 4,000,000 records with (uniformly) random words "t%s" % random.randint(0, 30000) and times random.randint(0, 86400).
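A minimal sketch of how such test data could be generated. The post only gives the two `random.randint` expressions, so the range used for `count`, the seed, and the helper names `make_rows` and `insert_statement` are assumptions made for illustration.

```python
import random


def make_rows(n, seed=None):
    """Yield (time, word, count) tuples shaped like the rows described above."""
    rng = random.Random(seed)
    for _ in range(n):
        time = rng.randint(0, 86400)          # inclusive on both ends
        word = "t%s" % rng.randint(0, 30000)  # e.g. "t12345"
        # Assumption: the post does not say how count was filled.
        count = rng.randint(1, 100)
        yield (time, word, count)


def insert_statement(rows):
    """Build one multi-row INSERT for a batch of rows.

    Plain string formatting is only acceptable here because every word is
    "t" followed by digits; real input should go through query parameters.
    """
    values = ",".join("(%d,'%s',%d)" % row for row in rows)
    return "INSERT INTO demo (time, word, count) VALUES %s;" % values


if __name__ == "__main__":
    batch = list(make_rows(3, seed=1))
    print(insert_statement(batch))
```

Sending the rows in multi-row batches rather than one statement per row usually keeps loading 4,000,000 rows tolerable.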

SELECT word, time, sum(count) FROM demo GROUP BY time, word;
3996922 rows in set (1 min 28.29 sec)

EXPLAIN SELECT word, time, sum(count) FROM demo GROUP BY time, word;
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------+
| id | select_type | table | type  | possible_keys | key          | key_len | ref  | rows    | Extra |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------+
|  1 | SIMPLE      | demo  | index | NULL          | timeword_idx | 38      | NULL | 4002267 |       |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------+

Then I ran it without the index:

SELECT word, time, sum(count) FROM demo IGNORE INDEX (timeword_idx) GROUP BY time, word;
3996922 rows in set (34.75 sec)

EXPLAIN SELECT word, time, sum(count) FROM demo IGNORE INDEX (timeword_idx) GROUP BY time, word;
+----+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra                           |
+----+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
|  1 | SIMPLE      | demo  | ALL  | NULL          | NULL | NULL    | NULL | 4002267 | Using temporary; Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+

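As you can see, the query that uses the index takes about 2.5 times as long (88 s versus 35 s). I'm not too surprised: by using the index, the query may avoid reading the time and word columns from the table, but unfortunately the index is so sparse that this gains little. Instead, retrieving count turns a straight scan into a random access pattern.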

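I just want to confirm that this is the reason, and I'm wondering whether there is a rule of thumb for when using an index for GROUP BY ends up giving worse performance.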

Edit:

I followed Gordon Linoff's answer and used:

CREATE INDEX timeword_idx ON demo(time, word, count);

With this covering index, the result is computed about 10 times faster than with the full scan:

SELECT word, time, sum(count) FROM demo GROUP BY time, word;
3996922 rows in set (3.36 sec)

EXPLAIN SELECT word, time, sum(count) FROM demo GROUP BY time, word;
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------------+
| id | select_type | table | type  | possible_keys | key          | key_len | ref  | rows    | Extra       |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------------+
|  1 | SIMPLE      | demo  | index | NULL          | timeword_idx | 43      | NULL | 4002267 | Using index |
+----+-------------+-------+-------+---------------+--------------+---------+------+---------+-------------+

Very impressive!
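The key_len values in the EXPLAIN outputs (38 for the two-column index, 43 once count is added) can be checked by hand. The sketch below assumes a single-byte character set such as latin1, which the post does not state, and relies on the columns being nullable because they were declared without NOT NULL. The helper name `explain_key_len` is hypothetical.

```python
def explain_key_len(columns, bytes_per_char=1):
    """Estimate the key_len that EXPLAIN reports for a list of index columns.

    columns: list of (kind, size, nullable) tuples, kind being "int" or "varchar".
    bytes_per_char=1 is an assumption (latin1); utf8mb4 would use 4.
    """
    total = 0
    for kind, size, nullable in columns:
        if kind == "int":
            total += 4                          # INT is stored in 4 bytes
        elif kind == "varchar":
            total += size * bytes_per_char + 2  # 2 extra bytes hold the length
        else:
            raise ValueError("unsupported column kind: %r" % kind)
        if nullable:
            total += 1                          # 1 extra byte for the NULL flag
    return total


time_col = ("int", None, True)
word_col = ("varchar", 30, True)
count_col = ("int", None, True)

print(explain_key_len([time_col, word_col]))             # (time, word)
print(explain_key_len([time_col, word_col, count_col]))  # (time, word, count)
```

Under those assumptions the two calls give 38 and 43, matching the plans above: the wider key shows that count is now stored in the index itself, which is what lets MySQL report "Using index" and skip the table rows.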

Answer:

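You have a reasonably sized table, so the problem could be either the loss of sequential data access or thrashing. Using the index requires walking the index and then looking up each row in the data pages to get count.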

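This can actually be worse than just reading the pages and sorting, because the pages are not read in order. Sequential reads are optimized much better than random reads. In the worst case, the page cache is full and a random read forces a page to be evicted. When that happens, a single page may have to be read multiple times. With only 4 million relatively small rows, thrashing is unlikely unless memory is severely limited.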
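The effect described above can be illustrated with a toy model. This is only a sketch: it assumes a plain LRU page cache and made-up sizes (100 rows per page, 1,000 pages), and it does not model how MySQL actually manages its buffer pool. It counts how many page reads a scan needs when rows are visited in table order versus in index order, where index order is effectively random relative to the table.

```python
import random
from collections import OrderedDict


def page_reads(row_order, rows_per_page, cache_pages):
    """Count page reads (cache misses) for visiting rows in the given order."""
    cache = OrderedDict()  # page number -> True, ordered from oldest to newest
    reads = 0
    for row in row_order:
        page = row // rows_per_page
        if page in cache:
            cache.move_to_end(page)        # mark as most recently used
        else:
            reads += 1                     # miss: the page must be read
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict the least recently used page
    return reads


n_rows, rows_per_page = 100_000, 100       # 1,000 pages in total (made-up sizes)
table_order = list(range(n_rows))          # full table scan
index_order = table_order[:]
random.Random(0).shuffle(index_order)      # index order, random relative to the table

sequential = page_reads(table_order, rows_per_page, cache_pages=50)
thrashing = page_reads(index_order, rows_per_page, cache_pages=50)
fits_in_cache = page_reads(index_order, rows_per_page, cache_pages=1000)

print("sequential scan:          ", sequential)
print("index order, small cache: ", thrashing)
print("index order, large cache: ", fits_in_cache)
```

In this model the sequential scan reads each page exactly once, and so does the index-order scan when the cache holds the whole table. With a small cache, the index-order scan re-reads pages many times over, which is the thrashing case. Even without thrashing, the index-order reads are still random rather than sequential, which the model does not charge for.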

If this explanation is correct, then including count in the index should speed up the query:

CREATE INDEX timeword_idx ON demo(time, word, count);

Tags: database-indexes, mysql, database, group-by
Source: https://codeday.me/bug/20191027/1941718.html