Optimizing a MySQL full outer join for large amounts of data

Author: Internet

We have the following MySQL tables (simplified a bit):

CREATE TABLE `MONTH_RAW_EVENTS` (
  `idEvent` int(11) unsigned NOT NULL,
  `city` varchar(45) NOT NULL,
  `country` varchar(45) NOT NULL,
  `ts` datetime NOT NULL,
  `idClient` varchar(45) NOT NULL,
  `event_category` varchar(45) NOT NULL,
  -- ... bunch of other fields
  PRIMARY KEY (`idEvent`),
  KEY `idx_city` (`city`),
  KEY `idx_country` (`country`),
  KEY `idClient` (`idClient`)
) ENGINE=InnoDB;

CREATE TABLE `compilation_table` (
  `idClient` int(11) unsigned DEFAULT NULL,
  `city` varchar(200) DEFAULT NULL,
  `month` int(2) DEFAULT NULL,
  `year` int(4) DEFAULT NULL,
  `events_profile` int(10) unsigned NOT NULL DEFAULT '0',
  `events_others` int(10) unsigned NOT NULL DEFAULT '0',
  `events_total` int(10) unsigned NOT NULL DEFAULT '0',
  KEY `idx_month` (`month`),
  KEY `idx_year` (`year`),
  KEY `idx_idClient` (`idClient`),
  KEY `idx_city` (`city`)
) ENGINE=InnoDB;

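MONTH_RAW_EVENTS contains nearly 20 million rows, one per action a user performed on the website, and is close to 4 GB in size.

compilation_table holds a monthly summary per client/city; we use it to display statistics on the website in real time.

We process the statistics once a month (from the first table into the second), and we are trying to optimize the query that does this (until now we did all the processing in PHP, which takes a very long time).

This is the query we came up with. It seems to do the job on a small subset of the data,
but the problem is that it takes more than 6 hours to process the full data set.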

INSERT INTO compilation_table (idClient,city,month,year,events_profile,events_others)

    SELECT  IFNULL(OTHERS.idClient,CLIPROFILE.idClient) as idClient,
            IF(IFNULL(OTHERS.city,CLIPROFILE.city)='','Others',IFNULL(OTHERS.city,CLIPROFILE.city)) as city,
            01,2014,
            IFNULL(CLIPROFILE.cnt,0) as events_profile,
            IFNULL(OTHERS.cnt,0) as events_others

    FROM
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`!='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as OTHERS
 LEFT JOIN 
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as CLIPROFILE 
    ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient

 UNION

    SELECT  IFNULL(OTHERS.idClient,CLIPROFILE.idClient) as idClient,
            IF(IFNULL(OTHERS.city,CLIPROFILE.city)='','Others',IFNULL(OTHERS.city,CLIPROFILE.city)) as city,
            01,2014,
            IFNULL(CLIPROFILE.cnt,0) as events_profile,
            IFNULL(OTHERS.cnt,0) as events_others           
    FROM
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`!='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as OTHERS
 RIGHT JOIN 
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as CLIPROFILE 
    ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient

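What we are trying to do is a FULL OUTER JOIN in MySQL, so the query follows the usual emulation pattern: a LEFT JOIN and a RIGHT JOIN of the same two subqueries, combined with UNION.

How can we optimize the query? We have been trying different indexes and shuffling them around, but after 8 hours it still had not finished running.

The MySQL server is a dedicated machine running Percona MySQL 5.5, with 2 CPUs, 2 GB of RAM and an SSD disk.
We tuned the server configuration for this kind of machine using the Percona tools.

Any help would be greatly appreciated.

Thanks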

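Answer:

You are using UNION, which implies DISTINCT processing: the combined result of both joins is de-duplicated, which is extra work on top of the joins themselves.

It is usually better to rewrite a full join as a left join plus only the non-matching rows of the right join, combined with UNION ALL (this is appropriate if it is a proper 1:n join):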

OTHERS LEFT JOIN CLIPROFILE 
ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient
union all
OTHERS RIGHT JOIN CLIPROFILE 
ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient
WHERE OTHERS.idClient IS NULL 
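
Applied to the query from the question, the pattern above might look roughly like the following. This is only a sketch that reuses the question's own table, column and alias names; it has not been run against the real data, and the date range is copied unchanged from the question.

```sql
-- Sketch: FULL OUTER JOIN emulated as LEFT JOIN + anti-join, glued with UNION ALL.
INSERT INTO compilation_table (idClient, city, month, year, events_profile, events_others)

    -- Part 1: every OTHERS row, with the matching profile count if there is one.
    SELECT  OTHERS.idClient,
            IF(OTHERS.city = '', 'Others', OTHERS.city),
            1, 2014,
            IFNULL(CLIPROFILE.cnt, 0),
            OTHERS.cnt
    FROM
    (
        SELECT idClient, CONCAT(city, ', ', country) AS city, COUNT(*) AS cnt
        FROM `MONTH_RAW_EVENTS`
        WHERE `ts` > '2014-01-01 00:00:00' AND `ts` <= '2014-01-31 23:59:59'
          AND `event_category` != 'CLIENT PROFILE'
        GROUP BY idClient, city
    ) AS OTHERS
    LEFT JOIN
    (
        SELECT idClient, CONCAT(city, ', ', country) AS city, COUNT(*) AS cnt
        FROM `MONTH_RAW_EVENTS`
        WHERE `ts` > '2014-01-01 00:00:00' AND `ts` <= '2014-01-31 23:59:59'
          AND `event_category` = 'CLIENT PROFILE'
        GROUP BY idClient, city
    ) AS CLIPROFILE
    ON CLIPROFILE.city = OTHERS.city AND CLIPROFILE.idClient = OTHERS.idClient

    UNION ALL

    -- Part 2: only the CLIPROFILE rows that have no OTHERS counterpart.
    SELECT  CLIPROFILE.idClient,
            IF(CLIPROFILE.city = '', 'Others', CLIPROFILE.city),
            1, 2014,
            CLIPROFILE.cnt,
            0
    FROM
    (
        SELECT idClient, CONCAT(city, ', ', country) AS city, COUNT(*) AS cnt
        FROM `MONTH_RAW_EVENTS`
        WHERE `ts` > '2014-01-01 00:00:00' AND `ts` <= '2014-01-31 23:59:59'
          AND `event_category` != 'CLIENT PROFILE'
        GROUP BY idClient, city
    ) AS OTHERS
    RIGHT JOIN
    (
        SELECT idClient, CONCAT(city, ', ', country) AS city, COUNT(*) AS cnt
        FROM `MONTH_RAW_EVENTS`
        WHERE `ts` > '2014-01-01 00:00:00' AND `ts` <= '2014-01-31 23:59:59'
          AND `event_category` = 'CLIENT PROFILE'
        GROUP BY idClient, city
    ) AS CLIPROFILE
    ON CLIPROFILE.city = OTHERS.city AND CLIPROFILE.idClient = OTHERS.idClient
    WHERE OTHERS.idClient IS NULL;
```

Because the second part returns only unmatched rows, the two parts cannot overlap, so UNION ALL is safe and the de-duplication step disappears. The IFNULL calls on idClient and city are also no longer needed, since each part knows which side is guaranteed to be non-NULL.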

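Additionally, you could materialize the results of the two derived tables in temporary tables before joining them, so that each aggregation is computed only once (I don't know whether MySQL's optimizer is smart enough to do this automatically).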
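
A minimal sketch of that materialization idea follows. The names tmp_others and tmp_profile and the index layout are assumptions made for illustration, not something from the question. MySQL does not allow the same TEMPORARY table to be referenced more than once in a single statement, so the two halves of the full join are written here as two separate INSERT statements rather than one UNION ALL.

```sql
-- Hypothetical scratch tables; each aggregation over MONTH_RAW_EVENTS runs once.
CREATE TEMPORARY TABLE tmp_others (KEY (idClient, city))
    SELECT idClient, CONCAT(city, ', ', country) AS city, COUNT(*) AS cnt
    FROM `MONTH_RAW_EVENTS`
    WHERE `ts` > '2014-01-01 00:00:00' AND `ts` <= '2014-01-31 23:59:59'
      AND `event_category` != 'CLIENT PROFILE'
    GROUP BY idClient, city;

CREATE TEMPORARY TABLE tmp_profile (KEY (idClient, city))
    SELECT idClient, CONCAT(city, ', ', country) AS city, COUNT(*) AS cnt
    FROM `MONTH_RAW_EVENTS`
    WHERE `ts` > '2014-01-01 00:00:00' AND `ts` <= '2014-01-31 23:59:59'
      AND `event_category` = 'CLIENT PROFILE'
    GROUP BY idClient, city;

-- Half 1: all "other" rows, plus the profile count where one exists.
INSERT INTO compilation_table (idClient, city, month, year, events_profile, events_others)
    SELECT o.idClient, IF(o.city = '', 'Others', o.city), 1, 2014,
           IFNULL(p.cnt, 0), o.cnt
    FROM tmp_others AS o
    LEFT JOIN tmp_profile AS p
           ON p.idClient = o.idClient AND p.city = o.city;

-- Half 2: profile rows with no matching "other" row (anti-join).
INSERT INTO compilation_table (idClient, city, month, year, events_profile, events_others)
    SELECT p.idClient, IF(p.city = '', 'Others', p.city), 1, 2014,
           p.cnt, 0
    FROM tmp_profile AS p
    LEFT JOIN tmp_others AS o
           ON o.idClient = p.idClient AND o.city = p.city
    WHERE o.idClient IS NULL;

DROP TEMPORARY TABLE tmp_others, tmp_profile;
```

The index on (idClient, city) is there so that the joins between the two scratch tables can use index lookups; whether that actually helps would have to be checked with EXPLAIN on the real data.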

Also, it might be more efficient to group by and join on city and country as separate columns, and then do the CONCAT(city, ', ', country) AS city in the outer step.
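
As a rough illustration of that last suggestion, here is the LEFT JOIN half with city and country kept as separate columns until the outer SELECT. It reuses the question's names and is a sketch only; the anti-join half would change in the same way.

```sql
-- Sketch: group and join on the raw columns, build the display string last.
SELECT  OTHERS.idClient,
        CONCAT(OTHERS.city, ', ', OTHERS.country) AS city,  -- concatenated once per output row
        1, 2014,
        IFNULL(CLIPROFILE.cnt, 0) AS events_profile,
        OTHERS.cnt AS events_others
FROM
(
    SELECT idClient, city, country, COUNT(*) AS cnt
    FROM `MONTH_RAW_EVENTS`
    WHERE `ts` > '2014-01-01 00:00:00' AND `ts` <= '2014-01-31 23:59:59'
      AND `event_category` != 'CLIENT PROFILE'
    GROUP BY idClient, city, country
) AS OTHERS
LEFT JOIN
(
    SELECT idClient, city, country, COUNT(*) AS cnt
    FROM `MONTH_RAW_EVENTS`
    WHERE `ts` > '2014-01-01 00:00:00' AND `ts` <= '2014-01-31 23:59:59'
      AND `event_category` = 'CLIENT PROFILE'
    GROUP BY idClient, city, country
) AS CLIPROFILE
ON  CLIPROFILE.idClient = OTHERS.idClient
AND CLIPROFILE.city     = OTHERS.city
AND CLIPROFILE.country  = OTHERS.country;
```

Grouping on both columns also keeps same-named cities in different countries apart, and the string concatenation then runs once per summary row instead of once per raw event. The question's 'Others' substitution can be wrapped around the CONCAT expression in the outer SELECT if it is still wanted.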

Tags: query-performance, outer-join, sql, mysql
Source: https://codeday.me/bug/20191122/2056277.html