数据库
首页 > 数据库> > 使用grouping sets语法与拆分sql查询时间对比

使用grouping sets语法与拆分sql查询时间对比

作者:互联网

背景:为满足业务方分析需求,数仓基础统计中存在大量多维度的指标计算。原有计算中根据不同维度拆分多个sql串行提交hive集群进行查询,多个任务在map阶段存在重复读取源数据等问题,因此引入grouping sets语法合并业务逻辑为一条sql。

一次map一次reduce完成一个基础指标的统计,减少集群资源浪费。现以安卓国内版天活跃启动统计任务为例进行测试。

1.sql1 按province,model维度统计source为dw的前后台启动活跃数据。

sql1
SELECT SOURCE,
       province,
       model,
       sum(front_start_pv),
       sum(if(front_start_pv>0,1,0)),
       sum(all_start_pv),
       sum(1)
FROM
  (SELECT SOURCE,
          province,
          model,
          deviceId,
          sum(front_start_pv) AS front_start_pv,
          sum(all_start_pv) AS all_start_pv
   FROM dw_wps_android.dwd_app_start_d
   WHERE dt='2020-04-04'
     AND SOURCE='dw'
   GROUP BY SOURCE,
            province,
            model,
            deviceId) a
GROUP BY SOURCE,
         province,
         model

2.sql2 按channel,province,model维度统计source为dw的前后台启动活跃数据。 

sql2
SELECT SOURCE,
       channel,
       province,
       model,
       sum(front_start_pv),
       sum(if(front_start_pv>0,1,0)),
       sum(all_start_pv),
       sum(1)
FROM
  (SELECT SOURCE,
          channel,
          province,
          model,
          deviceId,
          sum(front_start_pv) AS front_start_pv,
          sum(all_start_pv) AS all_start_pv
   FROM dw_wps_android.dwd_app_start_d
   WHERE dt='2020-04-04'
     AND SOURCE='dw'
   GROUP BY SOURCE,
            channel,
            province,
            model,
            deviceId) a
GROUP BY SOURCE,
         channel,
         province,
         model

 

 sql3 按app_version,province,model维度统计source为dw的前后台启动活跃数据。

sql3
SELECT SOURCE,
app_version,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECT SOURCE,
app_version,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-04-04'
AND SOURCE='dw'
GROUP BY SOURCE,
app_version,
province,
model,
deviceId) a
GROUP BY SOURCE,
app_version,
province,
model

 

 4.sql4 按app_version,channel,province,model维度统计source为dw的前后台启动活跃数据。

sql4
SELECT SOURCE,
app_version,
channel,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECT SOURCE,
app_version,
channel,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-04-04'
AND SOURCE='dw'
GROUP BY SOURCE,
app_version,
channel,
province,
model,
deviceId) a
GROUP BY SOURCE,
app_version,
channel,
province,
model

 

 5.sql5 按province,model维度统计source为所有平台的前后台启动活跃数据

sql5
SELECT 'all' AS SOURCE,
       province,
       model,
       sum(front_start_pv),
       sum(if(front_start_pv>0,1,0)),
       sum(all_start_pv),
       sum(1)
FROM
  (SELECT 'all' AS SOURCE,
          province,
          model,
          deviceId,
          sum(front_start_pv) AS front_start_pv,
          sum(all_start_pv) AS all_start_pv
   FROM dw_wps_android.dwd_app_start_d
   WHERE dt='2020-04-04'
   GROUP BY 'all',
            province,
            model,
            deviceId) a
GROUP BY 'all',
         province,
         model

 

 6.sql6 按channel,province,model维度统计source为所有平台的前后台启动活跃数据

sql6
SELECT 'all' AS SOURCE,
       channel,
       province,
       model,
       sum(front_start_pv),
       sum(if(front_start_pv>0,1,0)),
       sum(all_start_pv),
       sum(1)
FROM
  (SELECT 'all' AS SOURCE,
          channel,
          province,
          model,
          deviceId,
          sum(front_start_pv) AS front_start_pv,
          sum(all_start_pv) AS all_start_pv
   FROM dw_wps_android.dwd_app_start_d
   WHERE dt='2020-04-04'
   GROUP BY 'all',
            channel,
            province,
            model,
            deviceId) a
GROUP BY 'all',
         channel,
         province,
         model

 

 7.sql7 按app_version,province,model维度统计source为所有平台的前后台启动活跃数据

sql7
 SELECT 'all' AS SOURCE,
       app_version,
       province,
       model,
       sum(front_start_pv),
       sum(if(front_start_pv>0,1,0)),
       sum(all_start_pv),
       sum(1)
FROM
  (SELECT 'all' AS SOURCE,
          app_version,
          province,
          model,
          deviceId,
          sum(front_start_pv) AS front_start_pv,
          sum(all_start_pv) AS all_start_pv
   FROM dw_wps_android.dwd_app_start_d
   WHERE dt='2020-04-04'
   GROUP BY 'all',
            app_version,
            province,
            model,
            deviceId) a
GROUP BY 'all',
         app_version,
         province,
         model

 

 8.sql8 按app_version,channel,province,model维度统计source为所有平台的前后台启动活跃数据

sql8
SELECT 'all' AS SOURCE,
       app_version,
       channel,
       province,
       model,
       sum(front_start_pv),
       sum(if(front_start_pv>0,1,0)),
       sum(all_start_pv),
       sum(1)
FROM
  (SELECT 'all' AS SOURCE,
          app_version,
          channel,
          province,
          model,
          deviceId,
          sum(front_start_pv) AS front_start_pv,
          sum(all_start_pv) AS all_start_pv
   FROM dw_wps_android.dwd_app_start_d
   WHERE dt='2020-04-04'
   GROUP BY 'all',
            app_version,
            channel,
            province,
            model,
            deviceId) a
GROUP BY 'all',
         app_version,
         channel,
         province,
         model 

 

 9.sql9 合并source,app_version,channel,province,model维度组合统计前后台启动活跃数据

sql9
SELECT nvl(source,"all") as source,
       app_version,
       channel,
       province,
       model,
       sum(front_start_pv) as front_start_pv,
       sum(if(front_start_pv>0,1,0)) as front_start_uv,
       sum(all_start_pv) as all_start_pv,
       sum(1) as all_start_uv
FROM
  (SELECT source,
          nvl(app_version,"dw_null") as app_version,
          nvl(channel,"dw_null") as channel,
          nvl(province,"dw_null") as province,
          nvl(model,"dw_null") as model,
          deviceId,
          sum(front_start_pv) as front_start_pv,
          sum(all_start_pv) as all_start_pv
   FROM dw_wps_android.dwd_app_start_d
   WHERE dt='2020-04-04'
   GROUP BY source,
            nvl(app_version,"dw_null"),
            nvl(channel,"dw_null"),
            nvl(province,"dw_null"),
            nvl(model,"dw_null"),
            deviceId) a
GROUP BY source,
         app_version,
         channel,
         province,
         model
GROUPING SETS(
(source,province,model),
(source,channel,province,model),
(source,app_version,province,model),
(source,app_version,channel,province,model),
(province,model),
(channel,province,model),
(app_version,province,model),
(app_version,channel,province,model)
)HAVING source='dw' OR source is null

 

 

sql行号

数据行

time(秒)

sql1

191618

54.289

sql2

349692

186.34

sql3

811334

48.341

sql4

1059390

50.373

sql5

197314

66.308

sql6

405496

68.312

sql7

1077795

152.358

sql8

1376569

182.347

总和

5469208

808.668

sql行号

数据行

time(秒)

sql9

5469208

134.432

说明:sql都是都测试队列相同集群环境下测试的。都是在container复用的情况下进行测试,排除了第一个sql从0申请资源执行时间过长的因素。

结论:采用grouping set语法进行sql合并能有效减少hive任务执行时间,同时以上9个sql占用的集群资源基本相当,优化后会大量减少集群资源的消耗。之后会测试数仓批量任务执行优化前后的时间对比,进行更全面的对比。

标签:province,pv,sum,sql,start,sets,model,app,grouping
来源: https://blog.csdn.net/a80090023/article/details/123184760