使用grouping sets语法与拆分sql查询时间对比
作者:互联网
背景:为满足业务方分析需求,数仓基础统计中存在大量多维度的指标计算。原有计算中根据不同维度拆分多个sql串行提交hive集群进行查询,多个任务在map阶段存在重复读取源数据等问题,因此引入grouping sets语法合并业务逻辑为一条sql。
一次map一次reduce完成一个基础指标的统计,减少集群资源浪费。现以安卓国内版天活跃启动统计任务为例进行测试。
1.sql1 按province,model维度统计source为dw的前后台启动活跃数据。
sql1
SELECT SOURCE,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECT SOURCE,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-04-04'
AND SOURCE='dw'
GROUP BY SOURCE,
province,
model,
deviceId) a
GROUP BY SOURCE,
province,
model
2.sql2 按channel,province,model维度统计source为dw的前后台启动活跃数据。
sql2
SELECT SOURCE,
channel,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECT SOURCE,
channel,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-04-04'
AND SOURCE='dw'
GROUP BY SOURCE,
channel,
province,
model,
deviceId) a
GROUP BY SOURCE,
channel,
province,
model
sql3 按app_version,province,model维度统计source为dw的前后台启动活跃数据。
sql3
SELECT SOURCE,
app_version,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECT SOURCE,
app_version,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-04-04'
AND SOURCE='dw'
GROUP BY SOURCE,
app_version,
province,
model,
deviceId) a
GROUP BY SOURCE,
app_version,
province,
model
4.sql4 按app_version,channel,province,model维度统计source为dw的前后台启动活跃数据。
sql4
SELECT SOURCE,
app_version,
channel,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECT SOURCE,
app_version,
channel,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-04-04'
AND SOURCE='dw'
GROUP BY SOURCE,
app_version,
channel,
province,
model,
deviceId) a
GROUP BY SOURCE,
app_version,
channel,
province,
model
5.sql5 按province,model维度统计source为所有平台的前后台启动活跃数据
sql5
SELECT 'all' AS SOURCE,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECT 'all' AS SOURCE,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-04-04'
GROUP BY 'all',
province,
model,
deviceId) a
GROUP BY 'all',
province,
model
6.sql6 按channel,province,model维度统计source为所有平台的前后台启动活跃数据
sql6
SELECT 'all' AS SOURCE,
channel,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECT 'all' AS SOURCE,
channel,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-04-04'
GROUP BY 'all',
channel,
province,
model,
deviceId) a
GROUP BY 'all',
channel,
province,
model
7.sql7 按app_version,province,model维度统计source为所有平台的前后台启动活跃数据
sql7
SELECT 'all' AS SOURCE,
app_version,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECT 'all' AS SOURCE,
app_version,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-04-04'
GROUP BY 'all',
app_version,
province,
model,
deviceId) a
GROUP BY 'all',
app_version,
province,
model
8.sql8 按app_version,channel,province,model维度统计source为所有平台的前后台启动活跃数据
sql8
SELECT 'all' AS SOURCE,
app_version,
channel,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECT 'all' AS SOURCE,
app_version,
channel,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-04-04'
GROUP BY 'all',
app_version,
channel,
province,
model,
deviceId) a
GROUP BY 'all',
app_version,
channel,
province,
model
9.sql9 合并source,app_version,channel,province,model维度组合统计前后台启动活跃数据
sql9
SELECT nvl(source,"all") as source,
app_version,
channel,
province,
model,
sum(front_start_pv) as front_start_pv,
sum(if(front_start_pv>0,1,0)) as front_start_uv,
sum(all_start_pv) as all_start_pv,
sum(1) as all_start_uv
FROM
(SELECT source,
nvl(app_version,"dw_null") as app_version,
nvl(channel,"dw_null") as channel,
nvl(province,"dw_null") as province,
nvl(model,"dw_null") as model,
deviceId,
sum(front_start_pv) as front_start_pv,
sum(all_start_pv) as all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-04-04'
GROUP BY source,
nvl(app_version,"dw_null"),
nvl(channel,"dw_null"),
nvl(province,"dw_null"),
nvl(model,"dw_null"),
deviceId) a
GROUP BY source,
app_version,
channel,
province,
model
GROUPING SETS(
(source,province,model),
(source,channel,province,model),
(source,app_version,province,model),
(source,app_version,channel,province,model),
(province,model),
(channel,province,model),
(app_version,province,model),
(app_version,channel,province,model)
)HAVING source='dw' OR source is null
sql行号 | 数据行 | time(秒) |
---|---|---|
sql1 | 191618 | 54.289 |
sql2 | 349692 | 186.34 |
sql3 | 811334 | 48.341 |
sql4 | 1059390 | 50.373 |
sql5 | 197314 | 66.308 |
sql6 | 405496 | 68.312 |
sql7 | 1077795 | 152.358 |
sql8 | 1376569 | 182.347 |
总和 | 5469208 | 808.668 |
sql行号 | 数据行 | time(秒) |
---|---|---|
sql9 | 5469208 | 134.432 |
说明:sql都是都测试队列相同集群环境下测试的。都是在container复用的情况下进行测试,排除了第一个sql从0申请资源执行时间过长的因素。
结论:采用grouping set语法进行sql合并能有效减少hive任务执行时间,同时以上9个sql占用的集群资源基本相当,优化后会大量减少集群资源的消耗。之后会测试数仓批量任务执行优化前后的时间对比,进行更全面的对比。
标签:province,pv,sum,sql,start,sets,model,app,grouping 来源: https://blog.csdn.net/a80090023/article/details/123184760