知行教育_访问咨询主题-增量采集
作者:互联网
文章目录
访问咨询主题-增量采集注意点:
业务系统中,数据表的后缀是 年_月 ,这就表明随着时间推移,被
采集的表的后缀是动态变化的。
我们的脚本也要做到这一点
这个功能要做成自动化的脚本,脚本能够每天定时执行
增量采集,一天执行一次
当天采集昨日数据
业务数据库 - > ODS的流程
业务系统中,数据表的后缀是 年_月 ,这就表明随着时间推移,被
采集的表的后缀是动态变化的。
我们的增量采集是按天来的,采集的数据能够带上按天过滤的条件
#!/bin/bash
# 需求:能够手动传入参数指定采集的日期。 如果不传入参数要求能够自动的识别前一天的日期
if [ $# -eq 0 ]
then
DATESTR=`date -d "-1 day" +%Y-%m-%d`
else
DATESTR=$1
fi
# Define var
SQOOP_HOME=/export/servers/sqoop
JDBCSTR="jdbc:mysql://192.168.10.10:3306/nev"
MYSQL_USERNAME=root
YEARSTR=`date -d "-1 day" +%Y`
MONTHSTR=`date -d "-1 day" +%m`
HIVE_DB=itcast_ods
MAP_NUMBER=3
echo "执行采集web_chat_ems_${YEARSTR}_${MONTHSTR}的任务......"
$SQOOP_HOME import \
--connect $JDBCSTR \
--username $MYSQL_USERNAME --password $MYSQL_PASSWORD \
--query "SELECT
id,create_date_time,session_id,sid,create_time,
seo_source,seo_keywords,ip,area,country,province,
city,origin_channel,user AS user_match,manual_time,
begin_time,end_time,last_customer_msg_time_stamp,
last_agent_msg_time_stamp,reply_msg_count,msg_count,
browser_name,os_info, CURRENT_DATE() AS start_time
FROM web_chat_ems_${YEARSTR}_${MONTHSTR}
WHERE create_time >= '${DATESTR} 00:00:00' AND create_time <= '${DATESTR} 23:59:59' and \$CONDITIONS" \
--fields-terminated-by '\t' \
--hcatalog-database $HIVE_DB \
--hcatalog-table web_chat_ems \
-m $MAP_NUMBER \
--split-by id
wait
echo "执行采集web_chat_text_ems_${YEARSTR}_${MONTHSTR}的任务......"
$SQOOP_HOME import \
--connect $JDBCSTR \
--username $MYSQL_USERNAME --password $MYSQL_PASSWORD \
--query "SELECT
id,referrer,from_url,landing_page_url,
url_title,platform_description,
other_params,history, CURRENT_DATE() AS start_time
FROM web_chat_text_ems_${YEARSTR}_${MONTHSTR}
WHERE 1=1 AND \$CONDITIONS" \
--fields-terminated-by '\t' \
--hcatalog-database $HIVE_DB \
--hcatalog-table web_chat_text_ems \
-m $MAP_NUMBER \
--split-by id
配置ODS -> DWD的流程
# 配置从ODS层到DWD层的脚本
# 需求:能够手动传入参数指定采集的日期。 如果不传入参数要求能够自动的识别前一天的日期
if [ $# -eq 0 ]
then
DATESTR=`date -d "-1 day" +%Y-%m-%d`
else
DATESTR=$1
fi
HIVE_HOME="/usr/bin/hive"
YEARSTR=`date -d "${DATESTR}" +%Y`
MONTHSTR=`date -d "${DATESTR}" +%m`
DAYSTR=`date -d "${DATESTR}" +%d`
# 计算季度
if [ ${MONTHSTR} -ge 1 ] && [ ${MONTHSTR} -le 3 ]
then
QUARTERSTR=1
elif [ ${MONTHSTR} -ge 4 ] && [ ${MONTHSTR} -le 6 ]
then
QUARTERSTR=2
elif [ ${MONTHSTR} -ge 7 ] && [ ${MONTHSTR} -le 9 ]
then
QUARTERSTR=3
else
QUARTERSTR=4
fi
echo "即将执行:${DATESTR} 这一天的ODS -> DWD的过程......"
$HIVE_HOME -e "
-- 动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- 写入时ORC压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
INSERT INTO TABLE itcast_dwd.visit_consult_dwd partition(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
w1.session_id,
w1.sid,
UNIX_TIMESTAMP(w1.create_time, 'yyyy-MM-dd HH:mm:ss') AS create_time,
w1.seo_source,
w1.ip,
w1.area,
w1.country,
w1.province,
w1.city,
CAST(IF (w1.msg_count IS NULL, 0, w1.msg_count) AS INT) AS msg_count,
w1.origin_channel,
w2.referrer,
w2.from_url,
w2.landing_page_url,
w2.url_title,
w2.platform_description,
w2.other_params,
w2.history,
SUBSTRING(w1.create_time, 12, 2) AS hourinfo,
SUBSTRING(w1.create_time, 1, 4) AS yearinfo,
QUARTER(w1.create_time) AS quarterinfo,
SUBSTRING(w1.create_time, 6, 2) AS monthinfo,
SUBSTRING(w1.create_time, 9, 2) AS dayinfo
FROM (SELECT * FROM itcast_ods.web_chat_ems WHERE start_time='${DATESTR}') AS w1 INNER JOIN itcast_ods.web_chat_text_ems AS w2
ON w1.id = w2.id;"
DWD -> DWS 流程
问题:分析针对增量数据的不同做法以及优缺点
我们DWS中是包含了年、季度、月等维度的结果数据
由于新增了一天的数据,当前年、当前季度、当前月的结果数据就失效了
需要重新计算
问题在于:DWS表用的失效数据如何处理。
问题解决方式
方式1
删除过期数据
优点:
- 对BI友好,没有历史数据的混淆,直接取就是最新的
- 表中的数据清晰
缺点:
- 执行复杂
- 破坏了数仓设计中尽量不删除的原则
方式2:
增加新列,表名当前数据计算的时间
用的时候,取最新时间的即可
优点:
- 表中存储了历史结果的变更
- 不会执行删除,不破坏数仓的原则
缺点:
- 对BI分析来说,需要先过滤最新的数据(稍微不友好)
- 修改表结构(全量操作要重来)
方式3:
增加新表
每一天的结果都生成一个表(一天一个表)
优点:
- 每个表都是清晰的,指向了具体某天的结果
- 通过多个表的方式也记录的历史结果的变更
缺点:
- 数据冗余过多(只要业务需要,冗余不是事)
- 对BI不友好(换一天,换一个表,如果BI不支持动态规则配置自动变更表,就要手动变更)
这三种方式,没有绝对的谁好谁坏
合适的需求下,选择合适的方式即可。
过期数据的删除
# 删除过期年的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='当前年', quarterinfo='-1', monthinfo='-1', dayinfo='-1');
# 删除过期季度的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='当前年', quarterinfo='当前季度', monthinfo='-1', dayinfo='-1');
# 删除过期月的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='当前年', quarterinfo='当前季度', monthinfo='当前月', dayinfo='-1');
示例脚本
# DWD层数据聚合到DWS层的操作脚本(增量)
# 需求:能够手动传入参数指定采集的日期。 如果不传入参数要求能够自动的识别前一天的日期
if [ $# -eq 0 ]
then
DATESTR=`date -d "-1 day" +%Y-%m-%d`
else
DATESTR=$1
fi
HIVE_HOME="/usr/bin/hive"
YEARSTR=`date -d "${DATESTR}" +%Y`
MONTHSTR=`date -d "${DATESTR}" +%m`
DAYSTR=`date -d "${DATESTR}" +%d`
# 计算季度
if [ ${MONTHSTR} -ge 1 ] && [ ${MONTHSTR} -le 3 ]
then
QUARTERSTR=1
elif [ ${MONTHSTR} -ge 4 ] && [ ${MONTHSTR} -le 6 ]
then
QUARTERSTR=2
elif [ ${MONTHSTR} -ge 7 ] && [ ${MONTHSTR} -le 9 ]
then
QUARTERSTR=3
else
QUARTERSTR=4
fi
echo "先删除过期的${YEARSTR}年, ${QUARTERSTR}季度, ${MONTHSTR}月数据"
$HIVE_HOME -e "
# 删除过期年的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='${YEARSTR}', quarterinfo='-1', monthinfo='-1', dayinfo='-1');
# 删除过期季度的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='${YEARSTR}', quarterinfo='${QUARTERSTR}', monthinfo='-1', dayinfo='-1');
# 删除过期月的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='${YEARSTR}', quarterinfo='${QUARTERSTR}', monthinfo='${MONTHSTR}', dayinfo='-1');"
echo "执行新数据的计算,计算的日期是:${DATESTR}"
$HIVE_HOME -e "
/*
指标:访问量
维度:时间(年、季度、月、天、小时)、来源渠道、受访页面、搜索渠道、区域维度
*/
-- 时间维度表开发
-- 动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- 写入时ORC压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
-- 统计某年的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
'-1' AS country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
'-1' AS hourinfo,
yearinfo AS time_str,
'-1' AS from_url,
'5' AS groupType,
'5' AS time_type,
yearinfo,
'-1' AS quarterinfo,
'-1' AS monthinfo,
'-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}'
GROUP BY yearinfo;
-- 统计某年某季度的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
'-1' AS country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
'-1' AS hourinfo,
CONCAT(yearinfo, '-Q', quarterinfo) AS time_str,
'-1' AS from_url,
'5' AS groupType,
'4' AS time_type,
yearinfo,
quarterinfo,
'-1' AS monthinfo,
'-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}'
GROUP BY yearinfo, quarterinfo;
-- 统计某年某季度某月的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
'-1' AS country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
'-1' AS hourinfo,
CONCAT(yearinfo, '-', monthinfo) AS time_str,
'-1' AS from_url,
'5' AS groupType,
'3' AS time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo;
-- 统计某年某季度某月某日的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
'-1' AS country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
'-1' AS hourinfo,
CONCAT(yearinfo, '-', monthinfo, '-', dayinfo) AS time_str,
'-1' AS from_url,
'5' AS groupType,
'2' AS time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo;
-- 统计某年某季度某月某日某小时的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
'-1' AS country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
hourinfo,
CONCAT(yearinfo, '-', monthinfo, '-', dayinfo, ' ', hourinfo) AS time_str,
'-1' AS from_url,
'5' AS groupType,
'1' AS time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;
-- 按照时间和其它维度汇合进行聚合
-- 按照时间和国家进行组合计算
-- 动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- 写入时ORC压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
-- 统计某年某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
'-1' AS hourinfo,
yearinfo AS time_str,
'-1' AS from_url,
'1' AS groupType,
'5' AS time_type,
yearinfo,
'-1' AS quarterinfo,
'-1' AS monthinfo,
'-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}'
GROUP BY yearinfo,country;
-- 统计某年某季度某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
'-1' AS hourinfo,
CONCAT(yearinfo, '-Q', quarterinfo) AS time_str,
'-1' AS from_url,
'1' AS groupType,
'4' AS time_type,
yearinfo,
quarterinfo,
'-1' AS monthinfo,
'-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}'
GROUP BY yearinfo, quarterinfo, country;
-- 统计某年某季度某月某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
'-1' AS hourinfo,
CONCAT(yearinfo, '-', monthinfo) AS time_str,
'-1' AS from_url,
'1' AS groupType,
'3' AS time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, country;
-- 统计某年某季度某月某天某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
'-1' AS hourinfo,
CONCAT(yearinfo, '-', monthinfo, '-', dayinfo) AS time_str,
'-1' AS from_url,
'1' AS groupType,
'2' AS time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo, country;
-- 统计某年某季度某月某天某小时某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
COUNT(DISTINCT sid) AS sid_total,
COUNT(DISTINCT session_id) AS sessionid_total,
COUNT(DISTINCT ip) AS ip_total,
country,
'-1' AS province,
'-1' AS city,
'-1' AS seo_source,
'-1' AS origin_channel,
hourinfo,
CONCAT(yearinfo, '-', monthinfo, '-', dayinfo, ' ', hourinfo) AS time_str,
'-1' AS from_url,
'1' AS groupType,
'1' AS time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo, country;
"
DWS -> MySQL
如果DWS用的是方式1,MySQL中也要删除过期
导出脚本执行之前需要在MySQL中删除即可
mysql -uroot -p123456 scrm_bi -e "
# 删除过期年
delete from itcast_visit WHERE yearinfo='${YEARSTR}' AND quarterinfo='-1' AND monthinfo='-1' AND dayinfo='-1';
# 删除过期季度
delete from itcast_visit WHERE yearinfo='${YEARSTR}' AND quarterinfo='${QUARTERSTR}' AND monthinfo='-1' AND dayinfo='-1';
# 删除过期月
delete from itcast_visit WHERE yearinfo='${YEARSTR}' AND quarterinfo='${QUARTERSTR}' AND monthinfo='${MONTHSTR}' AND dayinfo='-1';
"
sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root --password 123456 \
--table itcast_visit \
--hcatalog-database itcast_dws \
--hcatalog-table visit_dws \
--hcatalog-partition-keys yearinfo \
--hcatalog-partition-values ${YEARSTR} \
-m 1
标签:dayinfo,quarterinfo,itcast,采集,time,monthinfo,访问,yearinfo,增量 来源: https://blog.51cto.com/u_15277063/2929914