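Big Data in Practice (34): E-commerce Data Warehouse (27): User Behavior Data Warehouse (13): User Retention Topic
Author: Internet
1 Requirements
1.1 The concept of user retention
1.2 Requirement description
User retention analysis
2 DWS layer
2.1 DWS layer (daily retained-user detail table)
1) Table creation statement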
hive (gmall)>
drop table if exists dws_user_retention_day;
create external table dws_user_retention_day
(
    `mid_id` string COMMENT 'unique device ID',
    `user_id` string COMMENT 'user ID',
    `version_code` string COMMENT 'app version code',
    `version_name` string COMMENT 'app version name',
    `lang` string COMMENT 'system language',
    `source` string COMMENT 'channel ID',
    `os` string COMMENT 'Android OS version',
    `area` string COMMENT 'region',
    `model` string COMMENT 'phone model',
    `brand` string COMMENT 'phone brand',
    `sdk_version` string COMMENT 'sdkVersion',
    `gmail` string COMMENT 'gmail',
    `height_width` string COMMENT 'screen width and height',
    `app_time` string COMMENT 'time the log was generated on the client',
    `network` string COMMENT 'network mode',
    `lng` string COMMENT 'longitude',
    `lat` string COMMENT 'latitude',
    `create_date` string comment 'date the device was first added',
    `retention_day` int comment 'days retained as of the current date'
)
COMMENT 'Daily user retention'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_user_retention_day/';
2)
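============================= Retention topic =========================
----------------------------- Requirement 1. DWS layer: daily retained-user detail table -----------------------
----------------------------- Related tables ---------------------
dws_new_mid_day: daily new-user (new-device) table
dws_uv_detail_day: daily active-user table
----------------------------- Approach -----------------------
Detail fields: taken from dws_uv_detail_day (the daily active-user table)
create_date: the date the device was first added (the day it became a new user)
    looked up in dws_new_mid_day by mid_id
retention_day: number of days the device has been retained as of the current date
    dt (the date of the daily active data) = create_date + retention_day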
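The date relationship above can be checked with a small sketch. This is only an illustration in Python; the helper names `retention_day` and `cohort_date` are hypothetical and do not appear in the original article.

```python
from datetime import date, timedelta


def retention_day(create_date: str, dt: str) -> int:
    """Days between the date a device was first added and the activity date."""
    return (date.fromisoformat(dt) - date.fromisoformat(create_date)).days


def cohort_date(dt: str, n: int) -> str:
    """create_date of the cohort whose n-day retention is observed on dt.

    Mirrors the date_sub(dt, n) expression used in the HiveQL queries.
    """
    return (date.fromisoformat(dt) - timedelta(days=n)).isoformat()


# A device added on 2020-02-14 and active on 2020-02-15 is a 1-day retained device.
print(retention_day("2020-02-14", "2020-02-15"))
# The 3-day retention observed on 2020-02-15 belongs to the 2020-02-12 cohort.
print(cohort_date("2020-02-15", 3))
```

In other words, for a fixed dt each value of retention_day selects exactly one create_date, which is why the queries filter dws_new_mid_day on a single date per retention window.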
----------------------------- SQL ------------------------
-- Compute 1-day retention
-- Filter each side first, then join, so that less data takes part in the join
SELECT
t1.mid_id,
t1.user_id,
t1.version_code,
t1.version_name,
t1.lang,
t1.source,
t1.os,
t1.area,
t1.model,
t1.brand,
t1.sdk_version,
t1.gmail,
t1.height_width,
t1.app_time,
t1.network,
t1.lng,
t1.lat,
t2.create_date,
1 retention_day,
'2020-02-15' dt
FROM
(SELECT * from gmall.dws_uv_detail_day where dt='2020-02-15') t1
JOIN
(select mid_id,create_date from gmall.dws_new_mid_day where create_date=date_sub('2020-02-15',1)) t2
on t1.mid_id=t2.mid_id;
---------------------- Retention details for 1, 2, 3, ... n days ----------------------------
insert overwrite TABLE dws_user_retention_day PARTITION(dt='2020-02-15')
SELECT
t1.mid_id,
t1.user_id,
t1.version_code,
t1.version_name,
t1.lang,
t1.source,
t1.os,
t1.area,
t1.model,
t1.brand,
t1.sdk_version,
t1.gmail,
t1.height_width,
t1.app_time,
t1.network,
t1.lng,
t1.lat,
t2.create_date,
1 retention_day
FROM
(SELECT * from gmall.dws_uv_detail_day where dt='2020-02-15') t1
JOIN
(select mid_id,create_date from gmall.dws_new_mid_day where create_date=date_sub('2020-02-15',1)) t2
on t1.mid_id=t2.mid_id
UNION all
SELECT
t1.mid_id,
t1.user_id,
t1.version_code,
t1.version_name,
t1.lang,
t1.source,
t1.os,
t1.area,
t1.model,
t1.brand,
t1.sdk_version,
t1.gmail,
t1.height_width,
t1.app_time,
t1.network,
t1.lng,
t1.lat,
t2.create_date,
2 retention_day
FROM
(SELECT * from gmall.dws_uv_detail_day where dt='2020-02-15') t1
JOIN
(select mid_id,create_date from gmall.dws_new_mid_day where create_date=date_sub('2020-02-15',2)) t2
on t1.mid_id=t2.mid_id
UNION all
SELECT
t1.mid_id,
t1.user_id,
t1.version_code,
t1.version_name,
t1.lang,
t1.source,
t1.os,
t1.area,
t1.model,
t1.brand,
t1.sdk_version,
t1.gmail,
t1.height_width,
t1.app_time,
t1.network,
t1.lng,
t1.lat,
t2.create_date,
3 retention_day
FROM
(SELECT * from gmall.dws_uv_detail_day where dt='2020-02-15') t1
JOIN
(select mid_id,create_date from gmall.dws_new_mid_day where create_date=date_sub('2020-02-15',3)) t2
on t1.mid_id=t2.mid_id;
-- UNION ALL requires every SELECT being combined to return the same number of columns, with matching types.
-- UNION removes duplicate rows; UNION ALL keeps them.
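The three UNION ALL branches differ only in the retention window, so the same rows can also be produced by one join that derives retention_day from the two dates. The sketch below is an alternative, not the article's method. It runs on SQLite through Python so that it is self-contained; the sample rows are hypothetical and the tables are trimmed to the columns the join needs. In Hive the equivalent expressions would be datediff(t1.dt, t2.create_date) and date_sub.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dws_uv_detail_day (mid_id TEXT, dt TEXT);
    CREATE TABLE dws_new_mid_day  (mid_id TEXT, create_date TEXT);
    -- Hypothetical sample rows: m4 was added 5 days earlier, outside the window
    INSERT INTO dws_new_mid_day VALUES
        ('m1', '2020-02-14'), ('m2', '2020-02-13'),
        ('m3', '2020-02-12'), ('m4', '2020-02-10');
    INSERT INTO dws_uv_detail_day VALUES
        ('m1', '2020-02-15'), ('m2', '2020-02-15'),
        ('m3', '2020-02-15'), ('m4', '2020-02-15');
""")

# One join covers the 1-, 2- and 3-day windows; retention_day is computed, not hard-coded.
rows = conn.execute("""
    SELECT t1.mid_id,
           t2.create_date,
           CAST(julianday(t1.dt) - julianday(t2.create_date) AS INTEGER) AS retention_day
    FROM (SELECT * FROM dws_uv_detail_day WHERE dt = :dt) t1
    JOIN (SELECT mid_id, create_date FROM dws_new_mid_day
          WHERE create_date BETWEEN date(:dt, '-3 day') AND date(:dt, '-1 day')) t2
      ON t1.mid_id = t2.mid_id
    ORDER BY retention_day
""", {"dt": "2020-02-15"}).fetchall()

for row in rows:
    print(row)
```

Both sides are still filtered before the join, as in the original queries. Widening the window then only means changing the lower bound of the BETWEEN range rather than adding another branch.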
2.3 Difference between UNION and UNION ALL
1) Prepare two tables
tableA                  tableB
id  name  score         id  name  score
1   a     80            1   d     48
2   b     79            2   e     23
3   c     68            3   c     86
2) Query with union
select name from tableA
union
select name from tableB
Query result (row order is not guaranteed)
name
a
d
b
e
c
3) Query with union all
select name from tableA
union all
select name from tableB
Query result
name
a
b
c
d
e
c
4) Summary
(1) UNION de-duplicates the combined result set, so it is slower than UNION ALL
(2) UNION ALL does not de-duplicate the result set, so it is faster
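The difference can be reproduced with the two sample tables above. This sketch uses SQLite through Python only so that it runs without a Hive installation; the UNION and UNION ALL semantics shown are standard SQL. The results are sorted before printing because neither operator guarantees row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INTEGER, name TEXT, score INTEGER);
    CREATE TABLE tableB (id INTEGER, name TEXT, score INTEGER);
    INSERT INTO tableA VALUES (1, 'a', 80), (2, 'b', 79), (3, 'c', 68);
    INSERT INTO tableB VALUES (1, 'd', 48), (2, 'e', 23), (3, 'c', 86);
""")

# UNION: the name 'c' appears in both tables but is returned once.
union_names = sorted(
    r[0] for r in conn.execute("SELECT name FROM tableA UNION SELECT name FROM tableB")
)

# UNION ALL: every row from both inputs is kept, so 'c' is returned twice.
union_all_names = sorted(
    r[0] for r in conn.execute("SELECT name FROM tableA UNION ALL SELECT name FROM tableB")
)

print(union_names)      # ['a', 'b', 'c', 'd', 'e']
print(union_all_names)  # ['a', 'b', 'c', 'c', 'd', 'e']
```

In the retention query the branches select different create_date values, so they cannot produce the same row twice and UNION ALL avoids a de-duplication step that would change nothing.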
3 ADS layer
3.1 Retained user count
1) Table creation statement
hive (gmall)>
drop table if exists ads_user_retention_day_count;
create external table ads_user_retention_day_count
(
    `create_date` string comment 'date the device was first added',
    `retention_day` int comment 'days retained as of the current date',
    `retention_count` bigint comment 'number of retained devices'
)
COMMENT 'Daily user retention'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_retention_day_count/';
2)
----------------------------- Requirement 2. Populate ads_user_retention_day_count with the daily retained-user count -----------------------
----------------------------- Related tables ---------------------
dws_user_retention_day
----------------------------- Approach -----------------------
create_date: read from dws_user_retention_day
retention_day: read from dws_user_retention_day
retention_count: computed with count(*)
    First filter by create_date to keep the device details of users added on the given date,
    then group by retention_day and apply count(*)
----------------------------- SQL ------------------------
insert into table gmall.ads_user_retention_day_count
SELECT
'2020-02-14',
retention_day,
count(*)
FROM gmall.dws_user_retention_day
where create_date='2020-02-14'
group by retention_day;
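A minimal sketch of the same filter, group and count steps on hypothetical rows. It runs on SQLite through Python so it is self-contained, and the detail table is trimmed to the columns the aggregation reads.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dws_user_retention_day
        (mid_id TEXT, create_date TEXT, retention_day INTEGER, dt TEXT);
    -- Hypothetical rows: two devices of the 02-14 cohort return after 1 day,
    -- one of them returns again after 2 days; m9 belongs to another cohort.
    INSERT INTO dws_user_retention_day VALUES
        ('m1', '2020-02-14', 1, '2020-02-15'),
        ('m2', '2020-02-14', 1, '2020-02-15'),
        ('m1', '2020-02-14', 2, '2020-02-16'),
        ('m9', '2020-02-13', 2, '2020-02-15');
""")

# Filter to one cohort, then count the retained devices per retention window.
counts = conn.execute("""
    SELECT :d AS create_date, retention_day, COUNT(*) AS retention_count
    FROM dws_user_retention_day
    WHERE create_date = :d
    GROUP BY retention_day
    ORDER BY retention_day
""", {"d": "2020-02-14"}).fetchall()

print(counts)
```

Because the WHERE clause fixes create_date to a single value, selecting it as a constant (as the article's query does) and grouping only by retention_day gives one row per retention window for that cohort.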
3.2 Retention rate
1) Table creation statement
hive (gmall)>
drop table if exists ads_user_retention_day_rate;
create external table ads_user_retention_day_rate
(
    `stat_date` string comment 'statistics date',
    `create_date` string comment 'date the device was first added',
    `retention_day` int comment 'days retained as of the current date',
    `retention_count` bigint comment 'number of retained devices',
    `new_mid_count` bigint comment 'number of devices added on that day',
    `retention_ratio` decimal(10,2) comment 'retention rate'
)
COMMENT 'Daily user retention'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_retention_day_rate/';
2)
----------------------------- Requirement 3. Compute the retention rate -----------------------
----------------------------- Related tables ---------------------
ads_user_retention_day_count
ads_new_mid_count
Both tables describe the same cohort of newly added devices, so the device creation date is the join key
----------------------------- Approach -----------------------
`stat_date` : normally the day being reported on or the day after; never earlier than the date of the data being counted
`create_date` : read from ads_user_retention_day_count
`retention_day` : read from ads_user_retention_day_count
`retention_count` : read from ads_user_retention_day_count
`new_mid_count` : the number of devices added on create_date, read from ads_new_mid_count
`retention_ratio` : retention_count/new_mid_count
----------------------------- SQL ------------------------
insert into table ads_user_retention_day_rate
SELECT
'2020-02-16',
ur.create_date,
ur.retention_day,
ur.retention_count,
nm.new_mid_count,
cast (ur.retention_count / nm.new_mid_count as decimal(10,2))
FROM
ads_user_retention_day_count ur
JOIN
ads_new_mid_count nm
on ur.create_date=nm.create_date
where date_add(ur.create_date,ur.retention_day)='2020-02-16';
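The join and the ratio can be sketched on hypothetical numbers as follows. This runs on SQLite through Python, so the date arithmetic uses SQLite's date() modifiers in place of Hive's date_add, and ads_new_mid_count is assumed to have only the two columns the query reads. The multiplication by 1.0 is needed because SQLite divides integers as integers; it is a property of this stand-in, not of the Hive query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ads_user_retention_day_count
        (create_date TEXT, retention_day INTEGER, retention_count INTEGER);
    CREATE TABLE ads_new_mid_count (create_date TEXT, new_mid_count INTEGER);
    -- Hypothetical figures
    INSERT INTO ads_user_retention_day_count VALUES
        ('2020-02-15', 1, 30), ('2020-02-14', 2, 50), ('2020-02-14', 1, 40);
    INSERT INTO ads_new_mid_count VALUES ('2020-02-15', 120), ('2020-02-14', 100);
""")

# Keep only the rows whose retention window ends on the statistics date,
# then divide the retained count by the size of the matching cohort.
rates = conn.execute("""
    SELECT :stat AS stat_date,
           ur.create_date,
           ur.retention_day,
           ur.retention_count,
           nm.new_mid_count,
           ROUND(ur.retention_count * 1.0 / nm.new_mid_count, 2) AS retention_ratio
    FROM ads_user_retention_day_count ur
    JOIN ads_new_mid_count nm ON ur.create_date = nm.create_date
    WHERE date(ur.create_date, '+' || ur.retention_day || ' day') = :stat
    ORDER BY ur.retention_day
""", {"stat": "2020-02-16"}).fetchall()

for row in rates:
    print(row)
```

The row for create_date 2020-02-14 with retention_day 1 is excluded because its window ends on 2020-02-15, not on the statistics date.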
Source: https://www.cnblogs.com/qiu-hua/p/13542762.html