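Counting a user's calls and total call duration in each half-hour interval
Author: Internet
Hi everyone. Today I came across a use case for analytic (window) functions and want to share it.
Call table: the fields are user id, start time, and end time. Sample data (comma-delimited):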
aaa,2018-01-01 08:01:00,2018-01-01 08:08:00
aaa,2018-01-01 08:15:00,2018-01-01 08:20:00
aaa,2018-01-01 08:45:00,2018-01-01 08:48:00
Expected output: user id, earliest start time in each time period, number of calls in that period, and total duration (minutes):
aaa 2018-01-01 08:01:00 2 12
aaa 2018-01-01 08:45:00 1 3
The test table and the detailed steps follow.
create table login_start_end_time (userid string,start_date string,end_date string) row format delimited fields terminated by ',';
LOAD DATA LOCAL INPATH '/root/test/test.txt' INTO TABLE login_start_end_time;
hive> select * from login_start_end_time;
OK
aaa 2018-01-01 08:01:00 2018-01-01 08:08:00
aaa 2018-01-01 08:15:00 2018-01-01 08:20:00
aaa 2018-01-01 08:45:00 2018-01-01 08:48:00
---Step 1: compute the duration of each call (seconds) and the end time of the previous call
select userid,
       start_date,
       end_date,
       unix_timestamp(end_date) - unix_timestamp(start_date) as long_time,
       lag(end_date, 1, start_date) over (distribute by userid sort by start_date) as last_end_time
from login_start_end_time;
---intermediate MapReduce log omitted
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 12.08 sec HDFS Read: 9181 HDFS Write: 204 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 80 msec
OK
aaa 2018-01-01 08:01:00 2018-01-01 08:08:00 420 2018-01-01 08:01:00
aaa 2018-01-01 08:15:00 2018-01-01 08:20:00 300 2018-01-01 08:08:00
aaa 2018-01-01 08:45:00 2018-01-01 08:48:00 180 2018-01-01 08:20:00
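To cross-check the step 1 numbers outside Hive, here is a minimal Python sketch of the same idea. It is not part of the original post: the names `rows`, `seconds_between`, and `step1` are made up for illustration, and it assumes plain timestamp arithmetic with no time zone or daylight-saving effects, which `unix_timestamp` may apply depending on the session time zone.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Sample rows from the post: (userid, start_date, end_date)
rows = [
    ("aaa", "2018-01-01 08:01:00", "2018-01-01 08:08:00"),
    ("aaa", "2018-01-01 08:15:00", "2018-01-01 08:20:00"),
    ("aaa", "2018-01-01 08:45:00", "2018-01-01 08:48:00"),
]


def seconds_between(start, end):
    """Seconds from start to end, mirroring unix_timestamp(end) - unix_timestamp(start)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


def step1(rows):
    """Add long_time and last_end_time to each row, per user, ordered by start_date."""
    out = []
    prev_end = {}  # userid -> end_date of that user's previous call
    for userid, start, end in sorted(rows, key=lambda r: (r[0], r[1])):
        # lag(end_date, 1, start_date): the first call falls back to its own start_date
        last_end = prev_end.get(userid, start)
        out.append((userid, start, end, seconds_between(start, end), last_end))
        prev_end[userid] = end
    return out


step1_rows = step1(rows)
for row in step1_rows:
    print(*row)
```

The dictionary keyed by `userid` plays the role of `distribute by userid`, and sorting by `start_date` plays the role of `sort by start_date`.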
---Step 2: compute the cumulative duration and the gap between each call and the previous one
select userid,
       start_date,
       end_date,
       long_time,
       sum(long_time) over (distribute by userid sort by start_date) as sum_log,
       unix_timestamp(start_date) - unix_timestamp(last_end_time) as diff_long
from (
  select userid,
         start_date,
         end_date,
         unix_timestamp(end_date) - unix_timestamp(start_date) as long_time,
         lag(end_date, 1, start_date) over (distribute by userid sort by start_date) as last_end_time
  from login_start_end_time
) t;
--intermediate log omitted
Total MapReduce CPU Time Spent: 25 seconds 120 msec
OK
aaa 2018-01-01 08:01:00 2018-01-01 08:08:00 420 420 0
aaa 2018-01-01 08:15:00 2018-01-01 08:20:00 300 720 420
aaa 2018-01-01 08:45:00 2018-01-01 08:48:00 180 900 1500
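The running total and the gap can be reproduced with a short Python sketch, given here only as a sanity check on the output above. The input is the step 1 output for user aaa, copied by hand; the variable names are illustrative. `itertools.accumulate` matches the windowed `sum` here because the `start_date` values are distinct. With tied `start_date` values, Hive's default window frame would give the tied rows the same total.

```python
from datetime import datetime
from itertools import accumulate

FMT = "%Y-%m-%d %H:%M:%S"

# Step 1 output from the post for user aaa: (start_date, long_time, last_end_time)
step1_rows = [
    ("2018-01-01 08:01:00", 420, "2018-01-01 08:01:00"),
    ("2018-01-01 08:15:00", 300, "2018-01-01 08:08:00"),
    ("2018-01-01 08:45:00", 180, "2018-01-01 08:20:00"),
]


def to_dt(text):
    """Parse a 'yyyy-MM-dd HH:mm:ss' string into a datetime."""
    return datetime.strptime(text, FMT)


# Running total of long_time in start_date order (sum_log in the query)
sum_log = list(accumulate(long_time for _, long_time, _ in step1_rows))

# Seconds between this call's start and the previous call's end (diff_long in the query)
diff_long = [
    int((to_dt(start) - to_dt(last_end)).total_seconds())
    for start, _, last_end in step1_rows
]

print(sum_log)    # [420, 720, 900]
print(diff_long)  # [0, 420, 1500]
```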
---Step 3: add the cumulative duration and the gap, divide by 30 minutes, and take the floor to assign each row to a 30-minute period
select userid,
       start_date,
       end_date,
       long_time,
       floor((sum_log + diff_long) / (30 * 60)) as time_inter
from (
  select userid,
         start_date,
         end_date,
         long_time,
         sum(long_time) over (distribute by userid sort by start_date) as sum_log,
         unix_timestamp(start_date) - unix_timestamp(last_end_time) as diff_long
  from (
    select userid,
           start_date,
           end_date,
           unix_timestamp(end_date) - unix_timestamp(start_date) as long_time,
           lag(end_date, 1, start_date) over (distribute by userid sort by start_date) as last_end_time
    from login_start_end_time
  ) t
) d;
--intermediate log omitted
Total MapReduce CPU Time Spent: 25 seconds 30 msec
OK
aaa 2018-01-01 08:01:00 2018-01-01 08:08:00 420 0
aaa 2018-01-01 08:15:00 2018-01-01 08:20:00 300 0
aaa 2018-01-01 08:45:00 2018-01-01 08:48:00 180 1
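The bucket arithmetic in step 3 is easy to verify by hand. The sketch below is an illustration only: it takes the `sum_log` and `diff_long` values from the step 2 output and applies the same floor division, with `HALF_HOUR` and `step2_values` as made-up names.

```python
HALF_HOUR = 30 * 60  # 1800 seconds

# (sum_log, diff_long) pairs copied from the step 2 output in the post
step2_values = [(420, 0), (720, 420), (900, 1500)]

# Integer division stands in for floor((sum_log + diff_long) / (30 * 60))
time_inter = [
    (sum_log + diff_long) // HALF_HOUR for sum_log, diff_long in step2_values
]

print(time_inter)  # [0, 0, 1]
```

One thing the numbers make visible: `diff_long` holds only the gap before the current call, not a running total of gaps. For the third row, `sum_log + diff_long` is 2400 seconds, while 08:48:00 is 2820 seconds after 08:01:00. Both fall in bucket 1 for this sample, but with other data the two could land in different buckets, so the formula is worth checking against your own rows.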
---Step 4: group by time period and compute each period's earliest start time, call count, and total duration (minutes)
select userid,
       time_inter + 1,
       min(start_date) as start_date,
       count(1) cnt,
       sum(long_time) / 60 as long_time
from (
  select userid,
         start_date,
         end_date,
         long_time,
         floor((sum_log + diff_long) / (30 * 60)) as time_inter
  from (
    select userid,
           start_date,
           end_date,
           long_time,
           sum(long_time) over (distribute by userid sort by start_date) as sum_log,
           unix_timestamp(start_date) - unix_timestamp(last_end_time) as diff_long
    from (
      select userid,
             start_date,
             end_date,
             unix_timestamp(end_date) - unix_timestamp(start_date) as long_time,
             lag(end_date, 1, start_date) over (distribute by userid sort by start_date) as last_end_time
      from login_start_end_time
    ) t
  ) d
) d1
group by userid, time_inter + 1;
--intermediate log omitted
Total MapReduce CPU Time Spent: 39 seconds 0 msec
OK
aaa 1 2018-01-01 08:01:00 2 12.0
aaa 2 2018-01-01 08:45:00 1 3.0
Time taken: 294.195 seconds, Fetched: 2 row(s)
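The final aggregation is an ordinary group-by. As a rough cross-check, the Python sketch below groups the step 3 rows by user and period number. It is an illustration with made-up names (`step3_rows`, `groups`, `result`), not the post's method. Comparing `start_date` values as strings works because the zero-padded timestamp format sorts chronologically.

```python
# Step 3 output from the post: (userid, start_date, long_time, time_inter)
step3_rows = [
    ("aaa", "2018-01-01 08:01:00", 420, 0),
    ("aaa", "2018-01-01 08:15:00", 300, 0),
    ("aaa", "2018-01-01 08:45:00", 180, 1),
]

# (userid, period number) -> (earliest start_date, call count, total seconds)
groups = {}
for userid, start, long_time, time_inter in step3_rows:
    key = (userid, time_inter + 1)  # period numbers start at 1, as in the query
    first_start, cnt, total = groups.get(key, (start, 0, 0))
    groups[key] = (min(first_start, start), cnt + 1, total + long_time)

# Convert the total from seconds to minutes, like sum(long_time) / 60
result = [
    (userid, period, first_start, cnt, total / 60)
    for (userid, period), (first_start, cnt, total) in sorted(groups.items())
]

for row in result:
    print(*row)
```

For the sample data this prints the same two rows as the Hive output: period 1 with 2 calls and 12.0 minutes, and period 2 with 1 call and 3.0 minutes.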
My takeaway: the key is to build the time-period field (time_inter), which tells you which rows belong to the same period.
Source: https://blog.csdn.net/zhaoxiangchong/article/details/115319221