hive 开窗函数OVER(PARTITION)详解(二)
作者:互联网
窗口函数有哪些?
窗口函数可以分为以下 3 类:
- 聚合(Aggregate):AVG(), COUNT(), MIN(), MAX(), SUM()...
sum(col) over() : 分组对col累计求和
count(col) over() : 分组对col累计计数
min(col) over() : 分组对col求最小
max(col) over() : 分组求col的最大值
avg(col) over() : 分组求col列的平均值
- 取值(Value):FIRST_VALUE(), LAST_VALUE(), LEAD(), LAG()...
first_value(col) over() : 某分区排序后的第一个col值
last_value(col) over() : 某分区排序后的最后一个col值
lag(col,n,DEFAULT) : 统计往前n行的col值,n可选,默认为1,DEFAULT当往上第n行为NULL时候,取默认值,如不指定,则为NULL
lead(col,n,DEFAULT) : 统计往后n行的col值,n可选,默认为1,DEFAULT当往下第n行为NULL时候,取默认值,如不指定,则为NULL
- 排序(Ranking):RANK(), DENSE_RANK(), ROW_NUMBER(), NTILE()...
ntile(n) : 用于将分组数据按照顺序切分成n片,返回当前切片值。注意:n必须为int类型。
row_number() over() : 排名函数,不会重复,适合于生成主键或者不并列排名
rank() over() : 排名函数,有并列名次,名次不连续。如:1,1,3
dense_rank() over() : 排名函数,有并列名次,名次连续。如:1,1,2
注:Frame 定义并非所有窗口函数都适用,比如 ROW_NUMBER()、RANK()、LEAD() 等。这些函数总是应用于整个分区,而非当前 Frame。
窗口函数使用差异
聚合-窗口
测试数据准备:表名:ptf_over_test_202105_1
table_name:tmp_hotel.ptf_over_test_202105_1
cookieid createtime pv
cookie1 2015-04-10 1
cookie1 2015-04-11 5
cookie1 2015-04-12 7
cookie1 2015-04-13 3
cookie1 2015-04-14 2
cookie1 2015-04-15 4
cookie1 2015-04-16 4
查询语句:
SELECT cookieid,
createtime,
pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1
SUM(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行
FROM ptf_over_test_202105_1
查询结果:
排序-窗口
测试数据准备:表名:ptf_over_test_202105_2
table_name:tmp_hotel.ptf_over_test_202105_2
cookieid createtime pv
cookie1 2015-04-10 1
cookie1 2015-04-11 5
cookie1 2015-04-12 7
cookie1 2015-04-13 3
cookie1 2015-04-14 2
cookie1 2015-04-15 4
cookie1 2015-04-16 4
cookie2 2015-04-10 2
cookie2 2015-04-11 3
cookie2 2015-04-12 5
cookie2 2015-04-13 6
cookie2 2015-04-14 3
cookie2 2015-04-15 9
cookie2 2015-04-16 7
NTILE 分片
NTILE(n),用于将分组数据按照顺序切分成n片,返回当前切片值
NTILE不支持ROWS BETWEEN,比如 NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
如果切片不均匀,默认增加第一个切片的分布
查询语句:
SELECT
cookieid,
createtime,
pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1, --分组内将数据分成2片
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2, --分组内将数据分成3片
NTILE(4) OVER(ORDER BY createtime) AS rn3 --将所有数据分成4片
FROM ptf_over_test_202105_2
ORDER BY cookieid,createtime
limit 100
查询结果:
ROW_NUMBER
ROW_NUMBER() –从1开始,按照顺序,生成分组内记录的序列
–比如,按照pv降序排列,生成分组内每天的pv名次
ROW_NUMBER() 的应用场景非常多,再比如,获取分组内排序第一的记录;获取一个session中的第一条refer等。
RANK 和 DENSE_RANK
—RANK() 生成数据项在分组中的排名,排名相等会在名次中留下空位
—DENSE_RANK() 生成数据项在分组中的排名,排名相等会在名次中不会留下空位
查询语句:
SELECT
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM tmp_hotel.ptf_over_test_202105_2
WHERE cookieid = 'cookie1';
-- rn1:15号和16号并列第3,13号排第5
-- rn2:15号和16号并列第3,13号排第4
-- rn3:如果相等,则按记录值排序,生成唯一的次序,如果所有记录值都相等,或许会随机排
查询结果:
取值 - 开窗
测试数据准备:表名:ptf_over_test_202105_3
table_name:tmp_hotel.ptf_over_test_202105_3
cookieid createtime url
cookie1,2015-04-10 10:00:02,url2
cookie1,2015-04-10 10:00:00,url1
cookie1,2015-04-10 10:03:04,1url3
cookie1,2015-04-10 10:50:05,url6
cookie1,2015-04-10 11:00:00,url7
cookie1,2015-04-10 10:10:00,url4
cookie1,2015-04-10 10:50:01,url5
cookie2,2015-04-10 10:00:02,url22
cookie2,2015-04-10 10:00:00,url11
cookie2,2015-04-10 10:03:04,1url33
cookie2,2015-04-10 10:50:05,url66
cookie2,2015-04-10 11:00:00,url77
cookie2,2015-04-10 10:10:00,url44
cookie2,2015-04-10 10:50:01,url55
LAG
LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值
第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
查询语句:
select cokkieid,
createtime,
url,
ROW_NUMBER()OVER(PARTITION BY cokkieid ORDER BY createtime) as rn,
LAG(createtime,1,'1970-01-01 00:00:00')OVER(PARTITION BY cokkieid ORDER BY createtime) as last_1_time,
LAG(createtime,2)OVER(PARTITION BY cokkieid ORDER BY createtime) as last_3_time
FROM ptf_over_test_202105_3
查询结果:
LEAD
与LAG相反
LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值
第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)
查询语句:
select cokkieid,
createtime,
url,
ROW_NUMBER()OVER(PARTITION BY cokkieid ORDER BY createtime) as rn,
LEAD(createtime,1,'1970-01-01 00:00:00')OVER(PARTITION BY cokkieid ORDER BY createtime) as next_1_time,
LEAD(createtime,2)OVER(PARTITION BY cokkieid ORDER BY createtime) as next_3_time
FROM ptf_over_test_202105_3
查询结果:
FIRST_VALUE
取分组内排序后,截止到当前行,第一个值
LAST_VALUE
取分组内排序后,截止到当前行,最后一个值
查询语句:
select cokkieid,
createtime,
url,
ROW_NUMBER()OVER(PARTITION BY cokkieid ORDER BY createtime) as rn,
FIRST_VALUE(url)OVER(PARTITION BY cokkieid ORDER BY createtime) as first1,
LAST_VALUE(url)OVER(PARTITION BY cokkieid ORDER BY createtime) as last1
FROM ptf_over_test_202105_3
查询结果:
下面还有一些不太常用的
CUME_DIST
–CUME_DIST 小于等于当前值的行数/分组内总行数
PERCENT_RANK
–PERCENT_RANK 分组内当前行的RANK值-1/分组内总行数-1
GROUPING SETS,GROUPING__ID,CUBE,ROLLUP
标签:10,over,04,OVER,PARTITION,hive,2015,createtime 来源: https://blog.csdn.net/liuya19921123/article/details/120511937