首页 > 其他分享> > Hive基础(三十九)：Hive DML (三) 分桶及抽样查询/其他常用查询函数

Hive基础(三十九)：Hive DML (三) 分桶及抽样查询/其他常用查询函数

2021-06-12 11:32:24 作者：互联网

6 分桶及抽样查询

6.1 分桶表数据存储 分区提供一个隔离数据和优化查询的便利方式。不过，并非所有的数据集都可形成合理的分区。对于一张表或者分区，Hive 可以进一步组织成桶，也就是更为细粒度的数据范围划分。 分桶是将数据集分解成更容易管理的若干部分的另一个技术。分区针对的是数据的存储路径；分桶针对的是数据文件。 1．先创建分桶表，通过直接导入数据文件的方式（1）数据准备

1001 ss1
1002 ss2
1003 ss3
1004 ss4
1005 ss5
1006 ss6
1007 ss7
1008 ss8
1009 ss9
1010 ss10
1011 ss11
1012 ss12
1013 ss13
1014 ss14
1015 ss15
1016 ss16

（2）创建分桶表
create table stu_buck(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';
（3）查看表结构
hive (default)> desc formatted stu_buck;
Num Buckets: 4 
（4）导入数据到分桶表中
hive (default)> load data local inpath '/opt/module/datas/student.txt' into table
stu_buck;

（5）查看创建的分桶表中是否分成 4 个桶，如图 6-7 所示

发现并没有分成 4 个桶。是什么原因呢？ 2．创建分桶表时，数据通过子查询的方式导入

（1）先建一个普通的 stu 表
create table stu(id int, name string)
row format delimited fields terminated by '\t';
（2）向普通的 stu 表中导入数据
load data local inpath '/opt/module/datas/student.txt' into table stu;
（3）清空 stu_buck 表中数据
truncate table stu_buck;
select * from stu_buck;
（4）导入数据到分桶表，通过子查询的方式
insert into table stu_buck
select id, name from stu;

（5）发现还是只有一个分桶，如图 6-8 所示

（6）需要设置一个属性
hive (default)> set hive.enforce.bucketing=true;
hive (default)> set mapreduce.job.reduces=-1;
hive (default)> insert into table stu_buck
select id, name from stu;

（7）查询分桶的数据

select * from stu_buck;

分桶规则：根据结果可知：Hive 的分桶采用对分桶字段的值进行哈希，然后除以桶的个数求余的方式决定该条记录存放在哪个桶当中 6.2 分桶抽样查询 对于非常大的数据集，有时用户需要使用的是一个具有代表性的查询结果而不是全部结果。Hive 可以通过对表进行抽样来满足这个需求。查询表 stu_buck 中的数据。

hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);

注：tablesample 是抽样语句，语法：TABLESAMPLE(BUCKET x OUT OF y) 。 y 必须是 table 总 bucket 数的倍数或者因子。hive 根据 y 的大小，决定抽样的比例。例如，table 总共分了 4 份，当 y=2 时，抽取(4/2=)2 个 bucket 的数据，当 y=8 时，抽取(4/8=)1/2 个 bucket 的数据。 x 表示从哪个 bucket 开始抽取，如果需要取多个分区，以后的分区号为当前分区号加上y。例如，table 总 bucket 数为 4，tablesample(bucket 1 out of 2)，表示总共抽取（4/2=）2 个bucket 的数据，抽取第 1(x)个和第 3(x+y)个 bucket 的数据。注意：x 的值必须小于等于 y 的值，否则 FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck

7 其他常用查询函数

7.1 空字段赋值 1. 函数说明 NVL：给值为 NULL 的数据赋值，它的格式是 NVL( value，default_value)。它的功能是如果 value 为 NULL，则 NVL 函数返回 default_value 的值，否则返回 value 的值，如果两个参数都为 NULL ，则返回 NULL。 2. 数据准备：采用员工表 3. 查询：如果员工的 comm 为 NULL，则用-1 代替

hive (default)> select comm,nvl(comm, -1) from emp;
OK
comm _c1
NULL -1.0
300.0 300.0
500.0 500.0
NULL -1.0
1400.0 1400.0
NULL -1.0
NULL -1.0
NULL -1.0
NULL -1.0
0.0 0.0
NULL -1.0
NULL -1.0
NULL -1.0
NULL -1.0

4. 查询：如果员工的 comm 为 NULL，则用领导 id 代替

hive (default)> select comm, nvl(comm,mgr) from emp;
OK
comm _c1
NULL 7902.0
300.0 300.0
500.0 500.0
NULL 7839.0
1400.0 1400.0
NULL 7839.0
NULL 7839.0
NULL 7566.0
NULL NULL
0.0 0.0
NULL 7788.0
NULL 7698.0
NULL 7566.0
NULL 7782.0

7.2 CASE WHEN 1. 数据准备

2．需求求出不同部门男女各多少人。结果如下：

A 2 1
B 1 2

3．创建本地 emp_sex.txt，导入数据

[atguigu@hadoop102 datas]$ vi emp_sex.txt
悟空 A 男
大海 A 男
宋宋 B 男
凤姐 A 女
婷姐 B 女
婷婷 B 女

4．创建 hive 表并导入数据

create table emp_sex(
name string, 
dept_id string, 
sex string) 
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/datas/emp_sex.txt' into table emp_sex;

5．按需求查询数据

select 
 dept_id,
 sum(case sex when '男' then 1 else 0 end) male_count,
 sum(case sex when '女' then 1 else 0 end) female_count
from
 emp_sex
group by
 dept_id;

7.3 行转列 1．相关函数说明 CONCAT(string A/col, string B/col…)：返回输入字符串连接后的结果，支持任意个输入字符串; CONCAT_WS(separator, str1, str2,...)：它是一个特殊形式的 CONCAT()。第一个参数剩余参数间的分隔符。分隔符可以是与剩余参数一样的字符串。如果分隔符是 NULL，返回值也将为 NULL。这个函数会跳过分隔符参数后的任何 NULL 和空字符串。分隔符将被加到被连接的字符串之间; COLLECT_SET(col)：函数只接受基本数据类型，它的主要作用是将某字段的值进行去重汇总，产生 array 类型字段。 2．数据准备

3．需求把星座和血型一样的人归类到一起。结果如下：

射手座,A 大海|凤姐
白羊座,A 孙悟空|猪八戒
白羊座,B 宋宋

4．创建本地 constellation.txt，导入数据

[atguigu@hadoop102 datas]$ vi constellation.txt
孙悟空 白羊座 A
大海 射手座 A
宋宋 白羊座 B
猪八戒 白羊座 A
凤姐 射手座 A

5．创建 hive 表并导入数据

create table person_info(
name string, 
constellation string, 
blood_type string) 
row format delimited fields terminated by "\t";
load data local inpath "/opt/module/datas/constellation.txt" into table person_info;

6．按需求查询数据

select
 t1.base,
 concat_ws('|', collect_set(t1.name)) name
from
 (select
 name,
 concat(constellation, ",", blood_type) base
 from
 person_info) t1
group by
 t1.base;

7.4 列转行

1．函数说明 EXPLODE(col)：将 hive 一列中复杂的 array 或者 map 结构拆分成多行。 LATERAL VIEW 用法：LATERAL VIEW udtf(expression) tableAlias AS columnAlias 解释：用于和 split, explode 等 UDTF 一起使用，它能够将一列数据拆成多行数据，在此 基础上可以对拆分后的数据进行聚合。 2．数据准备

3．需求将电影分类中的数组数据展开。结果如下：

《疑犯追踪》 悬疑
《疑犯追踪》 动作
《疑犯追踪》 科幻
《疑犯追踪》 剧情
《Lie to me》 悬疑
《Lie to me》 警匪
《Lie to me》 动作
《Lie to me》 心理
《Lie to me》 剧情
《战狼 2》 战争
《战狼 2》 动作
《战狼 2》 灾难

4．创建本地 movie.txt，导入数据

[atguigu@hadoop102 datas]$ vi movie.txt
《疑犯追踪》 悬疑,动作,科幻,剧情
《Lie to me》 悬疑,警匪,动作,心理,剧情
《战狼 2》 战争,动作,灾难

5．创建 hive 表并导入数据

create table movie_info(
 movie string, 
 category array<string>) 
row format delimited fields terminated by "\t"
collection items terminated by ",";
load data local inpath "/opt/module/datas/movie.txt" into table movie_info;

6．按需求查询数据

select
 movie,
 category_name
from 
 movie_info lateral view explode(category) table_tmp as category_name;

7.5 窗口函数（开窗函数） 1．相关函数说明 OVER()：指定分析函数工作的数据窗口大小，这个数据窗口大小可能会随着行的变而变化。 CURRENT ROW：当前行 n PRECEDING：往前 n 行数据 n FOLLOWING：往后 n 行数据 UNBOUNDED：起点，UNBOUNDED PRECEDING 表示从前面的起点， UNBOUNDED FOLLOWING 表示到后面的终点 LAG(col,n,default_val)：往前第 n 行数据 LEAD(col,n, default_val)：往后第 n 行数据 NTILE(n)：把有序分区中的行分发到指定数据的组中，各个组有编号，编号从 1 开始， 对于每一行，NTILE 返回此行所属的组的编号。注意：n 必须为 int 类型。 2．数据准备：name，orderdate，cost

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

3．需求（1）查询在 2017 年 4 月份购买过的顾客及总人数（2）查询顾客的购买明细及月购买总额（3）上述的场景,要将 cost 按照日期进行累加（4）查询每个顾客上次的购买时间（5）查询前 20%时间的订单信息 4．创建本地 business.txt，导入数据

[atguigu@hadoop102 datas]$ vi business.txt

5．创建 hive 表并导入数据

create table business(
name string, 
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath "/opt/module/datas/business.txt" into table business;

6．按需求查询数据（1）查询在 2017 年 4 月份购买过的顾客及总人数

select name,count(*) over () 
from business
where substring(orderdate,1,7) = '2017-04' 
group by name;

（2）查询顾客的购买明细及月购买总额

select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from
business;

（3）上述的场景, 将每个顾客的 cost 按照日期进行累加

select name,orderdate,cost, 
sum(cost) over() as sample1,--所有行相加
sum(cost) over(partition by name) as sample2,--按 name 分组，组内数据相加
sum(cost) over(partition by name order by orderdate) as sample3,--按 name 分组，组内数
据累加
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING 
and current row ) as sample4 ,--和 sample3 一样,由起点到当前行的聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 
current row) as sample5, --当前行和前面一行做聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 
FOLLOWING ) as sample6,--当前行和前边一行及后面一行
sum(cost) over(partition by name order by orderdate rows between current row and 
UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行
from business;

rows 必须跟在 Order by 子句之后，对排序的结果进行限制，使用固定的行数来限制分区中的数据行数量（4）查看顾客上次的购买时间

select name,orderdate,cost, 
lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1, 
lag(orderdate,2) over (partition by name order by orderdate) as time2 
from business;

（5）查询前 20%时间的订单信息

select * from (
 select name,orderdate,cost, ntile(5) over(order by orderdate) sorted
 from business
) t
where sorted = 1;

7.5 Rank 1．函数说明 RANK() 排序相同时会重复，总数不会变 DENSE_RANK() 排序相同时会重复，总数会减少 ROW_NUMBER() 会根据顺序计算 2．数据准备

3．需求计算每门学科成绩排名。 4．创建本地 score.txt，导入数据

create table score(
name string,
subject string, 
score int) 
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/datas/score.txt' into table score;

6．按需求查询数据

select name,
subject,
score,
rank() over(partition by subject order by score desc) rp,
dense_rank() over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rmp
from score;
name subject score rp drp rmp
孙悟空 数学 95 1 1 1
宋宋 数学 86 2 2 2
婷婷 数学 85 3 3 3
大海 数学 56 4 4 4
宋宋 英语 84 1 1 1
大海 英语 84 1 1 2
婷婷 英语 78 3 2 3
孙悟空 英语 68 4 3 4
大海 语文 94 1 1 1
孙悟空 语文 87 2 2 2
婷婷 语文 65 3 3 3
宋宋 语文 64 4 4 4

标签：name,分桶,Hive,查询,stu,table,NULL,数据,select
来源： https://www.cnblogs.com/qiu-hua/p/14877910.html