其他分享
首页 > 其他分享> > hive学习笔记之六:HiveQL基础

hive学习笔记之六:HiveQL基础

作者:互联网

欢迎访问我的GitHub

https://github.com/zq2599/blog_demos

内容:所有原创文章分类汇总及配套源码,涉及Java、Docker、Kubernetes、DevOPS等;

《hive学习笔记》系列导航

  1. 基本数据类型
  2. 复杂数据类型
  3. 内部表和外部表
  4. 分区表
  5. 分桶
  6. HiveQL基础
  7. 内置函数
  8. Sqoop
  9. 基础UDF
  10. 用户自定义聚合函数(UDAF)
  11. UDTF

本篇概览

准备数据

  1. 本次实战要准备两个表:学生表和住址表,字段都很简单,如下图所示,学生表有个住址ID字段,是住址表里的记录的唯一ID:

在这里插入图片描述
2. 先创建住址表:

create table address (addressid int, province string, city string) 
row format delimited 
fields terminated by ',';
  1. 创建address.txt文件,内容如下:
1,guangdong,guangzhou
2,guangdong,shenzhen
3,shanxi,xian
4,shanxi,hanzhong
6,jiangshu,nanjing
  1. 加载数据到address表:
load data 
local inpath '/home/hadoop/temp/202010/25/address.txt' 
into table address;
  1. 创建学生表,其addressid字段关联了address表的addressid字段:
create table student (name string, age int, addressid int) 
row format delimited 
fields terminated by ',';
  1. 创建student.txt文件,内容如下:
tom,11,1
jerry,12,2
mike,13,3
john,14,4
mary,15,5
  1. 加载数据到student表:
load data 
local inpath '/home/hadoop/temp/202010/25/student.txt' 
into table student;
  1. 至此,本次操作所需数据已准备完毕,如下所示:
hive> select * from address;
OK
1	guangdong	guangzhou
2	guangdong	shenzhen
3	shanxi	xian
4	shanxi	hanzhong
6	jiangshu	nanjing
Time taken: 0.043 seconds, Fetched: 5 row(s)
hive> select * from student;
OK
tom	11	1
jerry	12	2
mike	13	3
john	14	4
mary	15	5
Time taken: 0.068 seconds, Fetched: 5 row(s)

select和where

最普通的带条件查询:

hive> select * from address where city like '%a%';
OK
1	guangdong	guangzhou
3	shanxi	xian
4	shanxi	hanzhong
6	jiangshu	nanjing
Time taken: 0.128 seconds, Fetched: 4 row(s)

group by

  1. province字段分组:
select province, count(*) from address group by province;

该查询会触发MR计算,结果如下:

...
Total MapReduce CPU Time Spent: 1 seconds 910 msec
OK
guangdong	2
jiangshu	1
shanxi	2
Time taken: 17.847 seconds, Fetched: 3 row(s)
  1. 试试嵌套查询,内部是查出city字段带有a字母的记录,然后将这些记录按照province字段分组:
select t.province, count(*) from (
    select * from address where city like '%a%'
) t 
group by t.province;

结果如下:

Total MapReduce CPU Time Spent: 1 seconds 760 msec
OK
guangdong	1
jiangshu	1
shanxi	2
Time taken: 18.036 seconds, Fetched: 3 row(s)

having

select t1.* from (
    select t.province, count(*) as cnt from (
        select * from address where city like '%a%'
    ) t 
group by t.province) t1 
where t1.cnt>1; 

结果如下,可见只有shanxi被显示了:

Total MapReduce CPU Time Spent: 2 seconds 250 msec
OK
shanxi	2
Time taken: 20.067 seconds, Fetched: 1 row(s)
select t.province, count(*) as cnt from (
        select * from address where city like '%a%'
    ) t 
group by t.province having cnt>1;

order by

select t.province, count(*) as cnt from (
        select * from address where city like '%a%'
    ) t 
group by t.province order by cnt;

会触发MR,结果如下:

Total MapReduce CPU Time Spent: 3 seconds 50 msec
OK
jiangshu	1
guangdong	1
shanxi	2
Time taken: 40.315 seconds, Fetched: 3 row(s)

内连接(inner join)

select 
  s.name, s.age, 
  a.province, a.city 
from 
  student s 
  inner join 
  address a 
on 
  s.addressid=a.addressid;

结果如下:

Total MapReduce CPU Time Spent: 1 seconds 20 msec
OK
tom	11	guangdong	guangzhou
jerry	12	guangdong	shenzhen
mike	13	shanxi	xian
john	14	shanxi	hanzhong
Time taken: 17.294 seconds, Fetched: 4 row(s)

自然连接(natural join)

select name, age, province, city from student natural join address;

结果如下,可见不会根据student表的addressid字段值去address查找记录,而是将addrerss的记录全部连接一次:

Total MapReduce CPU Time Spent: 940 msec
OK
tom	11	guangdong	guangzhou
jerry	12	guangdong	guangzhou
mike	13	guangdong	guangzhou
john	14	guangdong	guangzhou
mary	15	guangdong	guangzhou
tom	11	guangdong	shenzhen
jerry	12	guangdong	shenzhen
mike	13	guangdong	shenzhen
john	14	guangdong	shenzhen
mary	15	guangdong	shenzhen
tom	11	shanxi	xian
jerry	12	shanxi	xian
mike	13	shanxi	xian
john	14	shanxi	xian
mary	15	shanxi	xian
tom	11	shanxi	hanzhong
jerry	12	shanxi	hanzhong
mike	13	shanxi	hanzhong
john	14	shanxi	hanzhong
mary	15	shanxi	hanzhong
tom	11	jiangshu	nanjing
jerry	12	jiangshu	nanjing
mike	13	jiangshu	nanjing
john	14	jiangshu	nanjing
mary	15	jiangshu	nanjing
Time taken: 18.525 seconds, Fetched: 25 row(s)

左外连接(left outer join)

select 
  s.name, s.age, s.addressid, 
  a.province, a.city 
from 
  student s 
  left outer join 
  address a 
on 
  s.addressid=a.addressid;

结果如下,可见name=mary的记录,addressid等于5,在address中不存在addressid等于5的记录,因此province和city字段都展示了NULL,而在前面使用inner join时,结果中没有这条记录:

Total MapReduce CPU Time Spent: 950 msec
OK
tom	11	1	guangdong	guangzhou
jerry	12	2	guangdong	shenzhen
mike	13	3	shanxi	xian
john	14	4	shanxi	hanzhong
mary	15	5	NULL	NULL
Time taken: 18.442 seconds, Fetched: 5 row(s)

右外连接(right outer join)

和左连接类似,只不过是以右表为主,语法是right outer join

select 
  s.name, s.age, s.addressid, 
  a.province, a.city 
from 
  student s 
  right outer join 
  address a 
on 
  s.addressid=a.addressid;

结果如下,可见city=nanjing的记录,在student表中没有一条记录与之关联,因此结果中展示了address的字段,而student的字段为NULL:

Total MapReduce CPU Time Spent: 970 msec
OK
tom	11	1	guangdong	guangzhou
jerry	12	2	guangdong	shenzhen
mike	13	3	shanxi	xian
john	14	4	shanxi	hanzhong
NULL	NULL	NULL	jiangshu	nanjing
Time taken: 18.294 seconds, Fetched: 5 row(s)

全外连接(full outer join)

查询结果等于左外连接和右外连接之和,语法是full outer join

select 
  s.name, s.age, s.addressid, 
  a.province, a.city 
from 
  student s 
  full outer join 
  address a 
on 
  s.addressid=a.addressid;

结果如下:

Total MapReduce CPU Time Spent: 2 seconds 630 msec
OK
tom	11	1	guangdong	guangzhou
jerry	12	2	guangdong	shenzhen
mike	13	3	shanxi	xian
john	14	4	shanxi	hanzhong
mary	15	5	NULL	NULL
NULL	NULL	NULL	jiangshu	nanjing
Time taken: 22.189 seconds, Fetched: 6 row(s)

你不孤单,欣宸原创一路相伴

  1. Java系列
  2. Spring系列
  3. Docker系列
  4. kubernetes系列
  5. 数据库+中间件系列
  6. DevOps系列

欢迎关注公众号:程序员欣宸

微信搜索「程序员欣宸」,我是欣宸,期待与您一同畅游Java世界...
https://github.com/zq2599/blog_demos

标签:province,HiveQL,Time,之六,hive,address,guangdong,select,shanxi
来源: https://www.cnblogs.com/bolingcavalry/p/14970640.html