
Hive Internal Tables, External Tables, Partitioned Tables, and Bucketed Tables

2021-09-28 20:01:14 Author: Internet

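Hive is a data warehouse tool built on Hadoop for extracting, transforming, and loading data; it provides a way to store, query, and analyze large-scale data held in Hadoop. Hive maps structured data files to database tables, offers SQL-style querying, and translates SQL statements into MapReduce jobs for execution. Its main advantage is a low learning cost: simple MapReduce-style statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications. This makes Hive well suited to statistical analysis over a data warehouse.

Hive has four kinds of data models: tables (Table), external tables (External Table), partitions (Partition), and buckets (Bucket).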

I. Hive Internal and External Tables

1. Internal tables

 create table test (name string , age string) location '/input/table_data';

Note: Hive creates internal (managed) tables by default.
The statement above creates the data directory for the test table on HDFS.

 load data inpath '/input/data' into table test ;

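This moves the data under /input/data on HDFS into /input/table_data. Dropping the test table deletes both its data and its metadata, so nothing remains under /input/table_data; and /input/data was already emptied by the load in the previous step.

If no location is specified when the internal table is created, a table directory is created under /user/hive/warehouse/ instead; everything else behaves the same.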
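
One way to check which kind of table Hive created, and where its data lives, is describe formatted. A minimal sketch, assuming the test table from above exists in the current database; the exact output layout varies by Hive version:

```sql
-- Show detailed table metadata. Look for the "Table Type" row
-- (MANAGED_TABLE for an internal table) and the "Location" row.
describe formatted test;

-- Dropping an internal table removes its metadata and its data directory.
drop table if exists test;
```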

2. External tables

create external table etest (name string , age string);

This creates a table directory named etest under /user/hive/warehouse/.

load data inpath '/input/edata' into table etest;

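This moves the data under /input/edata/ on HDFS into /user/hive/warehouse/etest. After the external table is dropped, the data under /user/hive/warehouse/etest is not deleted, but /input/edata/ was already emptied by the load in the previous step: the data has changed location.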
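
An external table is more often pointed at data that already sits in an HDFS directory, so that no load step and no data movement is needed. A sketch, assuming a hypothetical directory /input/shared_data that already holds tab-delimited files with two fields; the table name etest_loc is also only illustrative:

```sql
-- Hypothetical table and path, not part of the original walkthrough.
create external table if not exists etest_loc (name string, age string)
row format delimited fields terminated by '\t'
location '/input/shared_data';

-- The files are read in place.
select * from etest_loc limit 10;

-- Removes only the metadata; the files under /input/shared_data remain.
drop table etest_loc;
```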

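3. Differences between internal and external tables:

(1) An external table created with a location keeps its data at that location rather than in Hive's own warehouse directory, so loading data into it does not move the data into the warehouse. In other words, Hive does not manage the external table's data itself. Internal tables are the opposite.
(2) Dropping an internal table deletes both the table's metadata and its data; dropping an external table deletes only the metadata and leaves the data in place.

(3) Specifying location works the same way for internal and external tables; only the position of the table directory differs. Partitioning also works the same way, except that partition directories appear under the table directory. load data local inpath uploads data from the local file system to HDFS: to the location if one was specified, otherwise to Hive's default warehouse directory. External tables are comparatively safer, allow more flexible data organization, and make it easier to share source data.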
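
An existing table can also be switched between the two kinds through the EXTERNAL table property. A sketch, assuming a table named test exists in the current database; the property value is written in upper case here because some Hive versions appear to be case-sensitive about it, and newer versions with strict managed-table settings may restrict this conversion:

```sql
-- Internal (managed) -> external: a later drop keeps the data files.
alter table test set tblproperties ('EXTERNAL'='TRUE');

-- External -> internal: a later drop deletes the data files as well.
alter table test set tblproperties ('EXTERNAL'='FALSE');
```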

II. Hive Partitioned and Bucketed Tables

1. Partitioned tables

Syntax for creating a partitioned table

create table score(s_id string,c_id string,s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

Create a table with multiple partition columns

create table score2(s_id string,c_id string,s_score int)
partitioned by (year string,month string,day string)
row format delimited fields terminated by '\t';

Load data into the partitioned table

load data local inpath '/bigdata/logs/score.csv' into table score partition(month='201806');
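
For the three-level table score2, every partition column needs a value in the partition clause, and a query that filters on a partition column reads only the matching directories. A sketch reusing the same local file; the partition values are illustrative:

```sql
load data local inpath '/bigdata/logs/score.csv'
into table score2 partition (year='2018', month='06', day='01');

-- Each partition is stored as a nested directory, for example
-- .../score2/year=2018/month=06/day=01/ under the table directory.

-- Only the month=201806 directory of score is scanned here.
select s_id, c_id, s_score from score where month = '201806';
```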

View partitions

show partitions score;

Add a partition

alter table score add partition(month='201805');

Add several partitions at once

alter table score add partition(month='201804') partition(month='201803');

Drop a partition

alter table score drop partition(month='201806');
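
Because score is an internal table, dropping a partition removes that partition's data directory as well as its metadata. A slightly more defensive sketch of the same step, which does not fail when the partition is already gone and then confirms the result:

```sql
-- if exists avoids an error when the partition is missing.
alter table score drop if exists partition (month='201806');

-- The dropped partition should no longer be listed.
show partitions score;
```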

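2. Bucketed tables

Create a bucketed table

-- Needed on Hive 1.x; from Hive 2.0 bucketing is always enforced and this property no longer exists
set hive.enforce.bucketing=true;
set mapreduce.job.reduces=4;  -- 4 reducers, one per bucket

-- Create the bucketed table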
create table user_buckets_demo(id int,name string)
clustered by(id) into 4 buckets
row format delimited fields terminated by '\t';

-- Create a plain (non-bucketed) table
create table user_demo(id int, name string)
row format delimited fields terminated by '\t';

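Prepare the data file user_bucket.txt (the table definitions above expect the two fields on each line to be separated by a tab)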

cd /bigdata/logs/
vi user_bucket.txt
1 anzhulababy1
2 AngleBaby2
3 xiaoxuanfeng1
4 heixuanfeng1
5 jingmaoshu1
6 dongjingshu1
7 dianwanxiaozi1
8 aiguozhe1

Load the data into the plain table user_demo

load data local inpath '/bigdata/logs/user_bucket.txt'
overwrite into table user_demo;
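
On many Hive versions, load data only copies the file into the table directory and does not split rows into buckets, so a bucketed table is usually filled with insert ... select from a plain table. A sketch, assuming user_demo already holds the rows from user_bucket.txt, both tables are in the default database, and the warehouse is at the default /user/hive/warehouse path:

```sql
-- Rows are hashed on id and written by the reducers into 4 bucket files.
insert overwrite table user_buckets_demo
select id, name from user_demo;

-- From the Hive CLI, list the table directory to see the bucket files.
dfs -ls /user/hive/warehouse/user_buckets_demo;
```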

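3. Differences between partitioned and bucketed tables:

(1) In physical form:
A partition is a directory; a bucket is a file.

(2) In the create statement:
Partitions are declared with a partitioned by clause; the partition column is a pseudo-column (it is not stored in the data files), and its type must be given.
Buckets are declared with a clustered by clause; the bucketing column is a real column of the table, and the number of buckets must be given.

(3) In number:
Partitions can keep being added to a table; the number of buckets is fixed once declared and cannot grow.

(4) In purpose:
Partitioning avoids full table scans: a query that filters on the partition column reads only the matching directories, which speeds it up.
Bucketing preserves a bucketed layout of the data (rows are already hash-distributed on the bucketing column).
Sampling and JOINs over bucketed tables can make the resulting MapReduce jobs more efficient.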
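
Bucket-based sampling can be sketched with tablesample, which selects a subset of the buckets instead of the whole table. This assumes user_buckets_demo was populated through insert ... select, so that its files really are bucketed on id:

```sql
-- Return the rows that fall into the first of the 4 buckets.
select * from user_buckets_demo tablesample (bucket 1 out of 4 on id);

-- Sample every second bucket starting from bucket 1 (roughly half the rows).
select * from user_buckets_demo tablesample (bucket 1 out of 2 on id);
```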

Source: https://blog.csdn.net/cc876/article/details/120535810