
Partition (Partitioning)

2022-07-22 21:01:13 Author: Internet

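Default partitioning

By default, the partition is the key's hashCode modulo the number of ReduceTasks. Hadoop's HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, so the result is never negative. The user cannot control which key ends up in which partition.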
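The default rule can be tried out without a cluster. The following is a minimal sketch in plain Java that applies the same formula as HashPartitioner; the class name DefaultPartitionSketch and the method partitionFor are made up for illustration. Note that it hashes java.lang.String keys, whose hashCode differs from that of Hadoop's Text, so the partition numbers it prints will not match a real job.

```java
final class DefaultPartitionSketch {

    // Same formula as HashPartitioner: clear the sign bit so the value is
    // non-negative, then take the remainder by the number of ReduceTasks.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 5; // assumed value, for illustration only
        String[] keys = {"13612345678", "13712345678", "13812345678"};
        for (String key : keys) {
            // String.hashCode() is used here, not Text.hashCode().
            System.out.println(key + " -> partition " + partitionFor(key, numReduceTasks));
        }
    }
}
```

Whatever the key, the result always lies between 0 and numReduceTasks - 1, but which partition a given key lands in depends only on its hash, which is why a custom Partitioner is needed to group keys by business rules.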

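Custom partitioning

(1) Write a class that extends Partitioner and overrides the getPartition() method.

(2) Register the custom Partitioner in the Job driver.

(3) After defining a custom Partitioner, set the number of ReduceTasks to match the Partitioner's logic.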

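Partitioning summary

(1) If the number of ReduceTasks is greater than the number of partitions getPartition can return, the surplus ReduceTasks each produce an empty output file, which wastes resources.

(2) If 1 < number of ReduceTasks < number of partitions getPartition can return, some records have no partition to go to and the job fails with an exception (an IOException reporting an illegal partition).

(3) If the number of ReduceTasks is 1, then no matter how many partitions the MapTask side would produce, everything is handed to that single ReduceTask, and only one result file, part-r-00000, is produced.

(4) Partition numbers must start at zero and increase by one, with no gaps.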
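The rules above can be condensed into a small model. This is a simplified sketch of the behaviour described in the summary, not Hadoop's actual code; the class PartitionRules and the method route are hypothetical names, and map-only jobs (zero ReduceTasks) are left out.

```java
import java.io.IOException;

final class PartitionRules {

    // Returns the ReduceTask index that a record would be sent to, given the
    // partition number chosen by a custom Partitioner.
    static int route(int customPartition, int numReduceTasks) throws IOException {
        if (numReduceTasks == 1) {
            // Rule (3): a single ReduceTask receives everything.
            return 0;
        }
        if (customPartition < 0 || customPartition >= numReduceTasks) {
            // Rule (2): no ReduceTask exists for this partition number.
            throw new IOException("Illegal partition (" + customPartition + ")");
        }
        // Rules (1) and (4): valid numbers map directly onto ReduceTask indexes;
        // indexes that are never returned end up with empty output files.
        return customPartition;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(route(4, 1)); // prints 0
        System.out.println(route(4, 5)); // prints 4
        System.out.println(route(4, 3)); // throws IOException
    }
}
```

The model also shows why rule (4) matters: the partition number is used directly as a ReduceTask index, so a numbering with gaps leaves some ReduceTasks idle and can push other numbers out of range.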

Example:

Custom partitioner

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// FlowBean is the job's own value class; it is assumed to live in the same package.
public class ProvincePartitioner extends Partitioner<Text, FlowBean> {
    /**
     * @param text          the key to be partitioned (a phone number).
     * @param flowBean      the entry value.
     * @param numPartitions the total number of partitions.
     * @return the partition number for the <code>key</code>.
     */
    @Override
    public int getPartition(Text text, FlowBean flowBean, int numPartitions) {

        // The key is the phone number.
        String phone = text.toString();

        // Keys shorter than three characters have no usable prefix; send them
        // to the catch-all partition instead of letting substring() throw.
        if (phone.length() < 3) {
            return 4;
        }

        // Take the first three digits.
        String prefix = phone.substring(0, 3);

        int partition;
        if ("136".equals(prefix)) {
            partition = 0;
        } else if ("137".equals(prefix)) {
            partition = 1;
        } else if ("138".equals(prefix)) {
            partition = 2;
        } else if ("139".equals(prefix)) {
            partition = 3;
        } else {
            // Every other prefix shares the last partition.
            partition = 4;
        }
        return partition;
    }
}
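The prefix mapping can be checked without Hadoop by lifting it into a plain static method. The sketch below is a hypothetical stand-in, not part of the job: PrefixPartitionSketch and partitionFor are illustrative names, it takes a String instead of Text, it sends keys shorter than three characters to the catch-all partition (an assumption made here), and Map.of requires Java 9 or later.

```java
import java.util.Map;

final class PrefixPartitionSketch {

    // Prefix-to-partition table matching the if/else chain in ProvincePartitioner.
    private static final Map<String, Integer> PREFIX_TO_PARTITION =
            Map.of("136", 0, "137", 1, "138", 2, "139", 3);

    // Catch-all partition for every other prefix.
    private static final int OTHER = 4;

    static int partitionFor(String phone) {
        if (phone.length() < 3) {
            // Assumption: keys too short to have a prefix go to the catch-all.
            return OTHER;
        }
        return PREFIX_TO_PARTITION.getOrDefault(phone.substring(0, 3), OTHER);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("13612345678")); // prints 0
        System.out.println(partitionFor("13912345678")); // prints 3
        System.out.println(partitionFor("15012345678")); // prints 4
    }
}
```

The mapping yields five distinct partition numbers, 0 through 4, which is the number the ReduceTask count has to match.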

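Setting the number of partitions

        // Register the custom partitioner and use five ReduceTasks, one per partition (0 to 4)
        job.setPartitionerClass(ProvincePartitioner.class);
        job.setNumReduceTasks(5);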

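Result

With five ReduceTasks, the job writes five output files, part-r-00000 through part-r-00004, one per partition.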

Source: https://www.cnblogs.com/xiao-wang-tong-xue/p/16507968.html