Custom Hive UDF: Custom Group Markers
Purpose: assign group markers to rows based on field matching.
First, add the Maven dependency. I am using Hive 2.3.5; change the version to match your own environment.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.atweimiao.udf</groupId>
    <artifactId>selfgroup</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>2.3.5</version>
        </dependency>
    </dependencies>
</project>
Then create the implementation class:
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
/**
 * A custom UDF must extend the GenericUDF class.
 */
public class SelfGroup extends GenericUDF {
    private transient ObjectInspectorConverters.Converter[] converters;
    /**
     * @param arguments ObjectInspectors describing the input argument types
     * @return an ObjectInspector describing the return type
     * @throws UDFArgumentException
     */
    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 2) {
            throw new UDFArgumentException("The function SELF_GROUP accepts exactly two arguments");
        }
        for (int i = 0; i < arguments.length; i++) {
            ObjectInspector.Category category = arguments[i].getCategory();
            if (category != ObjectInspector.Category.PRIMITIVE) {
                throw new UDFArgumentTypeException(i, "The "
                        + GenericUDFUtils.getOrdinal(i + 1)
                        + " argument of function SELF_GROUP is expected to be a "
                        + ObjectInspector.Category.PRIMITIVE.toString().toLowerCase() + " type, but "
                        + category.toString().toLowerCase() + " is found");
            }
        }
        // Convert every argument to a writable string so evaluate() can compare them uniformly
        converters = new ObjectInspectorConverters.Converter[arguments.length];
        for (int i = 0; i < arguments.length; i++) {
            converters[i] = ObjectInspectorConverters.getConverter(arguments[i],
                    PrimitiveObjectInspectorFactory.writableStringObjectInspector);
        }
        // The function returns an int: the group marker
        return PrimitiveObjectInspectorFactory.writableIntObjectInspector;
    }
    // Running group counter; note this state lives per task, so the resulting
    // group markers depend on the order in which rows reach this instance.
    private int flag = 0;
    private final IntWritable intWritable = new IntWritable(0);

    /**
     * Core processing method of the function.
     * @param arguments the arguments passed into the function
     * @return the function's return value
     * @throws HiveException
     */
    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        if (arguments[0].get() == null || arguments[1].get() == null) {
            return null;
        }
        // Convert both arguments to Text via the converters set up in initialize()
        Text fir = (Text) converters[0].convert(arguments[0].get());
        Text sen = (Text) converters[1].convert(arguments[1].get());
        // Each row where the two fields match starts a new group
        if (fir.equals(sen)) {
            flag = flag + 1;
        }
        intWritable.set(flag);
        return intWritable;
    }
    @Override
    public String getDisplayString(String[] children) {
        return "self_group(" + String.join(", ", children) + ")";
    }
}
Once the class is done, package it into a jar (e.g. with mvn package) and put it in the lib directory under your Hive home. To use it, add the jar in your session and create a temporary function.
The steps are as follows. First add the jar:
add jar /opt/module/hive/datas/selfgroup.jar;
Then create a temporary function, with the format: create temporary function <function name> as "<fully qualified class name>" (the class name must match your actual package and class), for example:
create temporary function my_group as "com.test.hive.SelfGroup";
After that, the function is ready to use; the effect looks like the following.
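As a rough sketch of the behavior (the table name, columns, and data below are hypothetical, not from the original post): the counter starts at 0 and increments on every row where the two arguments are equal, so each matching row starts a new group marker.

select id, col1, col2, my_group(col1, col2) as group_id
from test_table;
-- Rows before the first col1 = col2 match get group_id 0; each matching
-- row bumps the counter for itself and all following rows. Because the
-- counter is per-task state, the result is only predictable when the
-- rows are processed in a single task and in a deterministic order.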
If you are running in a cloud environment rather than an on-premises cluster, you need to upload the jar to the corresponding HDFS directory and reference it from there before the function will work.
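A minimal sketch of that variant, assuming a hypothetical HDFS path (adjust it to your own directory): add jar accepts HDFS URIs, and since Hive 0.13 create function can also reference a jar on HDFS directly.

add jar hdfs:///user/hive/jars/selfgroup.jar;

-- or register a permanent function backed by the jar on HDFS:
create function my_group as "com.test.hive.SelfGroup"
using jar "hdfs:///user/hive/jars/selfgroup.jar";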