Spark-UDF

2020-02-01 17:56:29 Author: Internet

User Defined Functions - Python

This document contains examples of creating a UDF in Python and registering it for use in Spark SQL.

What is a PySpark UDF

A PySpark UDF is a user-defined function that is executed in the Python runtime.

Two types

Row UDF

lambda x: x+1

lambda date1, date2: (date1 - date2).days

Group UDF

lambda values: np.mean(np.array(values))

A row UDF is similar to what you do in Spark when you pass a function to the map operator: the function is applied to each row independently.

A group UDF is similar to a row UDF, except that it operates on a list of values. It is comparable to a groupBy followed by a map operator in Scala.
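The difference between the two types can be sketched in plain Python, without Spark. This is only an illustration of the semantics described above, not how Spark executes UDFs. The sample rows and the names add_one, group_mean, row_result and group_result are made up for the example, and statistics.mean stands in for np.mean so that the sketch needs only the standard library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows: (group key, value).
rows = [("a", 1), ("a", 3), ("b", 10)]

# Row UDF: called once per row, like a map over the rows.
add_one = lambda x: x + 1
row_result = [(key, add_one(value)) for key, value in rows]

# Group UDF: collect the values per key (the groupBy step), then call
# the function once per group on the whole list of values (the map step).
group_mean = lambda values: mean(values)
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)
group_result = {key: group_mean(values) for key, values in groups.items()}

print(row_result)
print(group_result)
```

The row function sees one value at a time, while the group function sees every value that shares a key.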

Define a function, register it as a UDF, and set the UDF's return type.

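from pyspark.sql.types import LongType

def squared(s):
  return s * s

# Register the function under the name used in SQL, with an explicit return type.
sqlContext.udf.register("squaredWithPython", squared, LongType())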

Call the UDF in Spark SQL

spark.range(1, 20).createOrReplaceTempView("test")
result = spark.sql("select id, squaredWithPython(id) as id_squared from test")
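As a rough check of what this query should return, the same computation can be written in plain Python. This is a sketch that assumes the UDF behaves exactly like the squared function defined earlier; the name expected is made up for the example. range(1, 20) yields the ids 1 through 19, because the upper bound is exclusive in both Python and Spark.

```python
def squared(s):
    return s * s

# One (id, id_squared) pair per id, mirroring the two columns of the query.
expected = [(i, squared(i)) for i in range(1, 20)]

for row in expected[:3]:
    print(row)
```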

Use UDF with DataFrames

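%sql select * FROM test

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Wrap the Python function as a DataFrame UDF with an explicit return type.
squared_udf = udf(squared, LongType())
df = spark.table("test")
# display() is available in Databricks notebooks; elsewhere, call .show() on the DataFrame.
display(df.select("id", squared_udf("id").alias("id_squared")))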

User-Defined Aggregate Functions - Scala

Source: https://www.cnblogs.com/liuzhongrong/p/12249251.html