首页 > 编程语言> > 从PySpark DataFrame中的Python列表中删除元素

从PySpark DataFrame中的Python列表中删除元素

2019-07-17 09:54:48 作者：互联网

我试图从Python列表中删除一个元素：

+---------------+
|        sources|
+---------------+
|           [62]|
|        [7, 32]|
|           [62]|
|   [18, 36, 62]|
|[7, 31, 36, 62]|
|    [7, 32, 62]|

我希望能够从上面列表中的每个列表中删除元素rm.我写了一个函数,可以为列表列表做到这一点：

def asdf(df, rm):
    temp = df
    for n in range(len(df)):
        temp[n] = [x for x in df[n] if x != rm]
    return(temp)

删除rm = 1：

a = [[1,2,3],[1,2,3,4],[1,2,3,4,5]]
In:  asdf(a,1)
Out: [[2, 3], [2, 3, 4], [2, 3, 4, 5]]

但我不能让它适用于DataFrame：

asdfUDF = udf(asdf, ArrayType(IntegerType()))

In: df.withColumn("src_ex", asdfUDF("sources", 32))

Out: Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Integer]) does not exist

期望的行为：

In: df.withColumn("src_ex", asdfUDF("sources", 32))
Out: 

+---------------+
|         src_ex|
+---------------+
|           [62]|
|            [7]|
|           [62]|
|   [18, 36, 62]|
|[7, 31, 36, 62]|
|        [7, 62]|

(除了将上面的新列附加到PySpark DataFrame之外,df)

有什么建议或想法吗？

解决方法:

Spark> = 2.4

你可以使用array_remove：

from pyspark.sql.functions import array_remove

df.withColumn("src_ex", array_remove("sources", 32)).show()

+---------------+---------------+
|        sources|         src_ex|
+---------------+---------------+
|           [62]|           [62]|
|        [7, 32]|            [7]|
|           [62]|           [62]|
|   [18, 36, 62]|   [18, 36, 62]|
|[7, 31, 36, 62]|[7, 31, 36, 62]|
|    [7, 32, 62]|        [7, 62]|
+---------------+---------------+

或过滤：

from pyspark.sql.functions import expr

df.withColumn("src_ex", expr("filter(sources, x -> not(x <=> 32))")).show()

+---------------+---------------+
|        sources|         src_ex|
+---------------+---------------+
|           [62]|           [62]|
|        [7, 32]|            [7]|
|           [62]|           [62]|
|   [18, 36, 62]|   [18, 36, 62]|
|[7, 31, 36, 62]|[7, 31, 36, 62]|
|    [7, 32, 62]|        [7, 62]|
+---------------+---------------+

火花< 2.4 很多事情：
> DataFrame不是列表列表.实际上它甚至不是普通的Python对象,它没有len而且它不是Iterable.
>您看起来像普通数组类型的列.
>您不能引用DataFrame(或UDF中的任何其他分布式数据结构).
>直接传递给UDF调用的每个参数都必须是str(列名)或Column对象.传递文字使用点亮功能.

唯一剩下的就是列表理解：

from pyspark.sql.functions import lit, udf

def drop_from_array_(arr, item):
    return [x for x in arr if x != item]

drop_from_array = udf(drop_from_array_, ArrayType(IntegerType()))

用法示例：

df = sc.parallelize([
    [62], [7, 32], [62], [18, 36, 62], [7, 31, 36, 62], [7, 32, 62]
]).map(lambda x: (x, )).toDF(["sources"])

df.withColumn("src_ex", drop_from_array("sources", lit(32)))

结果：

+---------------+---------------+
|        sources|         src_ex|
+---------------+---------------+
|           [62]|           [62]|
|        [7, 32]|            [7]|
|           [62]|           [62]|
|   [18, 36, 62]|   [18, 36, 62]|
|[7, 31, 36, 62]|[7, 31, 36, 62]|
|    [7, 32, 62]|        [7, 62]|
+---------------+---------------+

标签：python,apache-spark,apache-spark-sql,pyspark,pyspark-sql
来源： https://codeday.me/bug/20190717/1487370.html