python - Convert a PySpark DataFrame column type to string and remove the square brackets

2019-10-26 11:56:29 Author: Internet

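I need to convert PySpark DataFrame columns from array type to string and also remove the square brackets. Here is the DataFrame's schema. The columns that need to be processed are CurrencyCode and TicketAmount.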

>>> plan_queryDF.printSchema()

root
 |-- event_type: string (nullable = true)
 |-- publishedDate: string (nullable = true)
 |-- plannedCustomerChoiceID: string (nullable = true)
 |-- assortedCustomerChoiceID: string (nullable = true)
 |-- CurrencyCode: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- TicketAmount: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- currentPlan: boolean (nullable = true)
 |-- originalPlan: boolean (nullable = true)
 |-- globalId: string (nullable = true)
 |-- PlanJsonData: string (nullable = true)

Sample data from the DataFrame:

+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|          event_type|       publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan|    globalId|        PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...|   087d1ff1-5f3a-496...|    2539cc4a-37e5-4f3...|       [GBP]|         [0]|      false|       false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...|   0a1af217-d1e8-4ab...|    61bc5fda-0160-484...|       [CNY]|       [329]|       true|       false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...|   1028b477-f93e-47f...|    c6d5b761-94f2-454...|       [JPY]|      [3400]|       true|       false|000576058003|{"httpStatus":200...|

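How can I do this? Currently I cast the columns to string and then strip the square brackets with regexp_replace, but this approach fails when I process a large amount of data.

Is there another way to do this?

This is what I want: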

+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|          event_type|       publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan|    globalId|        PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...|   087d1ff1-5f3a-496...|    2539cc4a-37e5-4f3...|         GBP|           0|      false|       false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...|   0a1af217-d1e8-4ab...|    61bc5fda-0160-484...|         CNY|         329|       true|       false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...|   1028b477-f93e-47f...|    c6d5b761-94f2-454...|         JPY|        3400|       true|       false|000576058003|{"httpStatus":200...|

Solution:

You can try getItem(0), which extracts the first element of each array:

# withColumn returns a new DataFrame, so assign the result to keep the change
df = df \
    .withColumn("CurrencyCode", df["CurrencyCode"].getItem(0).cast("string")) \
    .withColumn("TicketAmount", df["TicketAmount"].getItem(0).cast("string"))

The final cast to string is optional, because the schema shows that the array elements are already strings.

Tags: pyspark, apache-spark-sql, python
Source: https://codeday.me/bug/20191026/1936463.html