Java - Spark SQL: nested class causes a Parquet error


I can't seem to write a JavaRDD<T> to Parquet, where T is a Person class. I define it as:

public class Person implements Serializable
{
    private static final long serialVersionUID = 1L;
    private String name;
    private String age;
    private Address address;
    // ... getters and setters
}

And Address:

public class Address implements Serializable
{
    private static final long serialVersionUID = 1L;
    private String City;
    private String Block;
    // ... getters and setters
}

I then create a JavaRDD like this:

JavaRDD<Person> people = sc.textFile("/user/johndoe/spark/data/people.txt").map(new Function<String, Person>()
    {
        public Person call(String line)
        {
            String[] parts = line.split(",");
            Person person = new Person();
            person.setName(parts[0]);
            person.setAge("2");                              // age hard-coded for the test
            Address address = new Address("HomeAdd","141H"); // same address for every record
            person.setAddress(address);
            return person;
        }
    });

Note that I set the same address on every record by hand, so this is essentially a nested RDD. When I try to save it as a Parquet file:

DataFrame dfschemaPeople = sqlContext.createDataFrame(people, Person.class);
dfschemaPeople.write().parquet("/user/johndoe/spark/data/out/people.parquet");    

The Address class is:

import java.io.Serializable;
public class Address implements Serializable
{
    public Address(String city, String block)
    {
        super();
        City = city;
        Block = block;
    }
    private static final long serialVersionUID = 1L;
    private String City;
    private String Block;
    //Omitting getters and setters
}

I get this error:

Caused by: java.lang.ClassCastException: com.test.schema.Address cannot be cast to org.apache.spark.sql.Row

I am running spark-1.4.1.

> Is this a known bug?
> If I do the same thing by importing a nested JSON file with the same structure, it saves to Parquet just fine (see the sketch after this list).
> Even if I create a sub-DataFrame such as DataFrame dfSubset = sqlContext.sql("SELECT address.city FROM PersonTable"); I still get the same error.
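For reference, this is roughly what the working JSON route looks like. A minimal sketch, assuming a people.json file with one record per line in the same nested shape (the path and file contents are hypothetical):

// people.json (hypothetical), one JSON object per line:
// {"name":"a","age":"2","address":{"city":"HomeAdd","block":"141H"}}
DataFrame jsonPeople = sqlContext.read().json("/user/johndoe/spark/data/people.json");
jsonPeople.write().parquet("/user/johndoe/spark/data/out/peopleJson.parquet");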

So what gives? How can I read a complex data structure from a text file and save it as Parquet? It seems I can't.

Solution:

You are using the Java API, which has this limitation.

From the Spark documentation:
http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#interoperating-with-rdds

Spark SQL supports automatically converting an RDD of JavaBeans into a DataFrame. The BeanInfo, obtained using reflection, defines the schema of the table. Currently, Spark SQL does not support JavaBeans that contain nested or contain complex types such as Lists or Arrays. You can create a JavaBean by creating a class that implements Serializable and has getters and setters for all of its fields.
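If you need to stay in Java, one way to live within that limitation (just a sketch, not from the original answer) is to flatten the nested bean into a single-level JavaBean before calling createDataFrame. FlatPerson below is a hypothetical helper class with the usual getters and setters:

public class FlatPerson implements Serializable
{
    private static final long serialVersionUID = 1L;
    private String name;
    private String age;
    private String city;  // hoisted from Address.City
    private String block; // hoisted from Address.Block
    // getters and setters for all fields
}

// Flatten each Person, then let reflection infer the (now flat) schema
JavaRDD<FlatPerson> flatPeople = people.map(new Function<Person, FlatPerson>()
    {
        public FlatPerson call(Person p)
        {
            FlatPerson fp = new FlatPerson();
            fp.setName(p.getName());
            fp.setAge(p.getAge());
            fp.setCity(p.getAddress().getCity());
            fp.setBlock(p.getAddress().getBlock());
            return fp;
        }
    });

DataFrame flatDf = sqlContext.createDataFrame(flatPeople, FlatPerson.class);
flatDf.write().parquet("/user/johndoe/spark/data/out/peopleFlat.parquet");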
Using Scala case classes, on the other hand, will work (updated to actually write the Parquet file):

import org.apache.spark.{SparkConf, SparkContext}

// Nested case classes: the Scala API's reflection-based schema
// inference handles these, unlike nested JavaBeans.
case class Address(city: String, block: String)
case class Person(name: String, age: String, address: Address)

object Test2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    val people = sc.parallelize(List(
      Person("a", "b", Address("a", "b")),
      Person("c", "d", Address("c", "d"))))

    val df = sqlContext.createDataFrame(people)
    df.write.mode("overwrite").parquet("/tmp/people.parquet")
  }
}
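The resulting Parquet file is language-agnostic, so it can be read back from the Java side to verify the schema (a quick check, assuming the same sqlContext as in the question):

DataFrame readBack = sqlContext.read().parquet("/tmp/people.parquet");
readBack.printSchema(); // address should appear as struct<city:string,block:string>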

Tags: apache-spark, apache-spark-sql, java, parquet
Source: https://codeday.me/bug/20191119/2032302.html