Joining two DataFrames in Spark with Java
First of all, thank you for taking the time to read my question.
My problem is the following: in Spark with Java, I have loaded the data of two CSV files into two DataFrames.
These DataFrames contain the following information.
DataFrame airport
Id | Name    | City
-----------------------
1  | Barajas | Madrid
DataFrame airport_city_state
City   | state
----------------
Madrid | España
I want to join these two DataFrames so that the result looks like this:
DataFrame result
Id | Name    | City   | state
------------------------------
1  | Barajas | Madrid | España
where dfairport.city = dfairport_city_state.city
However, I can't work out the correct syntax to perform the join. Here is some of the code where I create my variables:
// Load the CSVs; you have to specify that they have a header and which delimiter is used
Dataset<Row> dfairport = Load.Csv(sqlContext, data_airport);
Dataset<Row> dfairport_city_state = Load.Csv(sqlContext, data_airport_city_state);

// Rename the columns of the CSV DataFrames to match the columns in the database
// Note: Datasets are immutable, so the result of withColumnRenamed must be reassigned
dfairport = dfairport
    .withColumnRenamed("leg_key", "id")
    .withColumnRenamed("leg_name", "name")
    .withColumnRenamed("leg_city", "city");

dfairport_city_state = dfairport_city_state
    .withColumnRenamed("city", "ciudad")
    .withColumnRenamed("state", "estado");
Solution:
You can join the two DataFrames with the join method. In the Java API you pass a join expression built from the columns, for example:
Dataset<Row> dfairport = Load.Csv(sqlContext, data_airport);
Dataset<Row> dfairport_city_state = Load.Csv(sqlContext, data_airport_city_state);

Dataset<Row> joined = dfairport.join(dfairport_city_state, dfairport.col("City").equalTo(dfairport_city_state.col("City")));
There is also an overloaded version that lets you specify the join type as the third argument, for example:
Dataset<Row> joined = dfairport.join(dfairport_city_state, dfairport.col("City").equalTo(dfairport_city_state.col("City")), "left_outer");
Here is more about joins.
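To make the semantics concrete, here is a minimal plain-Java sketch (no Spark involved) of what the inner equi-join on City computes for the sample rows above. The class and method names are made up for illustration; this only demonstrates what the join produces, it is not Spark code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a plain-Java hash join showing what joining
// the airport and airport_city_state tables on City produces.
public class JoinSketch {

    // Inner equi-join: airport rows are [id, name, city],
    // airportCityState rows are [city, state].
    public static List<String[]> join(List<String[]> airport,
                                      List<String[]> airportCityState) {
        // Build side of the hash join: lookup from city -> state
        Map<String, String> stateByCity = new HashMap<>();
        for (String[] row : airportCityState) {
            stateByCity.put(row[0], row[1]);
        }

        // Probe side: keep only airport rows whose city has a match
        List<String[]> joined = new ArrayList<>();
        for (String[] row : airport) {
            String state = stateByCity.get(row[2]);
            if (state != null) {
                joined.add(new String[]{row[0], row[1], row[2], state});
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> result = join(
                List.of(new String[]{"1", "Barajas", "Madrid"}),
                List.of(new String[]{"Madrid", "España"}));
        for (String[] row : result) {
            System.out.println(String.join(" | ", row)); // prints: 1 | Barajas | Madrid | España
        }
    }
}
```

A "left_outer" join differs only in that airport rows with no matching City would still be kept, with a null in the state column.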
Tags: java, dataframe, apache-spark, spark-dataframe  Source: https://codeday.me/bug/20190716/1474608.html