首页 > 数据库> > |NO.Z.00044|——————————|BigDataEnd|——|Hadoop&Spark.V05|------------------------------------------|Spa
|NO.Z.00044|——————————|BigDataEnd|——|Hadoop&Spark.V05|------------------------------------------|Spa
作者:互联网
[BigDataHadoop:Hadoop&Spark.V05] [BigDataHadoop.Spark内存级快速计算引擎][|章节四|Hadoop|spark|spark sql:spark sql编程&Transformation操作|]
一、Transformation 操作
### --- select * from tab where ... group by ... having... order by...
# --- 1、RDD类似的操作持久化
~~~ 缓存与checkpoint
~~~ select
~~~ where
~~~ group by / 聚合
~~~ order by
~~~ join
~~~ 集合操作
~~~ 空值操作(函数)
~~~ 函数
### --- 2、与RDD类似的操作
map、filter、flatMap、mapPartitions、sample、 randomSplit、
limit、distinct、dropDuplicates、describe
scala> df1.map(row=>row.getAs[Int](0)).show
+-----+
|value|
+-----+
| 7369|
| 7499|
| 7521|
| 7566|
| 7654|
| 7698|
| 7782|
| 7788|
| 7839|
| 7844|
| 7876|
| 7900|
| 7902|
| 7934|
+-----+
~~~ # randomSplit(与RDD类似,将DF、DS按给定参数分成多份)
scala> val df2 = df1.randomSplit(Array(0.5, 0.6, 0.7))
df2: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = Array([EMPNO: int, ENAME: string ... 6 more fields], [EMPNO: int, ENAME: string ... 6 more fields], [EMPNO: int, ENAME: string ... 6 more fields])
scala> df2(0).count
res76: Long = 2
scala> df2(1).count
res77: Long = 4
scala> df2(2).count
res78: Long = 8
~~~ # 取10行数据生成新的DataSet
scala> val df2 = df1.limit(10)
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [EMPNO: int, ENAME: string ... 6 more fields]
~~~ # distinct,去重
scala> val df2 = df1.union(df1)
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [EMPNO: int, ENAME: string ... 6 more fields]
scala> df2.distinct.count
res79: Long = 14
~~~ # dropDuplicates,按列值去重
scala> df2.dropDuplicates.show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
+-----+------+---------+----+-------------------+----+----+------+
scala> df2.dropDuplicates("mgr", "deptno").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
+-----+------+---------+----+-------------------+----+----+------+
scala> df2.dropDuplicates("mgr").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df2.dropDuplicates("deptno").show
+-----+-----+--------+----+-------------------+----+----+------+
|EMPNO|ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+-----+--------+----+-------------------+----+----+------+
| 7369|SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7782|CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7499|ALLEN|SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
+-----+-----+--------+----+-------------------+----+----+------+
~~~ # 返回全部列的统计(count、mean、stddev、min、max)
scala> ds1.describe().show
+-------+----+----------------+------------------+
|summary|name| age| height|
+-------+----+----------------+------------------+
| count| 3| 3| 3|
| mean|null| 18.0|164.33333333333334|
| stddev|null|9.16515138991168|20.008331597945226|
| min|Andy| 10| 144|
| max| Tom| 28| 184|
+-------+----+----------------+------------------+
~~~ # 返回指定列的统计
scala> ds1.describe("*").show
+-------+----+----------------+------------------+
|summary|name| age| height|
+-------+----+----------------+------------------+
| count| 3| 3| 3|
| mean|null| 18.0|164.33333333333334|
| stddev|null|9.16515138991168|20.008331597945226|
| min|Andy| 10| 144|
| max| Tom| 28| 184|
+-------+----+----------------+------------------+
### --- 3、存储相关
~~~ cacheTable、persist、checkpoint、unpersist、cache
~~~ 备注:Dataset 默认的存储级别是 MEMORY_AND_DISK
scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel
scala> spark.sparkContext.setCheckpointDir("hdfs://hadoop01:9000/checkpoint")
scala> df1.show()
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.checkpoint()
res36: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [EMPNO: int, ENAME: string ... 6 more fields]
scala> df1.cache()
res37: df1.type = [EMPNO: int, ENAME: string ... 6 more fields]
scala> df1.persist(StorageLevel.MEMORY_ONLY)
21/10/20 15:45:46 WARN CacheManager: Asked to cache already cached data.
res38: df1.type = [EMPNO: int, ENAME: string ... 6 more fields]
scala> df1.count()
res39: Long = 14
scala> df1.unpersist(true)
res40: df1.type = [EMPNO: int, ENAME: string ... 6 more fields]
scala> df1.createOrReplaceTempView("t1")
scala> spark.catalog.cacheTable("t1")
scala> spark.catalog.uncacheTable("t1")
### --- 4、select相关
~~~ 列的多种表示、select、selectExpr
~~~ drop、withColumn、withColumnRenamed、cast(内置函数)
~~~ # 列的多种表示方法。使用""、$""、'、col()、ds("")
~~~ # 注意:不要混用;必要时使用spark.implicitis._;并非每个表示在所有的地方都有效
scala> df1.select($"ename", $"hiredate", $"sal").show
+------+-------------------+----+
| ename| hiredate| sal|
+------+-------------------+----+
| SMITH|2001-01-02 22:12:13| 800|
| ALLEN|2002-01-02 22:12:13|1600|
| WARD|2003-01-02 22:12:13|1250|
| JONES|2004-01-02 22:12:13|2975|
|MARTIN|2005-01-02 22:12:13|1250|
| BLAKE|2005-04-02 22:12:13|2850|
| CLARK|2006-03-02 22:12:13|2450|
| SCOTT|2007-03-02 22:12:13|3000|
| KING|2006-03-02 22:12:13|5000|
|TURNER|2009-07-02 22:12:13|1500|
| ADAMS|2010-05-02 22:12:13|1100|
| JAMES|2011-06-02 22:12:13| 950|
| FORD|2011-07-02 22:12:13|3000|
|MILLER|2012-11-02 22:12:13|1300|
+------+-------------------+----+
scala> df1.select("ename", "hiredate", "sal").show
+------+-------------------+----+
| ename| hiredate| sal|
+------+-------------------+----+
| SMITH|2001-01-02 22:12:13| 800|
| ALLEN|2002-01-02 22:12:13|1600|
| WARD|2003-01-02 22:12:13|1250|
| JONES|2004-01-02 22:12:13|2975|
|MARTIN|2005-01-02 22:12:13|1250|
| BLAKE|2005-04-02 22:12:13|2850|
| CLARK|2006-03-02 22:12:13|2450|
| SCOTT|2007-03-02 22:12:13|3000|
| KING|2006-03-02 22:12:13|5000|
|TURNER|2009-07-02 22:12:13|1500|
| ADAMS|2010-05-02 22:12:13|1100|
| JAMES|2011-06-02 22:12:13| 950|
| FORD|2011-07-02 22:12:13|3000|
|MILLER|2012-11-02 22:12:13|1300|
+------+-------------------+----+
scala> df1.select('ename, 'hiredate, 'sal).show
+------+-------------------+----+
| ename| hiredate| sal|
+------+-------------------+----+
| SMITH|2001-01-02 22:12:13| 800|
| ALLEN|2002-01-02 22:12:13|1600|
| WARD|2003-01-02 22:12:13|1250|
| JONES|2004-01-02 22:12:13|2975|
|MARTIN|2005-01-02 22:12:13|1250|
| BLAKE|2005-04-02 22:12:13|2850|
| CLARK|2006-03-02 22:12:13|2450|
| SCOTT|2007-03-02 22:12:13|3000|
| KING|2006-03-02 22:12:13|5000|
|TURNER|2009-07-02 22:12:13|1500|
| ADAMS|2010-05-02 22:12:13|1100|
| JAMES|2011-06-02 22:12:13| 950|
| FORD|2011-07-02 22:12:13|3000|
|MILLER|2012-11-02 22:12:13|1300|
+------+-------------------+----+
scala> df1.select(col("ename"), col("hiredate"), col("sal")).show
+------+-------------------+----+
| ename| hiredate| sal|
+------+-------------------+----+
| SMITH|2001-01-02 22:12:13| 800|
| ALLEN|2002-01-02 22:12:13|1600|
| WARD|2003-01-02 22:12:13|1250|
| JONES|2004-01-02 22:12:13|2975|
|MARTIN|2005-01-02 22:12:13|1250|
| BLAKE|2005-04-02 22:12:13|2850|
| CLARK|2006-03-02 22:12:13|2450|
| SCOTT|2007-03-02 22:12:13|3000|
| KING|2006-03-02 22:12:13|5000|
|TURNER|2009-07-02 22:12:13|1500|
| ADAMS|2010-05-02 22:12:13|1100|
| JAMES|2011-06-02 22:12:13| 950|
| FORD|2011-07-02 22:12:13|3000|
|MILLER|2012-11-02 22:12:13|1300|
+------+-------------------+----+
scala> df1.select(df1("ename"), df1("hiredate"), df1("sal")).show
+------+-------------------+----+
| ename| hiredate| sal|
+------+-------------------+----+
| SMITH|2001-01-02 22:12:13| 800|
| ALLEN|2002-01-02 22:12:13|1600|
| WARD|2003-01-02 22:12:13|1250|
| JONES|2004-01-02 22:12:13|2975|
|MARTIN|2005-01-02 22:12:13|1250|
| BLAKE|2005-04-02 22:12:13|2850|
| CLARK|2006-03-02 22:12:13|2450|
| SCOTT|2007-03-02 22:12:13|3000|
| KING|2006-03-02 22:12:13|5000|
|TURNER|2009-07-02 22:12:13|1500|
| ADAMS|2010-05-02 22:12:13|1100|
| JAMES|2011-06-02 22:12:13| 950|
| FORD|2011-07-02 22:12:13|3000|
|MILLER|2012-11-02 22:12:13|1300|
+------+-------------------+----+
~~~ # 下面的写法无效,其他列的表示法有效
scala> df1.select("ename", "hiredate", "sal"+100).show
scala> df1.select("ename", "hiredate", "sal+100").show
~~~ # 这样写才符合语法
scala> df1.select($"ename", $"hiredate", $"sal"+100).show
+------+-------------------+-----------+
| ename| hiredate|(sal + 100)|
+------+-------------------+-----------+
| SMITH|2001-01-02 22:12:13| 900|
| ALLEN|2002-01-02 22:12:13| 1700|
| WARD|2003-01-02 22:12:13| 1350|
| JONES|2004-01-02 22:12:13| 3075|
|MARTIN|2005-01-02 22:12:13| 1350|
| BLAKE|2005-04-02 22:12:13| 2950|
| CLARK|2006-03-02 22:12:13| 2550|
| SCOTT|2007-03-02 22:12:13| 3100|
| KING|2006-03-02 22:12:13| 5100|
|TURNER|2009-07-02 22:12:13| 1600|
| ADAMS|2010-05-02 22:12:13| 1200|
| JAMES|2011-06-02 22:12:13| 1050|
| FORD|2011-07-02 22:12:13| 3100|
|MILLER|2012-11-02 22:12:13| 1400|
+------+-------------------+-----------+
scala> df1.select('ename, 'hiredate, 'sal+100).show
+------+-------------------+-----------+
| ename| hiredate|(sal + 100)|
+------+-------------------+-----------+
| SMITH|2001-01-02 22:12:13| 900|
| ALLEN|2002-01-02 22:12:13| 1700|
| WARD|2003-01-02 22:12:13| 1350|
| JONES|2004-01-02 22:12:13| 3075|
|MARTIN|2005-01-02 22:12:13| 1350|
| BLAKE|2005-04-02 22:12:13| 2950|
| CLARK|2006-03-02 22:12:13| 2550|
| SCOTT|2007-03-02 22:12:13| 3100|
| KING|2006-03-02 22:12:13| 5100|
|TURNER|2009-07-02 22:12:13| 1600|
| ADAMS|2010-05-02 22:12:13| 1200|
| JAMES|2011-06-02 22:12:13| 1050|
| FORD|2011-07-02 22:12:13| 3100|
|MILLER|2012-11-02 22:12:13| 1400|
+------+-------------------+-----------+
~~~ # 可使用expr表达式(expr里面只能使用引号)
scala> df1.select(expr("comm+100"), expr("sal+100"), expr("ename")).show
+------------+-----------+------+
|(comm + 100)|(sal + 100)| ename|
+------------+-----------+------+
| null| 900| SMITH|
| 400| 1700| ALLEN|
| 600| 1350| WARD|
| null| 3075| JONES|
| 1500| 1350|MARTIN|
| null| 2950| BLAKE|
| null| 2550| CLARK|
| null| 3100| SCOTT|
| null| 5100| KING|
| 100| 1600|TURNER|
| null| 1200| ADAMS|
| null| 1050| JAMES|
| null| 3100| FORD|
| null| 1400|MILLER|
+------------+-----------+------+
scala> df1.selectExpr("ename as name").show
+------+
| name|
+------+
| SMITH|
| ALLEN|
| WARD|
| JONES|
|MARTIN|
| BLAKE|
| CLARK|
| SCOTT|
| KING|
|TURNER|
| ADAMS|
| JAMES|
| FORD|
|MILLER|
+------+
scala> df1.selectExpr("power(sal, 2)", "sal").show
+---------------------------------------------+----+
|POWER(CAST(sal AS DOUBLE), CAST(2 AS DOUBLE))| sal|
+---------------------------------------------+----+
| 640000.0| 800|
| 2560000.0|1600|
| 1562500.0|1250|
| 8850625.0|2975|
| 1562500.0|1250|
| 8122500.0|2850|
| 6002500.0|2450|
| 9000000.0|3000|
| 2.5E7|5000|
| 2250000.0|1500|
| 1210000.0|1100|
| 902500.0| 950|
| 9000000.0|3000|
| 1690000.0|1300|
+---------------------------------------------+----+
scala> df1.selectExpr("round(sal, -3) as newsal", "sal", "ename").show
+------+----+------+
|newsal| sal| ename|
+------+----+------+
| 1000| 800| SMITH|
| 2000|1600| ALLEN|
| 1000|1250| WARD|
| 3000|2975| JONES|
| 1000|1250|MARTIN|
| 3000|2850| BLAKE|
| 2000|2450| CLARK|
| 3000|3000| SCOTT|
| 5000|5000| KING|
| 2000|1500|TURNER|
| 1000|1100| ADAMS|
| 1000| 950| JAMES|
| 3000|3000| FORD|
| 1000|1300|MILLER|
+------+----+------+
~~~ # drop、withColumn、 withColumnRenamed、casting
~~~ # drop 删除一个或多个列,得到新的DF
scala> df1.drop("mgr")
res42: org.apache.spark.sql.DataFrame = [EMPNO: int, ENAME: string ... 5 more fields]
scala> df1.drop("empno", "mgr")
res43: org.apache.spark.sql.DataFrame = [ENAME: string, JOB: string ... 4 more fields]
~~~ # withColumn,修改列值
scala> val df2 = df1.withColumn("sal", $"sal"+1000)
df2: org.apache.spark.sql.DataFrame = [EMPNO: int, ENAME: string ... 6 more fields]
scala> df2.show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| sal|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13|1800|null| 20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|2600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|2250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|3975|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|2250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|3850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|3450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|4000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|6000|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|2500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|2100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13|1950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|4000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|2300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
~~~ # withColumnRenamed,更改列名
~~~ 备注:drop、withColumn、withColumnRenamed返回的是DF
scala> df1.withColumnRenamed("sal", "newsal")
res45: org.apache.spark.sql.DataFrame = [EMPNO: int, ENAME: string ... 6 more fields]
~~~ # cast,类型转换
scala> df1.selectExpr("cast(empno as string)").printSchema
root
|-- empno: string (nullable = true)
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> df1.select('empno.cast(StringType)).printSchema
root
|-- empno: string (nullable = true)
### --- 5、where相关
~~~ where == filter
~~~ # where操作
scala> df1.filter("sal>1000").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.filter("sal>1000 and job=='MANAGER'").show
+-----+-----+-------+----+-------------------+----+----+------+
|EMPNO|ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+-----+-------+----+-------------------+----+----+------+
| 7566|JONES|MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698|BLAKE|MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782|CLARK|MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
+-----+-----+-------+----+-------------------+----+----+------+
~~~ # filter操作
scala> df1.where("sal>1000").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.where("sal>1000 and job=='MANAGER'").show
+-----+-----+-------+----+-------------------+----+----+------+
|EMPNO|ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+-----+-------+----+-------------------+----+----+------+
| 7566|JONES|MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698|BLAKE|MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782|CLARK|MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
+-----+-----+-------+----+-------------------+----+----+------+
### --- 6、groupBy相关
~~~ groupBy、agg、max、min、avg、sum、count(后面5个为内置函数)
~~~ # groupBy、max、min、mean、sum1 、count(与df1.count不同)
scala> df1.groupBy("Job").sum("sal").show
+---------+--------+
| Job|sum(sal)|
+---------+--------+
| ANALYST| 6000|
| SALESMAN| 5600|
| CLERK| 4150|
| MANAGER| 8275|
|PRESIDENT| 5000|
+---------+--------+
scala> df1.groupBy("Job").max("sal").show
+---------+--------+
| Job|max(sal)|
+---------+--------+
| ANALYST| 3000|
| SALESMAN| 1600|
| CLERK| 1300|
| MANAGER| 2975|
|PRESIDENT| 5000|
+---------+--------+
scala> df1.groupBy("Job").min("sal").show
+---------+--------+
| Job|min(sal)|
+---------+--------+
| ANALYST| 3000|
| SALESMAN| 1250|
| CLERK| 800|
| MANAGER| 2450|
|PRESIDENT| 5000|
+---------+--------+
scala> df1.groupBy("Job").avg("sal").show
+---------+------------------+
| Job| avg(sal)|
+---------+------------------+
| ANALYST| 3000.0|
| SALESMAN| 1400.0|
| CLERK| 1037.5|
| MANAGER|2758.3333333333335|
|PRESIDENT| 5000.0|
+---------+------------------+
scala> df1.groupBy("Job").count.show
+---------+-----+
| Job|count|
+---------+-----+
| ANALYST| 2|
| SALESMAN| 4|
| CLERK| 4|
| MANAGER| 3|
|PRESIDENT| 1|
+---------+-----+
~~~ # 类似having子句
scala> df1.groupBy("Job").avg("sal").where("avg(sal) > 2000").show
+---------+------------------+
| Job| avg(sal)|
+---------+------------------+
| ANALYST| 3000.0|
| MANAGER|2758.3333333333335|
|PRESIDENT| 5000.0|
+---------+------------------+
scala> df1.groupBy("Job").avg("sal").where($"avg(sal)" > 2000).show
+---------+------------------+
| Job| avg(sal)|
+---------+------------------+
| ANALYST| 3000.0|
| MANAGER|2758.3333333333335|
|PRESIDENT| 5000.0|
+---------+------------------+
~~~ # agg
scala> df1.groupBy("Job").agg("sal"->"max", "sal"->"min", "sal"->"avg", "sal"->"sum", "sal"->"count").show
+---------+--------+--------+------------------+--------+----------+
| Job|max(sal)|min(sal)| avg(sal)|sum(sal)|count(sal)|
+---------+--------+--------+------------------+--------+----------+
| ANALYST| 3000| 3000| 3000.0| 6000| 2|
| SALESMAN| 1600| 1250| 1400.0| 5600| 4|
| CLERK| 1300| 800| 1037.5| 4150| 4|
| MANAGER| 2975| 2450|2758.3333333333335| 8275| 3|
|PRESIDENT| 5000| 5000| 5000.0| 5000| 1|
+---------+--------+--------+------------------+--------+----------+
scala> df1.groupBy("deptno").agg("sal"->"max", "sal"->"min", "sal"->"avg", "sal"->"sum", "sal"->"count").show
+------+--------+--------+------------------+--------+----------+
|deptno|max(sal)|min(sal)| avg(sal)|sum(sal)|count(sal)|
+------+--------+--------+------------------+--------+----------+
| 20| 3000| 800| 2175.0| 10875| 5|
| 10| 5000| 1300|2916.6666666666665| 8750| 3|
| 30| 2850| 950|1566.6666666666667| 9400| 6|
+------+--------+--------+------------------+--------+----------+
~~~ # 这种方式更好理解
scala> df1.groupBy("Job").agg(max("sal"), min("sal"), avg("sal"), sum("sal"), count("sal")).show
+---------+--------+--------+------------------+--------+----------+
| Job|max(sal)|min(sal)| avg(sal)|sum(sal)|count(sal)|
+---------+--------+--------+------------------+--------+----------+
| ANALYST| 3000| 3000| 3000.0| 6000| 2|
| SALESMAN| 1600| 1250| 1400.0| 5600| 4|
| CLERK| 1300| 800| 1037.5| 4150| 4|
| MANAGER| 2975| 2450|2758.3333333333335| 8275| 3|
|PRESIDENT| 5000| 5000| 5000.0| 5000| 1|
+---------+--------+--------+------------------+--------+----------+
~~~ # 给列取别名
scala> df1.groupBy("Job").agg(max("sal"), min("sal"), avg("sal"),
| sum("sal"), count("sal")).withColumnRenamed("min(sal)",
| "min1").show
+---------+--------+----+------------------+--------+----------+
| Job|max(sal)|min1| avg(sal)|sum(sal)|count(sal)|
+---------+--------+----+------------------+--------+----------+
| ANALYST| 3000|3000| 3000.0| 6000| 2|
| SALESMAN| 1600|1250| 1400.0| 5600| 4|
| CLERK| 1300| 800| 1037.5| 4150| 4|
| MANAGER| 2975|2450|2758.3333333333335| 8275| 3|
|PRESIDENT| 5000|5000| 5000.0| 5000| 1|
+---------+--------+----+------------------+--------+----------+
~~~ # 给列取别名,最简便
scala> df1.groupBy("Job").agg(max("sal").as("max1"),
| min("sal").as("min2"), avg("sal").as("avg3"),
| sum("sal").as("sum4"), count("sal").as("count5")).show
+---------+----+----+------------------+----+------+
| Job|max1|min2| avg3|sum4|count5|
+---------+----+----+------------------+----+------+
| ANALYST|3000|3000| 3000.0|6000| 2|
| SALESMAN|1600|1250| 1400.0|5600| 4|
| CLERK|1300| 800| 1037.5|4150| 4|
| MANAGER|2975|2450|2758.3333333333335|8275| 3|
|PRESIDENT|5000|5000| 5000.0|5000| 1|
+---------+----+----+------------------+----+------+
### --- 7、orderBy相关
~~~ orderBy == sort
~~~ # orderBy
scala> df1.orderBy("sal").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.orderBy($"sal").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.orderBy($"sal".asc).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
~~~ # 降序
scala> df1.orderBy(-$"sal").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.orderBy('sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.orderBy(col("sal")).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.orderBy(df1("sal")).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.orderBy($"sal".desc).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.orderBy(-'sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.orderBy(-'deptno, -'sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
~~~ # sort,以下语句等价
scala> df1.sort("sal").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.sort($"sal").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.sort($"sal".asc).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.sort('sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.sort(col("sal")).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.sort(df1("sal")).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.sort($"sal".desc).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.sort(-'sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.sort(-'deptno, -'sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
### --- 8、join相关
~~~ # 1、笛卡尔积
df1.crossJoin(df1).count
~~~ # 2、等值连接(单字段)(连接字段empno,仅显示了一次)
df1.join(df1, "empno").count
~~~ # 3、等值连接(多字段)(连接字段empno、ename,仅显示了一次)
df1.join(df1, Seq("empno", "ename")).show
~~~ # 定义第一个数据集
case class StudentAge(sno: Int, name: String, age: Int)
val lst = List(StudentAge(1,"Alice", 18), StudentAge(2,"Andy", 19), StudentAge(3,"Bob", 17), StudentAge(4,"Justin", 21),
StudentAge(5,"Cindy", 20))
val ds1 = spark.createDataset(lst)
ds1.show()
~~~ # 定义第二个数据集
case class StudentHeight(sname: String, height: Int)
val rdd = sc.makeRDD(List(StudentHeight("Alice", 160),
StudentHeight("Andy", 159), StudentHeight("Bob", 170),
StudentHeight("Cindy", 165), StudentHeight("Rose", 160)))
val ds2 = rdd.toDS
~~~ # 备注:不能使用双引号,而且这里是 ===
ds1.join(ds2, $"name"===$"sname").show
ds1.join(ds2, 'name==='sname).show
ds1.join(ds2, ds1("name")===ds2("sname")).show
ds1.join(ds2, ds1("sname")===ds2("sname"), "inner").show
~~~ # 多种连接方式
ds1.join(ds2, $"name"===$"sname").show
ds1.join(ds2, $"name"===$"sname", "inner").show
ds1.join(ds2, $"name"===$"sname", "left").show
ds1.join(ds2, $"name"===$"sname", "left_outer").show
ds1.join(ds2, $"name"===$"sname", "right").show
ds1.join(ds2, $"name"===$"sname", "right_outer").show
ds1.join(ds2, $"name"===$"sname", "outer").show
ds1.join(ds2, $"name"===$"sname", "full").show
ds1.join(ds2, $"name"===$"sname", "full_outer").show
~~~ # 备注:DS在join操作之后变成了DF
### --- 9、集合相关
~~~ union==unionAll(过期)、intersect、except
~~~ # union、unionAll、intersect、except。集合的交、并、差
scala> val ds3 = ds1.select("name")
scala> val ds4 = ds2.select("sname")
~~~ # union 求并集,不去重
scala> ds3.union(ds4).show
~~~ # unionAll、union 等价;unionAll过期方法,不建议使用
scala> ds3.unionAll(ds4).show
~~~ # intersect 求交
scala> ds3.intersect(ds4).show
~~~ # except 求差
scala> ds3.except(ds4).show
### --- 10、空值处理
~~~ na.fill、na.drop
~~~ # NaN (Not a Number)
scala> math.sqrt(-1.0)
res90: Double = NaN
scala> math.sqrt(-1.0).isNaN()
res91: Boolean = true
scala> df1.show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
~~~ # 删除所有列的空值和NaN
scala> df1.na.drop.show
+-----+------+--------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+--------+----+-------------------+----+----+------+
| 7499| ALLEN|SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD|SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7654|MARTIN|SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7844|TURNER|SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
+-----+------+--------+----+-------------------+----+----+------+
~~~ # 删除某列的空值和NaN
scala> df1.na.drop(Array("mgr")).show
+-----+------+--------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+--------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7499| ALLEN|SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD|SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7654|MARTIN|SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7844|TURNER|SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+--------+----+-------------------+----+----+------+
~~~ # 对全部列填充;对指定单列填充;对指定多列填充
scala> df1.na.fill(1000).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|1000| 20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|1000| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|1000| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|1000| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|1000| 20|
| 7839| KING|PRESIDENT|1000|2006-03-02 22:12:13|5000|1000| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|1000| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|1000| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|1000| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|1000| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.na.fill(1000, Array("comm")).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|1000| 20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|1000| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|1000| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|1000| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|1000| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|1000| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|1000| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|1000| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|1000| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|1000| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.na.fill(Map("mgr"->2000, "comm"->1000)).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|1000| 20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|1000| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|1000| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|1000| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|1000| 20|
| 7839| KING|PRESIDENT|2000|2006-03-02 22:12:13|5000|1000| 10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|1000| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|1000| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|1000| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|1000| 10|
+-----+------+---------+----+-------------------+----+----+------+
~~~ # 对指定的值进行替换
scala> df1.na.replace("comm" :: "deptno" :: Nil, Map(0 -> 100, 10 -> 100)).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 100|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 100|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 100| 30|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 100|
+-----+------+---------+----+-------------------+----+----+------+
~~~ # 查询空值列或非空值列。isNull、isNotNull为内置函数
scala> df1.filter("comm is null").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.filter($"comm".isNull).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.filter(col("comm").isNull).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH| CLERK|7902|2001-01-02 22:12:13| 800|null| 20|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null| 20|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null| 30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null| 10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null| 20|
| 7839| KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null| 10|
| 7876| ADAMS| CLERK|7788|2010-05-02 22:12:13|1100|null| 20|
| 7900| JAMES| CLERK|7698|2011-06-02 22:12:13| 950|null| 30|
| 7902| FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null| 20|
| 7934|MILLER| CLERK|7782|2012-11-02 22:12:13|1300|null| 10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.filter("comm is not null").show
+-----+------+--------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+--------+----+-------------------+----+----+------+
| 7499| ALLEN|SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD|SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7654|MARTIN|SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7844|TURNER|SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
+-----+------+--------+----+-------------------+----+----+------+
scala> df1.filter(col("comm").isNotNull).show
+-----+------+--------+----+-------------------+----+----+------+
|EMPNO| ENAME| JOB| MGR| HIREDATE| SAL|COMM|DEPTNO|
+-----+------+--------+----+-------------------+----+----+------+
| 7499| ALLEN|SALESMAN|7698|2002-01-02 22:12:13|1600| 300| 30|
| 7521| WARD|SALESMAN|7698|2003-01-02 22:12:13|1250| 500| 30|
| 7654|MARTIN|SALESMAN|7698|2005-01-02 22:12:13|1250|1400| 30|
| 7844|TURNER|SALESMAN|7698|2009-07-02 22:12:13|1500| 0| 30|
+-----+------+--------+----+-------------------+----+----+------+
### --- 11、窗口函数
~~~ 一般情况下窗口函数不用 DSL 处理,直接用SQL更方便
~~~ 参考源码Window.scala、WindowSpec.scala(主要)
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val w1 = Window.partitionBy("cookieid").orderBy("createtime")
w1: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@32a97ef8
scala> val w2 = Window.partitionBy("cookieid").orderBy("pv")
w2: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@dac633
scala> val w3 = w1.rowsBetween(Window.unboundedPreceding,
| Window.currentRow)
w3: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@2b1e2939
scala> val w4 = w1.rowsBetween(-1, 1)
w4: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@217d9e78
~~~ # 聚组函数【用分析函数的数据集】
scala> df.select($"cookieid", $"pv", sum("pv").over(w1).alias("pv1")).show
scala> df.select($"cookieid", $"pv", sum("pv").over(w3).alias("pv1")).show
scala> df.select($"cookieid", $"pv", sum("pv").over(w4).as("pv1")).show
~~~ # 排名
scala> df.select($"cookieid", $"pv", rank().over(w2).alias("rank")).show
scala> df.select($"cookieid", $"pv", dense_rank().over(w2).alias("denserank")).show
scala> df.select($"cookieid", $"pv", row_number().over(w2).alias("rownumber")).show
~~~ # lag、lead
scala> df.select($"cookieid", $"pv", lag("pv", 2).over(w2).alias("rownumber")).show
scala> df.select($"cookieid", $"pv", lag("pv", -2).over(w2).alias("rownumber")).show
### --- 12、内建函数
~~~ http://spark.apache.org/docs/latest/api/sql/index.html
二、编程代码实现### --- 编程代码实现
package cn.yanqi.sparksql
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
object TransformationDemo {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("Demo1")
.master("local[*]")
.getOrCreate()
val sc = spark.sparkContext
sc.setLogLevel("warn")
import spark.implicits._
val df1: DataFrame = spark.read
.option("header", "true")
.option("inferschema", "true")
.csv("data/emp.dat")
// df1.printSchema()
// df1.map(row=>row.getAs[Int](0)).show
// // randomSplit(与RDD类似,将DF、DS按给定参数分成多份)
// val Array(dfx, dfy, dfz) = df1.randomSplit(Array(0.5, 0.6, 0.7))
// dfx.count
// dfy.count
// dfz.count
//
// // 取10行数据生成新的DataSet
// val df2 = df1.limit(10)
//
// // distinct,去重
// val df3 = df1.union(df1)
// df3.distinct.count
//
// // dropDuplicates,按列值去重
// df2.dropDuplicates.show
// df2.dropDuplicates("mgr", "deptno").show
// df2.dropDuplicates("mgr").show
// df2.dropDuplicates("deptno").show
//
// // 返回全部列的统计(count、mean、stddev、min、max)
// df1.describe().show
//
// // 返回指定列的统计
// df1.describe("sal").show
// df1.describe("sal", "comm").show
//
// df1.createOrReplaceTempView("t1")
// spark.sql("select * from t1").show
// spark.catalog.cacheTable("t1")
// spark.catalog.uncacheTable("t1")
import org.apache.spark.sql.functions._
df1.groupBy("Job").agg(min("sal").as("minsal"), max("sal").as("maxsal")).where($"minsal" > 2000).show
val lst = List(StudentAge(1,"Alice", 18), StudentAge(2,"Andy", 19), StudentAge(3,"Bob", 17), StudentAge(4,"Justin", 21), StudentAge(5,"Cindy", 20))
val ds1 = spark.createDataset(lst)
ds1.show()
// 定义第二个数据集
val rdd = sc.makeRDD(List(StudentHeight("Alice", 160), StudentHeight("Andy", 159), StudentHeight("Bob", 170), StudentHeight("Cindy", 165), StudentHeight("Rose", 160)))
val ds2 = rdd.toDS
spark.close()
}
}
case class StudentAge(sno: Int, sname: String, age: Int)
case class StudentHeight(sname: String, height: Int)
### --- 编译打印
~~~ # 准备数据文件:data/emp.dat
~~~ # 编译打印
D:\JAVA\jdk1.8.0_231\bin\java.exe "-javaagent:D:\IntelliJIDEA\IntelliJ IDEA 2019.3.3\lib\idea_rt.jar=56671:D:\IntelliJIDEA\IntelliJ IDEA 2019.3.3\bin" -Dfile.encoding=UTF-8 -classpath D:\JAVA\jdk1.8.0_231\jre\lib\charsets.jar;D:\JAVA\jdk1.8.0_231\jre\lib\deploy.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\access-bridge-64.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\cldrdata.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\dnsns.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\jaccess.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\jfxrt.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\localedata.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\nashorn.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\sunec.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\sunjce_provider.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\sunmscapi.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\sunpkcs11.jar;D:\JAVA\jdk1.8.0_231\jre\lib\ext\zipfs.jar;D:\JAVA\jdk1.8.0_231\jre\lib\javaws.jar;D:\JAVA\jdk1.8.0_231\jre\lib\jce.jar;D:\JAVA\jdk1.8.0_231\jre\lib\jfr.jar;D:\JAVA\jdk1.8.0_231\jre\lib\jfxswt.jar;D:\JAVA\jdk1.8.0_231\jre\lib\jsse.jar;D:\JAVA\jdk1.8.0_231\jre\lib\management-agent.jar;D:\JAVA\jdk1.8.0_231\jre\lib\plugin.jar;D:\JAVA\jdk1.8.0_231\jre\lib\resources.jar;D:\JAVA\jdk1.8.0_231\jre\lib\rt.jar;E:\NO.Z.80000.Hadoop.spark\SparkBigData\target\classes;C:\Users\Administrator\.m2\repository\org\scala-lang\scala-library\2.12.10\scala-library-2.12.10.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-core_2.12\2.4.5\spark-core_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\com\thoughtworks\paranamer\paranamer\2.8\paranamer-2.8.jar;C:\Users\Administrator\.m2\repository\org\apache\avro\avro\1.8.2\avro-1.8.2.jar;C:\Users\Administrator\.m2\repository\org\codehaus\jackson\jackson-core-asl\1.9.13\jackson-core-asl-1.9.13.jar;C:\Users\Administrator\.m2\repository\org\apache\commons\commons-compress\1.8.1\commons-compress-1.8.1.jar;C:\Users\Administrator\.m2\repository\org\tukaani\xz\1.5\xz-1.5.jar;C:\Users\Administrator\.m2\repository\org\apache\avro\avro-mapred\1.8.2\avro-mapred-1.8.2-hadoop2.jar;C:\Users\Administrator\.m2\repository\org\apache\avro\avro-ipc\1.8.2\avro-ipc-1.8.2.jar;C:\Users\Administrator\.m2\repository\com\twitter\chill_2.12\0.9.3\chill_2.12-0.9.3.jar;C:\Users\Administrator\.m2\repository\com\esotericsoftware\kryo-shaded\4.0.2\kryo-shaded-4.0.2.jar;C:\Users\Administrator\.m2\repository\com\esotericsoftware\minlog\1.3.0\minlog-1.3.0.jar;C:\Users\Administrator\.m2\repository\org\objenesis\objenesis\2.5.1\objenesis-2.5.1.jar;C:\Users\Administrator\.m2\repository\com\twitter\chill-java\0.9.3\chill-java-0.9.3.jar;C:\Users\Administrator\.m2\repository\org\apache\xbean\xbean-asm6-shaded\4.8\xbean-asm6-shaded-4.8.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-client\2.6.5\hadoop-client-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-common\2.6.5\hadoop-common-2.6.5.jar;C:\Users\Administrator\.m2\repository\xmlenc\xmlenc\0.52\xmlenc-0.52.jar;C:\Users\Administrator\.m2\repository\commons-collections\commons-collections\3.2.2\commons-collections-3.2.2.jar;C:\Users\Administrator\.m2\repository\commons-configuration\commons-configuration\1.6\commons-configuration-1.6.jar;C:\Users\Administrator\.m2\repository\commons-digester\commons-digester\1.8\commons-digester-1.8.jar;C:\Users\Administrator\.m2\repository\commons-beanutils\commons-beanutils\1.7.0\commons-beanutils-1.7.0.jar;C:\Users\Administrator\.m2\repository\com\google\code\gson\gson\2.2.4\gson-2.2.4.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-auth\2.6.5\hadoop-auth-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\directory\server\apacheds-kerberos-codec\2.0.0-M15\apacheds-kerberos-codec-2.0.0-M15.jar;C:\Users\Administrator\.m2\repository\org\apache\directory\server\apacheds-i18n\2.0.0-M15\apacheds-i18n-2.0.0-M15.jar;C:\Users\Administrator\.m2\repository\org\apache\directory\api\api-asn1-api\1.0.0-M20\api-asn1-api-1.0.0-M20.jar;C:\Users\Administrator\.m2\repository\org\apache\directory\api\api-util\1.0.0-M20\api-util-1.0.0-M20.jar;C:\Users\Administrator\.m2\repository\org\apache\curator\curator-client\2.6.0\curator-client-2.6.0.jar;C:\Users\Administrator\.m2\repository\org\htrace\htrace-core\3.0.4\htrace-core-3.0.4.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-hdfs\2.6.5\hadoop-hdfs-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\mortbay\jetty\jetty-util\6.1.26\jetty-util-6.1.26.jar;C:\Users\Administrator\.m2\repository\xerces\xercesImpl\2.9.1\xercesImpl-2.9.1.jar;C:\Users\Administrator\.m2\repository\xml-apis\xml-apis\1.3.04\xml-apis-1.3.04.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-mapreduce-client-app\2.6.5\hadoop-mapreduce-client-app-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-mapreduce-client-common\2.6.5\hadoop-mapreduce-client-common-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-yarn-client\2.6.5\hadoop-yarn-client-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-yarn-server-common\2.6.5\hadoop-yarn-server-common-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-mapreduce-client-shuffle\2.6.5\hadoop-mapreduce-client-shuffle-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-yarn-api\2.6.5\hadoop-yarn-api-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-mapreduce-client-core\2.6.5\hadoop-mapreduce-client-core-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-yarn-common\2.6.5\hadoop-yarn-common-2.6.5.jar;C:\Users\Administrator\.m2\repository\javax\xml\bind\jaxb-api\2.2.2\jaxb-api-2.2.2.jar;C:\Users\Administrator\.m2\repository\javax\xml\stream\stax-api\1.0-2\stax-api-1.0-2.jar;C:\Users\Administrator\.m2\repository\org\codehaus\jackson\jackson-jaxrs\1.9.13\jackson-jaxrs-1.9.13.jar;C:\Users\Administrator\.m2\repository\org\codehaus\jackson\jackson-xc\1.9.13\jackson-xc-1.9.13.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-mapreduce-client-jobclient\2.6.5\hadoop-mapreduce-client-jobclient-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\hadoop\hadoop-annotations\2.6.5\hadoop-annotations-2.6.5.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-launcher_2.12\2.4.5\spark-launcher_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-kvstore_2.12\2.4.5\spark-kvstore_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\org\fusesource\leveldbjni\leveldbjni-all\1.8\leveldbjni-all-1.8.jar;C:\Users\Administrator\.m2\repository\com\fasterxml\jackson\core\jackson-core\2.6.7\jackson-core-2.6.7.jar;C:\Users\Administrator\.m2\repository\com\fasterxml\jackson\core\jackson-annotations\2.6.7\jackson-annotations-2.6.7.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-network-common_2.12\2.4.5\spark-network-common_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-network-shuffle_2.12\2.4.5\spark-network-shuffle_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-unsafe_2.12\2.4.5\spark-unsafe_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\javax\activation\activation\1.1.1\activation-1.1.1.jar;C:\Users\Administrator\.m2\repository\org\apache\curator\curator-recipes\2.6.0\curator-recipes-2.6.0.jar;C:\Users\Administrator\.m2\repository\org\apache\curator\curator-framework\2.6.0\curator-framework-2.6.0.jar;C:\Users\Administrator\.m2\repository\com\google\guava\guava\16.0.1\guava-16.0.1.jar;C:\Users\Administrator\.m2\repository\org\apache\zookeeper\zookeeper\3.4.6\zookeeper-3.4.6.jar;C:\Users\Administrator\.m2\repository\javax\servlet\javax.servlet-api\3.1.0\javax.servlet-api-3.1.0.jar;C:\Users\Administrator\.m2\repository\org\apache\commons\commons-lang3\3.5\commons-lang3-3.5.jar;C:\Users\Administrator\.m2\repository\org\apache\commons\commons-math3\3.4.1\commons-math3-3.4.1.jar;C:\Users\Administrator\.m2\repository\com\google\code\findbugs\jsr305\1.3.9\jsr305-1.3.9.jar;C:\Users\Administrator\.m2\repository\org\slf4j\slf4j-api\1.7.16\slf4j-api-1.7.16.jar;C:\Users\Administrator\.m2\repository\org\slf4j\jul-to-slf4j\1.7.16\jul-to-slf4j-1.7.16.jar;C:\Users\Administrator\.m2\repository\org\slf4j\jcl-over-slf4j\1.7.16\jcl-over-slf4j-1.7.16.jar;C:\Users\Administrator\.m2\repository\log4j\log4j\1.2.17\log4j-1.2.17.jar;C:\Users\Administrator\.m2\repository\org\slf4j\slf4j-log4j12\1.7.16\slf4j-log4j12-1.7.16.jar;C:\Users\Administrator\.m2\repository\com\ning\compress-lzf\1.0.3\compress-lzf-1.0.3.jar;C:\Users\Administrator\.m2\repository\org\xerial\snappy\snappy-java\1.1.7.3\snappy-java-1.1.7.3.jar;C:\Users\Administrator\.m2\repository\org\lz4\lz4-java\1.4.0\lz4-java-1.4.0.jar;C:\Users\Administrator\.m2\repository\com\github\luben\zstd-jni\1.3.2-2\zstd-jni-1.3.2-2.jar;C:\Users\Administrator\.m2\repository\org\roaringbitmap\RoaringBitmap\0.7.45\RoaringBitmap-0.7.45.jar;C:\Users\Administrator\.m2\repository\org\roaringbitmap\shims\0.7.45\shims-0.7.45.jar;C:\Users\Administrator\.m2\repository\commons-net\commons-net\3.1\commons-net-3.1.jar;C:\Users\Administrator\.m2\repository\org\json4s\json4s-jackson_2.12\3.5.3\json4s-jackson_2.12-3.5.3.jar;C:\Users\Administrator\.m2\repository\org\json4s\json4s-core_2.12\3.5.3\json4s-core_2.12-3.5.3.jar;C:\Users\Administrator\.m2\repository\org\json4s\json4s-ast_2.12\3.5.3\json4s-ast_2.12-3.5.3.jar;C:\Users\Administrator\.m2\repository\org\json4s\json4s-scalap_2.12\3.5.3\json4s-scalap_2.12-3.5.3.jar;C:\Users\Administrator\.m2\repository\org\scala-lang\modules\scala-xml_2.12\1.0.6\scala-xml_2.12-1.0.6.jar;C:\Users\Administrator\.m2\repository\org\glassfish\jersey\core\jersey-client\2.22.2\jersey-client-2.22.2.jar;C:\Users\Administrator\.m2\repository\javax\ws\rs\javax.ws.rs-api\2.0.1\javax.ws.rs-api-2.0.1.jar;C:\Users\Administrator\.m2\repository\org\glassfish\hk2\hk2-api\2.4.0-b34\hk2-api-2.4.0-b34.jar;C:\Users\Administrator\.m2\repository\org\glassfish\hk2\hk2-utils\2.4.0-b34\hk2-utils-2.4.0-b34.jar;C:\Users\Administrator\.m2\repository\org\glassfish\hk2\external\aopalliance-repackaged\2.4.0-b34\aopalliance-repackaged-2.4.0-b34.jar;C:\Users\Administrator\.m2\repository\org\glassfish\hk2\external\javax.inject\2.4.0-b34\javax.inject-2.4.0-b34.jar;C:\Users\Administrator\.m2\repository\org\glassfish\hk2\hk2-locator\2.4.0-b34\hk2-locator-2.4.0-b34.jar;C:\Users\Administrator\.m2\repository\org\javassist\javassist\3.18.1-GA\javassist-3.18.1-GA.jar;C:\Users\Administrator\.m2\repository\org\glassfish\jersey\core\jersey-common\2.22.2\jersey-common-2.22.2.jar;C:\Users\Administrator\.m2\repository\javax\annotation\javax.annotation-api\1.2\javax.annotation-api-1.2.jar;C:\Users\Administrator\.m2\repository\org\glassfish\jersey\bundles\repackaged\jersey-guava\2.22.2\jersey-guava-2.22.2.jar;C:\Users\Administrator\.m2\repository\org\glassfish\hk2\osgi-resource-locator\1.0.1\osgi-resource-locator-1.0.1.jar;C:\Users\Administrator\.m2\repository\org\glassfish\jersey\core\jersey-server\2.22.2\jersey-server-2.22.2.jar;C:\Users\Administrator\.m2\repository\org\glassfish\jersey\media\jersey-media-jaxb\2.22.2\jersey-media-jaxb-2.22.2.jar;C:\Users\Administrator\.m2\repository\javax\validation\validation-api\1.1.0.Final\validation-api-1.1.0.Final.jar;C:\Users\Administrator\.m2\repository\org\glassfish\jersey\containers\jersey-container-servlet\2.22.2\jersey-container-servlet-2.22.2.jar;C:\Users\Administrator\.m2\repository\org\glassfish\jersey\containers\jersey-container-servlet-core\2.22.2\jersey-container-servlet-core-2.22.2.jar;C:\Users\Administrator\.m2\repository\io\netty\netty-all\4.1.42.Final\netty-all-4.1.42.Final.jar;C:\Users\Administrator\.m2\repository\io\netty\netty\3.9.9.Final\netty-3.9.9.Final.jar;C:\Users\Administrator\.m2\repository\com\clearspring\analytics\stream\2.7.0\stream-2.7.0.jar;C:\Users\Administrator\.m2\repository\io\dropwizard\metrics\metrics-core\3.1.5\metrics-core-3.1.5.jar;C:\Users\Administrator\.m2\repository\io\dropwizard\metrics\metrics-jvm\3.1.5\metrics-jvm-3.1.5.jar;C:\Users\Administrator\.m2\repository\io\dropwizard\metrics\metrics-json\3.1.5\metrics-json-3.1.5.jar;C:\Users\Administrator\.m2\repository\io\dropwizard\metrics\metrics-graphite\3.1.5\metrics-graphite-3.1.5.jar;C:\Users\Administrator\.m2\repository\com\fasterxml\jackson\core\jackson-databind\2.6.7.3\jackson-databind-2.6.7.3.jar;C:\Users\Administrator\.m2\repository\com\fasterxml\jackson\module\jackson-module-scala_2.12\2.6.7.1\jackson-module-scala_2.12-2.6.7.1.jar;C:\Users\Administrator\.m2\repository\org\scala-lang\scala-reflect\2.12.1\scala-reflect-2.12.1.jar;C:\Users\Administrator\.m2\repository\com\fasterxml\jackson\module\jackson-module-paranamer\2.7.9\jackson-module-paranamer-2.7.9.jar;C:\Users\Administrator\.m2\repository\org\apache\ivy\ivy\2.4.0\ivy-2.4.0.jar;C:\Users\Administrator\.m2\repository\oro\oro\2.0.8\oro-2.0.8.jar;C:\Users\Administrator\.m2\repository\net\razorvine\pyrolite\4.13\pyrolite-4.13.jar;C:\Users\Administrator\.m2\repository\net\sf\py4j\py4j\0.10.7\py4j-0.10.7.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-tags_2.12\2.4.5\spark-tags_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\org\apache\commons\commons-crypto\1.0.0\commons-crypto-1.0.0.jar;C:\Users\Administrator\.m2\repository\org\spark-project\spark\unused\1.0.0\unused-1.0.0.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-sql_2.12\2.4.5\spark-sql_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\com\univocity\univocity-parsers\2.7.3\univocity-parsers-2.7.3.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-sketch_2.12\2.4.5\spark-sketch_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-catalyst_2.12\2.4.5\spark-catalyst_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\org\scala-lang\modules\scala-parser-combinators_2.12\1.1.0\scala-parser-combinators_2.12-1.1.0.jar;C:\Users\Administrator\.m2\repository\org\codehaus\janino\janino\3.0.9\janino-3.0.9.jar;C:\Users\Administrator\.m2\repository\org\codehaus\janino\commons-compiler\3.0.9\commons-compiler-3.0.9.jar;C:\Users\Administrator\.m2\repository\org\antlr\antlr4-runtime\4.7\antlr4-runtime-4.7.jar;C:\Users\Administrator\.m2\repository\org\apache\orc\orc-core\1.5.5\orc-core-1.5.5-nohive.jar;C:\Users\Administrator\.m2\repository\org\apache\orc\orc-shims\1.5.5\orc-shims-1.5.5.jar;C:\Users\Administrator\.m2\repository\com\google\protobuf\protobuf-java\2.5.0\protobuf-java-2.5.0.jar;C:\Users\Administrator\.m2\repository\commons-lang\commons-lang\2.6\commons-lang-2.6.jar;C:\Users\Administrator\.m2\repository\io\airlift\aircompressor\0.10\aircompressor-0.10.jar;C:\Users\Administrator\.m2\repository\org\apache\orc\orc-mapreduce\1.5.5\orc-mapreduce-1.5.5-nohive.jar;C:\Users\Administrator\.m2\repository\org\apache\parquet\parquet-column\1.10.1\parquet-column-1.10.1.jar;C:\Users\Administrator\.m2\repository\org\apache\parquet\parquet-common\1.10.1\parquet-common-1.10.1.jar;C:\Users\Administrator\.m2\repository\org\apache\parquet\parquet-encoding\1.10.1\parquet-encoding-1.10.1.jar;C:\Users\Administrator\.m2\repository\org\apache\parquet\parquet-hadoop\1.10.1\parquet-hadoop-1.10.1.jar;C:\Users\Administrator\.m2\repository\org\apache\parquet\parquet-format\2.4.0\parquet-format-2.4.0.jar;C:\Users\Administrator\.m2\repository\org\apache\parquet\parquet-jackson\1.10.1\parquet-jackson-1.10.1.jar;C:\Users\Administrator\.m2\repository\org\apache\arrow\arrow-vector\0.10.0\arrow-vector-0.10.0.jar;C:\Users\Administrator\.m2\repository\org\apache\arrow\arrow-format\0.10.0\arrow-format-0.10.0.jar;C:\Users\Administrator\.m2\repository\org\apache\arrow\arrow-memory\0.10.0\arrow-memory-0.10.0.jar;C:\Users\Administrator\.m2\repository\com\carrotsearch\hppc\0.7.2\hppc-0.7.2.jar;C:\Users\Administrator\.m2\repository\com\vlkan\flatbuffers\1.2.0-3f79e055\flatbuffers-1.2.0-3f79e055.jar;C:\Users\Administrator\.m2\repository\joda-time\joda-time\2.9.7\joda-time-2.9.7.jar;C:\Users\Administrator\.m2\repository\mysql\mysql-connector-java\5.1.44\mysql-connector-java-5.1.44.jar;C:\Users\Administrator\.m2\repository\org\apache\spark\spark-hive_2.12\2.4.5\spark-hive_2.12-2.4.5.jar;C:\Users\Administrator\.m2\repository\com\twitter\parquet-hadoop-bundle\1.6.0\parquet-hadoop-bundle-1.6.0.jar;C:\Users\Administrator\.m2\repository\org\spark-project\hive\hive-exec\1.2.1.spark2\hive-exec-1.2.1.spark2.jar;C:\Users\Administrator\.m2\repository\commons-io\commons-io\2.4\commons-io-2.4.jar;C:\Users\Administrator\.m2\repository\javolution\javolution\5.5.1\javolution-5.5.1.jar;C:\Users\Administrator\.m2\repository\log4j\apache-log4j-extras\1.2.17\apache-log4j-extras-1.2.17.jar;C:\Users\Administrator\.m2\repository\org\antlr\antlr-runtime\3.4\antlr-runtime-3.4.jar;C:\Users\Administrator\.m2\repository\org\antlr\stringtemplate\3.2.1\stringtemplate-3.2.1.jar;C:\Users\Administrator\.m2\repository\antlr\antlr\2.7.7\antlr-2.7.7.jar;C:\Users\Administrator\.m2\repository\org\antlr\ST4\4.0.4\ST4-4.0.4.jar;C:\Users\Administrator\.m2\repository\com\googlecode\javaewah\JavaEWAH\0.3.2\JavaEWAH-0.3.2.jar;C:\Users\Administrator\.m2\repository\org\iq80\snappy\snappy\0.2\snappy-0.2.jar;C:\Users\Administrator\.m2\repository\stax\stax-api\1.0.1\stax-api-1.0.1.jar;C:\Users\Administrator\.m2\repository\net\sf\opencsv\opencsv\2.3\opencsv-2.3.jar;C:\Users\Administrator\.m2\repository\org\spark-project\hive\hive-metastore\1.2.1.spark2\hive-metastore-1.2.1.spark2.jar;C:\Users\Administrator\.m2\repository\com\jolbox\bonecp\0.8.0.RELEASE\bonecp-0.8.0.RELEASE.jar;C:\Users\Administrator\.m2\repository\commons-cli\commons-cli\1.2\commons-cli-1.2.jar;C:\Users\Administrator\.m2\repository\commons-logging\commons-logging\1.1.3\commons-logging-1.1.3.jar;C:\Users\Administrator\.m2\repository\org\datanucleus\datanucleus-api-jdo\3.2.6\datanucleus-api-jdo-3.2.6.jar;C:\Users\Administrator\.m2\repository\org\datanucleus\datanucleus-rdbms\3.2.9\datanucleus-rdbms-3.2.9.jar;C:\Users\Administrator\.m2\repository\commons-pool\commons-pool\1.5.4\commons-pool-1.5.4.jar;C:\Users\Administrator\.m2\repository\commons-dbcp\commons-dbcp\1.4\commons-dbcp-1.4.jar;C:\Users\Administrator\.m2\repository\javax\jdo\jdo-api\3.0.1\jdo-api-3.0.1.jar;C:\Users\Administrator\.m2\repository\javax\transaction\jta\1.1\jta-1.1.jar;C:\Users\Administrator\.m2\repository\commons-httpclient\commons-httpclient\3.1\commons-httpclient-3.1.jar;C:\Users\Administrator\.m2\repository\org\apache\calcite\calcite-avatica\1.2.0-incubating\calcite-avatica-1.2.0-incubating.jar;C:\Users\Administrator\.m2\repository\org\apache\calcite\calcite-core\1.2.0-incubating\calcite-core-1.2.0-incubating.jar;C:\Users\Administrator\.m2\repository\org\apache\calcite\calcite-linq4j\1.2.0-incubating\calcite-linq4j-1.2.0-incubating.jar;C:\Users\Administrator\.m2\repository\net\hydromatic\eigenbase-properties\1.1.5\eigenbase-properties-1.1.5.jar;C:\Users\Administrator\.m2\repository\org\apache\httpcomponents\httpclient\4.5.6\httpclient-4.5.6.jar;C:\Users\Administrator\.m2\repository\org\apache\httpcomponents\httpcore\4.4.10\httpcore-4.4.10.jar;C:\Users\Administrator\.m2\repository\org\codehaus\jackson\jackson-mapper-asl\1.9.13\jackson-mapper-asl-1.9.13.jar;C:\Users\Administrator\.m2\repository\commons-codec\commons-codec\1.10\commons-codec-1.10.jar;C:\Users\Administrator\.m2\repository\org\jodd\jodd-core\3.5.2\jodd-core-3.5.2.jar;C:\Users\Administrator\.m2\repository\org\datanucleus\datanucleus-core\3.2.10\datanucleus-core-3.2.10.jar;C:\Users\Administrator\.m2\repository\org\apache\thrift\libthrift\0.9.3\libthrift-0.9.3.jar;C:\Users\Administrator\.m2\repository\org\apache\thrift\libfb303\0.9.3\libfb303-0.9.3.jar;C:\Users\Administrator\.m2\repository\org\apache\derby\derby\10.12.1.1\derby-10.12.1.1.jar cn.yanqi.sparksql.TransformationDemo
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
+---------+------+------+
| Job|minsal|maxsal|
+---------+------+------+
| ANALYST| 3000| 3000|
| MANAGER| 2450| 2975|
|PRESIDENT| 5000| 5000|
+---------+------+------+
+---+------+---+
|sno| sname|age|
+---+------+---+
| 1| Alice| 18|
| 2| Andy| 19|
| 3| Bob| 17|
| 4|Justin| 21|
| 5| Cindy| 20|
+---+------+---+
Process finished with exit code 0
===============================END===============================
Walter Savage Landor:strove with none,for none was worth my strife.Nature I loved and, next to Nature, Art:I warm'd both hands before the fire of life.It sinks, and I am ready to depart ——W.S.Landor
来自为知笔记(Wiz)
标签:V05,v05,13,12,22,02,01,Spark,null 来源: https://www.cnblogs.com/yanqivip/p/16134671.html