数据库
首页 > 数据库> > Hive学习小记-(14)如何写SQL求出中位数平均数和众数(count 之外的方法)

Hive学习小记-(14)如何写SQL求出中位数平均数和众数(count 之外的方法)

作者:互联网

平均数中位数众数

平均数、中位数、众数都是度量一组数据集中趋势的统计量。所谓集中趋势是指一组数据向某一中心值靠拢的倾向,测度集中趋势就是寻找数据一般水平的代表值或中心值。而这三个特征数又各有特点,能够从不同的角度提供信息。

平均数


中位数


众数


需求描述

一道SQL题:如何SQL求出中位数平均数和众数(count 之外的方法)

创建样例数据

import pyspark
from pyspark.sql import SparkSession

sc=SparkSession.builder.master("local")\
    .appName('first_name1')\
    .config('spark.executor.memory','2g')\
    .config('spark.driver.memory','2g')\
    .enableHiveSupport()\
    .getOrCreate()

sc.sql(''' drop  table test_youhua.test_avg_medium_freq ''')
sc.sql(''' CREATE TABLE if not exists test_youhua.test_avg_medium_freq(name string,income int) ''')
sc.sql(''' INSERT into test_youhua.test_avg_medium_freq VALUES ('桑普森', '400000'),('迈克', '30000'),('怀特', '20000'),('阿诺德', '20000')
,('史密斯', '20000'),('劳伦斯', '15000'),('哈德逊', '15000'),('肯特', '10000'),('贝克', '10000'),('斯科特', '10000') ''')
sc.sql(''' select * from test_youhua.test_avg_medium_freq ''').show()
+----+------+
|name|income|
+----+------+
| 桑普森|400000|
|  迈克| 30000|
|  怀特| 20000|
| 阿诺德| 20000|
| 史密斯| 20000|
| 劳伦斯| 15000|
| 哈德逊| 15000|
|  肯特| 10000|
|  贝克| 10000|
| 斯科特| 10000|
+----+------+

求均值(人均薪资):

#1.avg() 因为存在一个员工,多条薪资记录的情况,所以需要先分组统计
sc.sql(''' SELECT AVG(a.income) FROM (
           SELECT SUM(income) AS income FROM test_youhua.test_avg_medium_freq
           GROUP BY name
           ) AS a ''').show()
# 2.sum/人数
sc.sql(''' SELECT SUM(income)/COUNT(DISTINCT name) AS avg_income
FROM test_youhua.test_avg_medium_freq ''').show()
+-----------+
|avg(income)|
+-----------+
|    55000.0|
+-----------+

求中位数

1.用笛卡尔积,然后判断哪条元素位于中间位置,取中间位置元素均值

#1、问题显示如下所示:
#Use the CROSS JOIN syntax to allow cartesian products between these relation
#2、原因:
#Spark 2.x版本中默认不支持笛卡尔积操作
#3、解决方案:
#通过参数spark.sql.crossJoin.enabled开启,方式如下:
sc.conf.set("spark.sql.crossJoin.enabled", "true")
sc.sql(""" select avg(tmp.income) from (  --如果是奇数直接取,若是偶数取平均
--先做笛卡尔积,计算每条数据对应的上半部分(大于该条)、下半部分(小于该条)两个子集,求其交集(即中间位置元素),这时若聚合数据的数目是奇数,最后得一条,偶数得两条)
             select t1.income    
             from test_youhua.test_avg_medium_freq t1,test_youhua.test_avg_medium_freq t2
             group by t1.income
             having sum(case when t1.income>=t2.income then 1 else 0 end)>=count(*)/2
                and sum(case when t2.income>=t1.income then 1 else 0 end)>=count(*)/2
           ) as tmp
           """).show()
+-----------+
|avg(income)|
+-----------+
|    17500.0|
+-----------+

2.用percentage函数

参考:https://www.jianshu.com/p/57129421ee85

参考:https://blog.csdn.net/qq_33637730/article/details/109066665

在hive环境中,可以使用percentile(BIGINT col, p)来查找中位数,但该函数中的列只能使用整型,我们也可以使用percentile_approx()来近似中位数

percentile_approx还有一种形式percentile_approx(col, p,B),参数B控制内存消耗的近似精度,B越大,结果的精度越高。默认值为10000。当col字段中的distinct值的个数小于B时,结果就为准确的百分位数  

sc.sql(""" select percentile(income,0.5) from test_youhua.test_avg_medium_freq""").show()
+---------------------------------------+
|percentile(income, CAST(0.5 AS DOUBLE))|
+---------------------------------------+
|                                17500.0|
+---------------------------------------+

3.用row_number

sc.sql(""" select avg(income) from ( 
             select income,
             row_number() over( order by income ) num,
             count(*) over( ) cnt
             from test_youhua.test_avg_medium_freq 
            ) as tmp
            --如果是奇数,取排序中间的,如果是偶数,取两个中间的均值
         where if(cnt%2=0,num in(cnt/2,cnt/2+1),num=(cnt+1)/2)""").show()
+-----------+
|avg(income)|
+-----------+
|    17500.0|
+-----------+

求众数

参考:https://www.cnblogs.com/tgzhu/p/9946628.html

 1.用having +count(max)

#HIve没有all,any的用法,只能用max来定位出现次数最多的了
#https://issues.apache.org/jira/browse/HIVE-15229
#1. 'LIKE ANY' operator return true if a text(column value) matches to any pattern.
#2. 'LIKE ALL' operator return true if a text(column value) matches to all patterns.
#3. 'LIKE ANY' and 'LIKE ALL' returns NULL not only if the expression on the left hand side is NULL, but also if one of the pattern in the list is NULL.
#sc.sql("""SELECT income from test_youhua.test_avg_medium_freq 
#          group by income
#          where count(*)>=all(select count(*) from test_youhua.test_avg_medium_freq group by income) """).show()
sc.sql("""SELECT income from test_youhua.test_avg_medium_freq 
          group by income
          having count(*)>=(select max(num) from (select count(*) as num from test_youhua.test_avg_medium_freq group by income)) """).show()
+------+
|income|
+------+
| 10000|
| 20000|
+------+

 

 

 

标签:count,medium,14,众数,youhua,income,test,freq,avg
来源: https://www.cnblogs.com/foolangirl/p/14310103.html