Pandas数据分析----缺失值统计与分析
作者:互联网
工具包导入
(9月29号(组内)–数据分析)
import pandas as pd
print(pd.__version__)
1.2.4
数据载入
data1 = pd.read_csv('./datasets1/location_object.csv')
print(data1.head(5))
TIME K1-1 K1-2 K1-3 K1-4 K1-5 K1-6 K2-1 K2-2 K2-3 ... \
0 2020/7/19 0:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
1 2020/7/19 1:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
2 2020/7/19 2:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
3 2020/7/19 3:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
4 2020/7/19 4:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
K4-5 K4-6 K4-7 K4-8 K5-1 K5-2 K5-3 K5-4 K5-5 K5-6
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 39 columns]
数据概览
data1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 624 entries, 0 to 623
Data columns (total 39 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TIME 624 non-null object
1 K1-1 81 non-null float64
2 K1-2 43 non-null float64
3 K1-3 74 non-null float64
4 K1-4 43 non-null float64
5 K1-5 66 non-null float64
6 K1-6 19 non-null float64
7 K2-1 31 non-null float64
8 K2-2 34 non-null float64
9 K2-3 130 non-null float64
10 K2-4 119 non-null float64
11 K2-5 77 non-null float64
12 K2-6 105 non-null float64
13 K2-7 59 non-null float64
14 K2-8 24 non-null float64
15 K3-1-1 49 non-null float64
16 K3-1-2 22 non-null float64
17 K3-2 91 non-null float64
18 K3-3 78 non-null float64
19 K3-4 85 non-null float64
20 K3-5 102 non-null float64
21 K3-6 79 non-null float64
22 K3-7 90 non-null float64
23 K3-8-1 15 non-null float64
24 K3-8-2 13 non-null float64
25 K4-1 47 non-null float64
26 K4-2 63 non-null float64
27 K4-3 128 non-null float64
28 K4-4 61 non-null float64
29 K4-5 58 non-null float64
30 K4-6 96 non-null float64
31 K4-7 28 non-null float64
32 K4-8 41 non-null float64
33 K5-1 54 non-null float64
34 K5-2 72 non-null float64
35 K5-3 65 non-null float64
36 K5-4 51 non-null float64
37 K5-5 68 non-null float64
38 K5-6 30 non-null float64
dtypes: float64(38), object(1)
memory usage: 190.2+ KB
# 数据统计描述
data1.describe()
K1-1 | K1-2 | K1-3 | K1-4 | K1-5 | K1-6 | K2-1 | K2-2 | K2-3 | K2-4 | ... | K4-5 | K4-6 | K4-7 | K4-8 | K5-1 | K5-2 | K5-3 | K5-4 | K5-5 | K5-6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 81.0 | 43.0 | 74.0 | 43.0 | 66.0 | 19.0 | 31.0 | 34.0 | 130.0 | 119.0 | ... | 58.0 | 96.0 | 28.0 | 41.0 | 54.0 | 72.0 | 65.0 | 51.0 | 68.0 | 30.0 |
mean | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
std | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
25% | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
50% | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
75% | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
8 rows × 38 columns
缺失值统计
# 缺失值可视化
import missingno as msn
msn.matrix(data1)
<matplotlib.axes._subplots.AxesSubplot at 0x268bd314a08>
统计每一列的缺失值个数
# 非空值统计,求取每个列的非空值个数
print(data1.count())
TIME 624
K1-1 81
K1-2 43
K1-3 74
K1-4 43
K1-5 66
K1-6 19
K2-1 31
K2-2 34
K2-3 130
K2-4 119
K2-5 77
K2-6 105
K2-7 59
K2-8 24
K3-1-1 49
K3-1-2 22
K3-2 91
K3-3 78
K3-4 85
K3-5 102
K3-6 79
K3-7 90
K3-8-1 15
K3-8-2 13
K4-1 47
K4-2 63
K4-3 128
K4-4 61
K4-5 58
K4-6 96
K4-7 28
K4-8 41
K5-1 54
K5-2 72
K5-3 65
K5-4 51
K5-5 68
K5-6 30
dtype: int64
# 上述统计形式等价于
print(data1.count(axis=0))
TIME 624
K1-1 81
K1-2 43
K1-3 74
K1-4 43
K1-5 66
K1-6 19
K2-1 31
K2-2 34
K2-3 130
K2-4 119
K2-5 77
K2-6 105
K2-7 59
K2-8 24
K3-1-1 49
K3-1-2 22
K3-2 91
K3-3 78
K3-4 85
K3-5 102
K3-6 79
K3-7 90
K3-8-1 15
K3-8-2 13
K4-1 47
K4-2 63
K4-3 128
K4-4 61
K4-5 58
K4-6 96
K4-7 28
K4-8 41
K5-1 54
K5-2 72
K5-3 65
K5-4 51
K5-5 68
K5-6 30
dtype: int64
统计某一列的缺失值个数
print(data1[['K5-6']].count())
K5-6 30
dtype: int64
统计多个列的缺失值个数
print(data1[['K5-6', 'K5-5']].count())
K5-6 30
K5-5 68
dtype: int64
统计每一行的缺失值个数
# 求取每一行的缺失值个数
print(data1.count(axis=1))
0 1
1 1
2 2
3 2
4 1
..
619 1
620 1
621 1
622 1
623 1
Length: 624, dtype: int64
统计某一行的缺失值个数
print(data1.iloc[[0]].count())
1
统计多行的缺失值个数
print(data1.iloc[[0, 1]].count())
TIME 2
K1-1 0
K1-2 0
K1-3 0
K1-4 0
K1-5 0
K1-6 0
K2-1 0
K2-2 0
K2-3 0
K2-4 0
K2-5 0
K2-6 0
K2-7 0
K2-8 0
K3-1-1 0
K3-1-2 0
K3-2 0
K3-3 0
K3-4 0
K3-5 0
K3-6 0
K3-7 0
K3-8-1 0
K3-8-2 0
K4-1 0
K4-2 0
K4-3 0
K4-4 0
K4-5 0
K4-6 0
K4-7 0
K4-8 0
K5-1 0
K5-2 0
K5-3 0
K5-4 0
K5-5 0
K5-6 0
dtype: int64
对每一列进行求和
print(data1.sum())
TIME 2020/7/19 0:002020/7/19 1:002020/7/19 2:002020...
K1-1 81.0
K1-2 43.0
K1-3 74.0
K1-4 43.0
K1-5 66.0
K1-6 19.0
K2-1 31.0
K2-2 34.0
K2-3 130.0
K2-4 119.0
K2-5 77.0
K2-6 105.0
K2-7 59.0
K2-8 24.0
K3-1-1 49.0
K3-1-2 22.0
K3-2 91.0
K3-3 78.0
K3-4 85.0
K3-5 102.0
K3-6 79.0
K3-7 90.0
K3-8-1 15.0
K3-8-2 13.0
K4-1 47.0
K4-2 63.0
K4-3 128.0
K4-4 61.0
K4-5 58.0
K4-6 96.0
K4-7 28.0
K4-8 41.0
K5-1 54.0
K5-2 72.0
K5-3 65.0
K5-4 51.0
K5-5 68.0
K5-6 30.0
dtype: object
对每一行进行求和
print(data1.sum(axis=1))
0 0.0
1 0.0
2 1.0
3 1.0
4 0.0
...
619 0.0
620 0.0
621 0.0
622 0.0
623 0.0
Length: 624, dtype: float64
对单独的一行或一列进行操作
# 对某一列进行求和
print(data1['K5-6'].sum())
30.0
# 对某一行进行求和
print(data1.iloc[[0]].sum())
TIME 2020/7/19 0:00
K1-1 0.0
K1-2 0.0
K1-3 0.0
K1-4 0.0
K1-5 0.0
K1-6 0.0
K2-1 0.0
K2-2 0.0
K2-3 0.0
K2-4 0.0
K2-5 0.0
K2-6 0.0
K2-7 0.0
K2-8 0.0
K3-1-1 0.0
K3-1-2 0.0
K3-2 0.0
K3-3 0.0
K3-4 0.0
K3-5 0.0
K3-6 0.0
K3-7 0.0
K3-8-1 0.0
K3-8-2 0.0
K4-1 0.0
K4-2 0.0
K4-3 0.0
K4-4 0.0
K4-5 0.0
K4-6 0.0
K4-7 0.0
K4-8 0.0
K5-1 0.0
K5-2 0.0
K5-3 0.0
K5-4 0.0
K5-5 0.0
K5-6 0.0
dtype: object
对多个行或多个列进行操作
# 对多个列求和
print(data1[['K5-6', 'K5-5']].sum())
K5-6 30.0
K5-5 68.0
dtype: float64
# 对多行进行求和
print(data1.iloc[[0, 1]].sum())
TIME 2020/7/19 0:002020/7/19 1:00
K1-1 0.0
K1-2 0.0
K1-3 0.0
K1-4 0.0
K1-5 0.0
K1-6 0.0
K2-1 0.0
K2-2 0.0
K2-3 0.0
K2-4 0.0
K2-5 0.0
K2-6 0.0
K2-7 0.0
K2-8 0.0
K3-1-1 0.0
K3-1-2 0.0
K3-2 0.0
K3-3 0.0
K3-4 0.0
K3-5 0.0
K3-6 0.0
K3-7 0.0
K3-8-1 0.0
K3-8-2 0.0
K4-1 0.0
K4-2 0.0
K4-3 0.0
K4-4 0.0
K4-5 0.0
K4-6 0.0
K4-7 0.0
K4-8 0.0
K5-1 0.0
K5-2 0.0
K5-3 0.0
K5-4 0.0
K5-5 0.0
K5-6 0.0
dtype: object
可视化分析
import matplotlib.pyplot as plt
plt.figure(figsize=(24, 7))
plt.rcParams['font.family'] = 'SimHei'
plt.bar(data1.columns, list(data1.count(axis=0)), width=1.5)
plt.title('非空值个数统计')
plt.show()
import plotly as py
import plotly.graph_objs as go
pyplt = py.offline.plot
# Trace
trace_basic = [go.Bar(
x = data1.columns,
y = list(data1.count(axis=0)),
)]
# Layout
layout_basic = go.Layout(
title = '非空值个数统计')
# Figure
figure_basic = go.Figure(data = trace_basic, layout = layout_basic)
# Plot
pyplt(figure_basic, filename='./1.html')
'./1.html'
标签:数据分析,K3,0.0,NaN,----,K2,K5,K4,Pandas 来源: https://blog.csdn.net/AIHUBEI/article/details/120556955