其他分享
首页 > 其他分享> > (四)pandas开源课程学习笔记——数据清洗及特征处理

(四)pandas开源课程学习笔记——数据清洗及特征处理

作者:互联网

这里写目录标题

开始之前,导入numpy、pandas包和数据

import pandas as pd
import numpy as np
#加载数据train.csv
df = pd.read_csv("train.csv")
df.head(3)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS

2 第二章:数据清洗及特征处理

我们拿到的数据通常是不干净的,所谓的不干净,就是数据中有缺失值,有一些异常点等,需要经过一定的处理才能继续做后面的分析或建模,所以拿到数据的第一步是进行数据清洗,本章我们将学习缺失值、重复值、字符串和数据转换等操作,将数据清洗成可以分析或建模的亚子。

2.1 缺失值观察与处理

我们拿到的数据经常会有很多缺失值,比如我们可以看到Cabin列存在NaN,那其他列还有没有缺失值,这些缺失值要怎么处理呢

2.1.1 任务一:缺失值观察

(1) 请查看每个特征缺失值个数
(2) 请查看Age, Cabin, Embarked列的数据

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
df[['Age','Cabin','Embarked']].head(3)
AgeCabinEmbarked
022.0NaNS
138.0C85C
226.0NaNS

2.1.2 任务二:对缺失值进行处理

(1)处理缺失值一般有几种思路

答:常见的缺失值补全方法:均值插补、同类均值插补、建模预测、高维映射、多重插补、极大似然估计、压缩感知和矩阵补全。具体查看下放链接
机器学习中处理缺失值的9种方法https://zhuanlan.zhihu.com/p/270551105

(2) 请尝试对Age列的数据的缺失值进行处理

df[df['Age'] == None] = 0
df.head(8)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
5603Moran, Mr. JamesmaleNaN003308778.4583NaNQ
6701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
7803Palsson, Master. Gosta Leonardmale2.03134990921.0750NaNS
df[df['Age'].isnull()] = 0 
df.head(8)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
5000000.00000.000000
6701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
7803Palsson, Master. Gosta Leonardmale2.03134990921.0750NaNS
df[df['Age'] == np.nan] = 0
df.head(8)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
5000000.00000.000000
6701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
7803Palsson, Master. Gosta Leonardmale2.03134990921.0750NaNS

【思考】检索空缺值用np.nan,None以及.isnull()哪个更好,这是为什么?如果其中某个方式无法找到缺失值,原因又是为什么?

【回答】数值列读取数据后,空缺值的数据类型为float64所以用None一般索引不到,比较的时候最好用np.nan

(3) 请尝试使用不同的方法直接对整张表的缺失值进行处理

df.dropna().head(8)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
5000000.00000.000000
6701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
101113Sandstrom, Miss. Marguerite Rutfemale4.011PP 954916.7000G6S
111211Bonnell, Miss. Elizabethfemale58.00011378326.5500C103S
17000000.00000.000000
19000000.00000.000000
df.fillna(0).head(8)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.25000S
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.92500S
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.05000S
5000000.00000.000000
6701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
7803Palsson, Master. Gosta Leonardmale2.03134990921.07500S

【思考】dropna和fillna有哪些参数,分别如何使用呢?

(1)df.dropna()函数用于删除dataframe数据中的缺失数据,即 删除NaN数据

DataFrame.dropna(axis=0, how=‘any’, thresh=None, subset=None, inplace=False)

parameters说明
axis0为行 1为列,default 0,数据删除维度
how{‘any’, ‘all’}, default ‘any’,any:删除带有nan的行;all:删除全为nan的行
threshint,保留至少 int 个非nan行
subsetlist,在特定列缺失值处理
inplacebool,是否修改源文件

(2)df.fillna()函数用于填充确实数据

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

parameters说明
value用于填充孔的值(例如 0)
method{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, 无},默认无
axis0为行 1为列,沿其填充缺失值的轴
inplacebool,是否修改源文件
limit整数,默认无,如果指定了方法,则这是向前/向后填充的连续 NaN 值的最大数量
downcastdict,默认为 None

2.2 重复值观察与处理

由于这样那样的原因,数据中会不会存在重复值呢,如果存在要怎样处理呢

2.2.1 任务一:请查看数据中的重复值

df[df.duplicated()]
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
17000000.00000.000
19000000.00000.000
26000000.00000.000
28000000.00000.000
29000000.00000.000
31000000.00000.000
32000000.00000.000
36000000.00000.000
42000000.00000.000
45000000.00000.000
46000000.00000.000
47000000.00000.000
48000000.00000.000
55000000.00000.000
64000000.00000.000
65000000.00000.000
76000000.00000.000
77000000.00000.000
82000000.00000.000
87000000.00000.000
95000000.00000.000
101000000.00000.000
107000000.00000.000
109000000.00000.000
121000000.00000.000
126000000.00000.000
128000000.00000.000
140000000.00000.000
154000000.00000.000
158000000.00000.000
.......................................
718000000.00000.000
727000000.00000.000
732000000.00000.000
738000000.00000.000
739000000.00000.000
740000000.00000.000
760000000.00000.000
766000000.00000.000
768000000.00000.000
773000000.00000.000
776000000.00000.000
778000000.00000.000
783000000.00000.000
790000000.00000.000
792000000.00000.000
793000000.00000.000
815000000.00000.000
825000000.00000.000
826000000.00000.000
828000000.00000.000
832000000.00000.000
837000000.00000.000
839000000.00000.000
846000000.00000.000
849000000.00000.000
859000000.00000.000
863000000.00000.000
868000000.00000.000
878000000.00000.000
888000000.00000.000

176 rows × 12 columns

2.2.2 任务二:对重复值进行处理

df = df.drop_duplicates()
df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS

2.2.3 任务三:将前面清洗的数据保存为csv格式

df.to_csv('test_clear.csv')

2.3 特征观察与处理

我们对特征进行一下观察,可以把特征大概分为两大类:
数值型特征:Survived ,Pclass, Age ,SibSp, Parch, Fare,其中Survived, Pclass为离散型数值特征,Age,SibSp, Parch, Fare为连续型数值特征
文本型特征:Name, Sex, Cabin,Embarked, Ticket,其中Sex, Cabin, Embarked, Ticket为类别型文本特征,数值型特征一般可以直接用于模型的训练,但有时候为了模型的稳定性及鲁棒性会对连续变量进行离散化。文本型特征往往需要转换成数值型特征才能用于建模分析。

2.3.1 任务一:对年龄进行分箱(离散化)处理

(1) 分箱操作是什么?

答:是指将值划分到离散区间。好比不同大小的苹果归类到几个事先布置的箱子中;不同年龄的人划分到几个年龄段中

(2) 将连续变量Age平均分箱成5个年龄段,并分别用类别变量12345表示

df['AgeBand'] = pd.cut(df['Age'], 5,labels = [1,2,3,4,5])
df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedAgeBand
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS2
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C3
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS2
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S3
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS3

(3) 将连续变量Age划分为[0,5) [5,15) [15,30) [30,50) [50,80)五个年龄段,并分别用类别变量12345表示

df['AgeBand'] = pd.cut(df['Age'],[0,5,15,30,50,80], labels = [1,2,3,4,5])
df.head(3)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedAgeBand
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS3
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C4
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS3

(4) 将连续变量Age按10% 30% 50% 70% 90%五个年龄段,并用分类变量12345表示

df['AgeBand'] = pd.qcut(df['Age'],[0,0.1,0.3,0.5,0.7,0.9],labels = [1,2,3,4,5])
df.head(3)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedAgeBand
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS2
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C5
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS3

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html

2.3.2 任务二:对文本变量进行转换

(1) 查看文本变量名及种类

#方法一: value_counts
df['Sex'].value_counts()
male      453
female    261
0           1
Name: Sex, dtype: int64
#方法二: unique
df['Sex'].unique()
array(['male', 'female', 0], dtype=object)

(2) 将文本变量Sex, Cabin ,Embarked用数值变量12345表示

#方法一: replace
df['Sex_num'] = df['Sex'].replace(['male','female'],[1,2])
df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedAgeBandSex_num
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS21
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C52
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS32
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S42
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS41
#方法二: map
df['Sex_num'] = df['Sex'].map({'male': 1, 'female': 2})
df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedAgeBandSex_num
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS21.0
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C52.0
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS32.0
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S42.0
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS41.0
#方法三: 使用sklearn.preprocessing的LabelEncoder
from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin', 'Ticket']:
    lbl = LabelEncoder()  
    label_dict = dict(zip(df[feat].unique(), range(df[feat].nunique())))
    df[feat + "_labelEncode"] = df[feat].map(label_dict)
    df[feat + "_labelEncode"] = lbl.fit_transform(df[feat].astype(str))

df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedAgeBandSex_numCabin_labelEncodeTicket_labelEncode
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS21.0135409
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C52.074472
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS32.0135533
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S42.05041
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS41.0135374

(3) 将文本变量Sex, Cabin, Embarked用one-hot编码表示

#将类别文本转换为one-hot编码

#方法一: OneHotEncoder
for feat in ["Age", "Embarked"]:
#     x = pd.get_dummies(df["Age"] // 6)
#     x = pd.get_dummies(pd.cut(df['Age'],5))
    x = pd.get_dummies(df[feat], prefix=feat)
    df = pd.concat([df, x], axis=1)
    #df[feat] = pd.get_dummies(df[feat], prefix=feat)
df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFare...Age_66.0Age_70.0Age_70.5Age_71.0Age_74.0Age_80.0Embarked_0Embarked_CEmbarked_QEmbarked_S
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500...0000000001
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833...0000000100
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250...0000000001
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000...0000000001
4503Allen, Mr. William Henrymale35.0003734508.0500...0000000001

5 rows × 109 columns

2.3.3 任务三:从纯文本Name特征里提取出Titles的特征(所谓的Titles就是Mr,Miss,Mrs等)

df['Title'] = df.Name.str.extract('([A-Za-z]+)\.', expand=False)
df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFare...Age_70.0Age_70.5Age_71.0Age_74.0Age_80.0Embarked_0Embarked_CEmbarked_QEmbarked_STitle
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500...000000001Mr
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833...000000100Mrs
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250...000000001Miss
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000...000000001Mrs
4503Allen, Mr. William Henrymale35.0003734508.0500...000000001Mr

5 rows × 110 columns

#保存最终你完成的已经清理好的数据
df.to_csv('test_fin.csv')

标签:...,df,Age,笔记,Mrs,开源,Mr,Miss,pandas
来源: https://blog.csdn.net/qq_46378251/article/details/122771787