Data Analysis from Scratch with a Kaggle Project: Titanic (Part 5)

Author: Internet

Data Analysis from Scratch with a Kaggle Project: Titanic, Chapter 2, Section 2.1

# title: "Kaggle Titanic project 2__2.1"
# author: "小鱼"
# date: "2021-12-17"
import pandas as pd
import numpy as np
df = pd.read_csv("train.csv")
# Count the missing values in each feature
df.isna().sum()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Viewing specific columns

# View the Age, Cabin, and Embarked columns
df[['Age','Cabin','Embarked']].head(6)
    Age Cabin Embarked
0  22.0   NaN        S
1  38.0   C85        C
2  26.0   NaN        S
3  35.0  C123        S
4  35.0   NaN        S
5   NaN   NaN        Q

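Handling missing values

# Summary of missing-value handling. There are three common options:
# option 1: drop the samples (rows) that contain missing values
# option 2: drop the columns (features) that contain missing values
# option 3: fill the missing values with some value (0, the mean, the median, etc.)

# df.dropna()   # drop missing values
# df.fillna()   # fill missing values
# df.isna()     # test for missing values
# df.notna()    # test for non-missing values
# NaN never compares equal to None, so this mask is all False and the line changes nothing
df[df['Age']==None]=0
# Caution: this assigns 0 to every column of the rows where Age is missing, not only to Age.
# The outputs below reflect that. To fill Age alone, use: df.loc[df['Age'].isna(), 'Age'] = 0
df[df['Age'].isna()] = 0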
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        362 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
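The difference between assigning to whole rows and filling a single column is easy to miss, so here is a minimal sketch on a made-up frame. The name `toy` and its values are assumptions for illustration, not data from train.csv, and the median is just one of the fill values mentioned in option 3.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few numeric columns of train.csv
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, np.nan, 38.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Row-wise assignment: every column of the matching rows becomes 0
row_wise = toy.copy()
row_wise[row_wise["Age"].isna()] = 0

# Column-wise fill: only Age changes, the other columns keep their values
col_wise = toy.copy()
col_wise["Age"] = col_wise["Age"].fillna(col_wise["Age"].median())

print(row_wise)
print(col_wise)
```

In the row-wise version the fares and classes of the affected passengers are lost, which is why later steps in this walkthrough see rows made entirely of zeros.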
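# DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
# axis:
#   axis=0: drop rows that contain missing values
#   axis=1: drop columns that contain missing values
# how: used together with axis
#   how='any': drop the row or column if any value is missing
#   how='all': drop the row or column only if all of its values are missing
# thresh: keep a row or column only if it has at least thresh non-missing values
#   e.g. axis=0, thresh=10 drops every row with fewer than 10 non-missing values
# subset: list
#   the columns in which to look for missing values
# inplace: whether to modify the original data. If True, returns None; otherwise returns a new copy with the missing values dropped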
df.isna().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          529
Embarked         2
dtype: int64
# Drop the rows that contain missing values
# df1 = df.dropna(axis = 0)
# df1.isna().sum()

# Only look at the specified columns
df1 = df.dropna(subset=['Cabin', 'Embarked'])
df1.isna().sum()
df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 360 entries, 1 to 889
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  360 non-null    int64  
 1   Survived     360 non-null    int64  
 2   Pclass       360 non-null    int64  
 3   Name         360 non-null    object 
 4   Sex          360 non-null    object 
 5   Age          360 non-null    float64
 6   SibSp        360 non-null    int64  
 7   Parch        360 non-null    int64  
 8   Ticket       360 non-null    object 
 9   Fare         360 non-null    float64
 10  Cabin        360 non-null    object 
 11  Embarked     360 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 36.6+ KB
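To make the `how`, `thresh`, and `subset` parameters described above concrete, the following sketch applies each of them to a small made-up frame. The frame `toy` is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the rows have 2, 3, 1, and 0 non-missing values
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

any_missing = toy.dropna()                     # default how='any'
all_missing = toy.dropna(how="all")            # drop only fully empty rows
at_least_two = toy.dropna(thresh=2)            # keep rows with >= 2 values
by_embarked = toy.dropna(subset=["Embarked"])  # check only the Embarked column
dense_columns = toy.dropna(axis=1, thresh=3)   # keep columns with >= 3 values

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(at_least_two.index.tolist())
print(by_embarked.index.tolist())
print(dense_columns.columns.tolist())
```

`how` and `thresh` are used separately here; recent pandas versions reject a call that sets both.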
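# DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
# value: scalar, dict, Series, or DataFrame
#   a dict specifies which fill value to use for each column
# method: {'backfill', 'bfill', 'pad', 'ffill', None}, default None
#   (deprecated in recent pandas releases in favor of df.ffill() / df.bfill())
#   applied along each column by default
#   ffill / pad: fill a missing value with the previous valid value
#   backfill / bfill: fill a missing value with the next valid value
# limit: maximum number of missing values to fill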
df.isna().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          529
Embarked         2
dtype: int64
# Replace all missing values with 0
df2 = df.fillna(value=0)
df2.isna().sum()
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
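Filling everything with 0 is the bluntest choice. As a sketch of the other options listed above (a per-column dict, forward fill, backward fill, and `limit`), consider the made-up frame below. The fill value "S" for Embarked is an arbitrary assumption, and `ffill()` / `bfill()` are used in place of the `method` keyword.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in a numeric and a text column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, 35.0],
    "Embarked": ["S", np.nan, "C", np.nan],
})

# dict: one fill value per column (column mean for Age, a fixed label for Embarked)
per_column = toy.fillna({"Age": toy["Age"].mean(), "Embarked": "S"})

# Forward fill, but fill at most one consecutive gap per column
forward = toy.ffill(limit=1)

# Backward fill: a trailing gap has no later value and stays missing
backward = toy.bfill()

print(per_column)
print(forward)
print(backward)
```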

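Detecting duplicates

# Detect duplicate rows
df3 = df[df.duplicated()]  # with no arguments, a row counts as a duplicate only if every column matches an earlier row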
df3
     PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
17             0         0       0     0    0  0.0      0      0       0   0.0      0         0
19             0         0       0     0    0  0.0      0      0       0   0.0      0         0
26             0         0       0     0    0  0.0      0      0       0   0.0      0         0
28             0         0       0     0    0  0.0      0      0       0   0.0      0         0
29             0         0       0     0    0  0.0      0      0       0   0.0      0         0
..           ...       ...     ...   ...  ...  ...    ...    ...     ...   ...    ...       ...
859            0         0       0     0    0  0.0      0      0       0   0.0      0         0
863            0         0       0     0    0  0.0      0      0       0   0.0      0         0
868            0         0       0     0    0  0.0      0      0       0   0.0      0         0
878            0         0       0     0    0  0.0      0      0       0   0.0      0         0
888            0         0       0     0    0  0.0      0      0       0   0.0      0         0

176 rows × 12 columns

# Handle the duplicates
df3 = df.drop_duplicates()  # drop records whose values match in every column
df3

df4 = df.drop_duplicates(['Age','Parch'])    # drop records whose values match in the specified columns
df4
     PassengerId  Survived  Pclass  Name                                               Sex       Age  SibSp  Parch  Ticket               Fare  Cabin  Embarked
0              1         0       3  Braund, Mr. Owen Harris                            male    22.00      1      0  A/5 21171          7.2500  NaN    S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.00      1      0  PC 17599          71.2833  C85    C
2              3         1       3  Heikkinen, Miss. Laina                             female  26.00      0      0  STON/O2. 3101282   7.9250  NaN    S
3              4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.00      1      0  113803            53.1000  C123   S
5              0         0       0  0                                                  0        0.00      0      0  0                  0.0000  0      0
...
831          832         1       2  Richards, Master. George Sibley                    male     0.83      1      1  29106             18.7500  NaN    S
843          844         0       3  Lemberopolous, Mr. Peter L                         male    34.50      0      0  2683               6.4375  NaN    C
851          852         0       3  Svensson, Mr. Johan                                male    74.00      0      0  347060             7.7750  NaN    S
871          872         1       1  Beckwith, Mrs. Richard Leonard (Sallie Monypeny)   female  47.00      1      1  11751             52.5542  D35    S
879          880         1       1  Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)      female  56.00      0      1  11767             83.1583  C50    C

177 rows × 12 columns
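The effect of the `subset` argument, and of the `keep` argument that the calls above leave at its default, can be seen on a small made-up frame. The name `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame: rows 3 and 4 are identical, rows 0 and 1 share Age and Parch
toy = pd.DataFrame({
    "Age": [22.0, 22.0, 22.0, 35.0, 35.0],
    "Parch": [0, 0, 1, 0, 0],
    "Fare": [7.25, 9.00, 8.05, 7.25, 7.25],
})

full_dup = toy.duplicated()                            # all columns must match
subset_dup = toy.duplicated(subset=["Age", "Parch"])   # only these columns must match
every_copy = toy.duplicated(keep=False)                # flag the first occurrence too
deduped = toy.drop_duplicates(subset=["Age", "Parch"]) # keeps the first row of each group

print(full_dup.tolist())
print(subset_dup.tolist())
print(every_copy.tolist())
print(deduped.index.tolist())
```

With the default `keep='first'`, the first occurrence is never flagged, which is why a frame with 177 identical rows reports 176 duplicates.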

df.to_csv('test_clear.csv')

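Binning

# Feature inspection and processing
# Numeric features can usually be used for model training directly, but continuous variables are sometimes discretized to make the model more stable and robust. Text features usually have to be converted to numeric features before they can be used for modeling.
# Binning: discretizing continuous data
# Split the continuous variable Age into 5 equal-width bands and label them 1-5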
df['AgeBand'] = pd.cut(df['Age'], 5 ,labels=[1,2,3,4,5])
df
     PassengerId  Survived  Pclass  Name                                               Sex      Age  SibSp  Parch  Ticket               Fare  Cabin  Embarked  AgeBand
0              1         0       3  Braund, Mr. Owen Harris                            male    22.0      1      0  A/5 21171          7.2500  NaN    S               2
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  C85    C               3
2              3         1       3  Heikkinen, Miss. Laina                             female  26.0      0      0  STON/O2. 3101282   7.9250  NaN    S               2
3              4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  113803            53.1000  C123   S               3
4              5         0       3  Allen, Mr. William Henry                           male    35.0      0      0  373450             8.0500  NaN    S               3
...
886          887         0       2  Montvila, Rev. Juozas                              male    27.0      0      0  211536            13.0000  NaN    S               2
887          888         1       1  Graham, Miss. Margaret Edith                       female  19.0      0      0  112053            30.0000  B42    S               2
888            0         0       0  0                                                  0        0.0      0      0  0                  0.0000  0      0               1
889          890         1       1  Behr, Mr. Karl Howell                              male    26.0      0      0  111369            30.0000  C148   C               2
890          891         0       3  Dooley, Mr. Patrick                                male    32.0      0      0  370376             7.7500  NaN    Q               2

891 rows × 13 columns

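# Split the continuous variable Age into the five bands (0,5] (5,15] (15,30] (30,50] (50,80] and label them 1-5
# The intervals are left-open, so an Age of exactly 0 falls outside every band and becomes NaN
df['AgeBand1'] = pd.cut(df['Age'],[0,5,15,30,50,80],labels = [1,2,3,4,5])
df

# Split Age at the 10%, 30%, 50%, 70%, and 90% quantiles and label the bands 1-5
# df['AgeBand2'] = pd.qcut(df['Age'],[0,0.1,0.3,0.5,0.7,0.9],labels = [1,2,3,4,5])
# df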
     PassengerId  Survived  Pclass  Name                                               Sex      Age  SibSp  Parch  Ticket               Fare  Cabin  Embarked  AgeBand  AgeBand1
0              1         0       3  Braund, Mr. Owen Harris                            male    22.0      1      0  A/5 21171          7.2500  NaN    S               2         3
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  C85    C               3         4
2              3         1       3  Heikkinen, Miss. Laina                             female  26.0      0      0  STON/O2. 3101282   7.9250  NaN    S               2         3
3              4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  113803            53.1000  C123   S               3         4
4              5         0       3  Allen, Mr. William Henry                           male    35.0      0      0  373450             8.0500  NaN    S               3         4
...
886          887         0       2  Montvila, Rev. Juozas                              male    27.0      0      0  211536            13.0000  NaN    S               2         3
887          888         1       1  Graham, Miss. Margaret Edith                       female  19.0      0      0  112053            30.0000  B42    S               2         3
888            0         0       0  0                                                  0        0.0      0      0  0                  0.0000  0      0               1       NaN
889          890         1       1  Behr, Mr. Karl Howell                              male    26.0      0      0  111369            30.0000  C148   C               2         3
890          891         0       3  Dooley, Mr. Patrick                                male    32.0      0      0  370376             7.7500  NaN    Q               2         4

891 rows × 14 columns
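The three binning styles used in this section (equal-width bins, explicit edges, and quantile bins) behave differently at the boundaries. The sketch below compares them on a made-up series; `ages` and the choice of three quantile bins are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ages, including the boundary values 0 and 80
ages = pd.Series([0.0, 4.0, 15.0, 22.0, 38.0, 80.0])

# Equal-width: the range 0..80 is cut into 5 bins that are each 16 wide
equal_width = pd.cut(ages, 5, labels=[1, 2, 3, 4, 5])

# Explicit edges: intervals are (left, right], so 0 itself is not covered
custom = pd.cut(ages, [0, 5, 15, 30, 50, 80], labels=[1, 2, 3, 4, 5])

# include_lowest=True closes the first interval on the left
custom_closed = pd.cut(ages, [0, 5, 15, 30, 50, 80],
                       labels=[1, 2, 3, 4, 5], include_lowest=True)

# Quantile bins: each bin receives roughly the same number of rows
by_quantile = pd.qcut(ages, 3, labels=[1, 2, 3])

print(equal_width.tolist())
print(custom.tolist())
print(custom_closed.tolist())
print(by_quantile.tolist())
```

One caveat for `pd.qcut`: when many rows share one value (such as an Age column padded with zeros), several quantile edges can coincide, and `qcut` then raises an error unless `duplicates='drop'` is passed.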

# Convert text variables
# Inspect the categories of a text variable and their counts
df['Sex'].value_counts()     # value_counts()
male      453
female    261
0         177
Name: Sex, dtype: int64
df['Sex'].unique()    # distinct values
df['Sex'].nunique()   # number of distinct values
3
# Convert category text to numeric codes
df['Sex_num'] = df['Sex'].replace(['male','female'],[1,2])   # method 1: replace
df['Sex_num'].value_counts() 
1    453
2    261
0    177
Name: Sex_num, dtype: int64
df['Sex_num'] = df['Sex'].map({'male': 1, 'female': 2})  # method 2: map (values missing from the dict, here 0, become NaN)
df['Sex_num'].value_counts() 
1.0    453
2.0    261
Name: Sex_num, dtype: int64
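The two outputs above differ because `replace` leaves unlisted values alone while `map` turns them into NaN. A minimal sketch on a made-up series (the name `sex` and its values are assumptions) shows the difference and one way to reconcile it:

```python
import pandas as pd

# Hypothetical Sex column that also contains a stray 0
sex = pd.Series(["male", "female", 0, "male"])

# replace: only the listed values change, the 0 passes through untouched
replaced = sex.replace(["male", "female"], [1, 2])

# map: every value absent from the dict becomes NaN, so the result is float
mapped = sex.map({"male": 1, "female": 2})

# Filling the gaps afterwards restores an integer column
mapped_filled = mapped.fillna(0).astype(int)

print(replaced.tolist())
print(mapped.tolist())
print(mapped_filled.tolist())
```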
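from sklearn.preprocessing import LabelEncoder  # method 3: LabelEncoder from sklearn.preprocessing
for feat in ['Cabin', 'Ticket']:
    lbl = LabelEncoder()
    # Manual alternative: number the categories in order of first appearance.
    # Its result is overwritten by the LabelEncoder call below.
    label_dict = dict(zip(df[feat].unique(), range(df[feat].nunique())))
    df[feat + "_labelEncode"] = df[feat].map(label_dict)
    # LabelEncoder numbers the string values in sorted order; NaN becomes the string 'nan'
    df[feat + "_labelEncode"] = lbl.fit_transform(df[feat].astype(str))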
df.head(5)
   PassengerId  Survived  Pclass  Name                                               Sex      Age  SibSp  Parch  Ticket               Fare  Cabin  Embarked  AgeBand  AgeBand1  Sex_num  Cabin_labelEncode  Ticket_labelEncode
0            1         0       3  Braund, Mr. Owen Harris                            male    22.0      1      0  A/5 21171          7.2500  NaN    S               2         3      1.0                135                 409
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  C85    C               3         4      2.0                 74                 472
2            3         1       3  Heikkinen, Miss. Laina                             female  26.0      0      0  STON/O2. 3101282   7.9250  NaN    S               2         3      2.0                135                 533
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  113803            53.1000  C123   S               3         4      2.0                 50                  41
4            5         0       3  Allen, Mr. William Henry                           male    35.0      0      0  373450             8.0500  NaN    S               3         4      1.0                135                 374
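The loop above computes two different encodings and keeps only the second. The sketch below reproduces the two numbering schemes with pandas alone, so it is an analogue rather than the sklearn code itself: `pd.factorize` numbers values in order of first appearance (like the manual dict), and category codes follow sorted order (which, as far as I know, matches how `LabelEncoder` orders its classes). The series `cabin` is made up, with the string "nan" standing in for missing cabins after `astype(str)`.

```python
import pandas as pd

# Hypothetical Cabin values after astype(str)
cabin = pd.Series(["C85", "nan", "C123", "nan", "B42"])

# Order of first appearance: C85 -> 0, nan -> 1, C123 -> 2, B42 -> 3
codes_appearance, uniques = pd.factorize(cabin)

# Sorted order: B42 -> 0, C123 -> 1, C85 -> 2, nan -> 3
codes_sorted = cabin.astype("category").cat.codes

print(list(codes_appearance))
print(codes_sorted.tolist())
```

The sorted scheme explains why "nan" receives the largest code in the table above: lowercase letters sort after the uppercase cabin prefixes.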
# Convert features to one-hot encoding.
# Note: the loop below encodes Age and Embarked; raw Age yields one column per distinct age value.

for feat in ["Age", "Embarked"]:             # one-hot encoding via pd.get_dummies
#     x = pd.get_dummies(df["Age"] // 6)
#     x = pd.get_dummies(pd.cut(df['Age'],5))
    x = pd.get_dummies(df[feat], prefix=feat)
    df = pd.concat([df, x], axis=1)
    #df[feat] = pd.get_dummies(df[feat], prefix=feat)

df.head(5)
   PassengerId  Survived  Pclass  Name                                               Sex      Age  SibSp  Parch  Ticket               Fare  ...  Age_66.0  Age_70.0  Age_70.5  Age_71.0  Age_74.0  Age_80.0  Embarked_0  Embarked_C  Embarked_Q  Embarked_S
0            1         0       3  Braund, Mr. Owen Harris                            male    22.0      1      0  A/5 21171          7.2500  ...         0         0         0         0         0         0           0           0           0           1
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  ...         0         0         0         0         0         0           0           1           0           0
2            3         1       3  Heikkinen, Miss. Laina                             female  26.0      0      0  STON/O2. 3101282   7.9250  ...         0         0         0         0         0         0           0           0           0           1
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  113803            53.1000  ...         0         0         0         0         0         0           0           0           0           1
4            5         0       3  Allen, Mr. William Henry                           male    35.0      0      0  373450             8.0500  ...         0         0         0         0         0         0           0           0           0           1

5 rows × 110 columns
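One-hot encoding raw Age is what pushes the frame to 110 columns. A common alternative, sketched here under assumptions (the frame `toy`, the band edges, and the band names are all made up), is to bin the ages first and encode the bands:

```python
import pandas as pd

# Hypothetical frame with a continuous and a categorical feature
toy = pd.DataFrame({
    "Age": [4.0, 22.0, 38.0, 64.0],
    "Embarked": ["S", "C", "S", "Q"],
})

# Discretize first, so the one-hot step creates one column per band
toy["AgeBand"] = pd.cut(toy["Age"], [0, 15, 30, 50, 80],
                        labels=["child", "young", "adult", "senior"])

# columns= replaces the listed columns with prefixed indicator columns
encoded = pd.get_dummies(toy, columns=["AgeBand", "Embarked"])

print(encoded.columns.tolist())
print(encoded.shape)
```

Here four ages produce four band columns regardless of how many distinct ages the data contains. The indicator columns are boolean in recent pandas versions and 0/1 integers in older ones.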

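Extracting features with regular expressions

# Extract a Title feature from the free-text Name feature (titles are Mr, Miss, Mrs, etc.)
# Series.str.extract(pat, flags=0, expand=True)
df['Title'] = df.Name.str.extract(r'([A-Za-z]+)\.', expand=False)   # str.extract() with a regular expression can handle strings that mix digits, symbols, and letters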
df.head(5)
   PassengerId  Survived  Pclass  Name                                               Sex      Age  SibSp  Parch  Ticket               Fare  ...  Age_70.0  Age_70.5  Age_71.0  Age_74.0  Age_80.0  Embarked_0  Embarked_C  Embarked_Q  Embarked_S  Title
0            1         0       3  Braund, Mr. Owen Harris                            male    22.0      1      0  A/5 21171          7.2500  ...         0         0         0         0         0           0           0           0           1  Mr
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  ...         0         0         0         0         0           0           1           0           0  Mrs
2            3         1       3  Heikkinen, Miss. Laina                             female  26.0      0      0  STON/O2. 3101282   7.9250  ...         0         0         0         0         0           0           0           0           1  Miss
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  113803            53.1000  ...         0         0         0         0         0           0           0           0           1  Mrs
4            5         0       3  Allen, Mr. William Henry                           male    35.0      0      0  373450             8.0500  ...         0         0         0         0         0           0           0           0           1  Mr

5 rows × 111 columns
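To see what the pattern matches, the sketch below runs the same expression on a few names taken from the tables above, plus a "0" entry standing in for the zeroed rows (that entry is an assumption about how such rows behave). The pattern captures the first run of letters that is immediately followed by a period.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Richards, Master. George Sibley",
    "0",  # stand-in for a row with no real name
])

# ([A-Za-z]+) captures letters only; \. requires a literal period right after
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

print(titles.tolist())
print(titles.value_counts())
```

Names with no match yield NaN, so `value_counts()` on the extracted column is a quick way to spot rare or missing titles.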

df.to_csv('test_fin.csv')

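This chapter has four sections. Section 2.1 covers data cleaning and feature processing: handling missing and duplicate values, discretizing continuous data, converting categorical text, and extracting features with regular expressions.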

Source: https://blog.csdn.net/weixin_45058606/article/details/122003899