其他分享
首页 > 其他分享> > (踏上有意义的旅程)数据探索--找出规律揭示奥秘

(踏上有意义的旅程)数据探索--找出规律揭示奥秘

作者:互联网

文章目录

1.数据集获取

1.1 数据集介绍
1.2 导入必要的工具包
1.3 数据读取

2.数据探索

2.1 查看标签的分布情况
2.2 查看缺失值
2.3 查看特征数据类型
2.4 异常点检测
2.5 相关性检验
2.6 年龄对还款的影响(探索)
2.7 外部数据源

文章正文

1. 数据集获取

1.1 数据集介绍

本次使用的数据集来源于Kaggle平台,由home credict提供,该公司的服务致力于向无银行账户的人群提供信贷。
**该数据集共8个文件**
我们只用到以下数据集:

1.2导入必要的工具包

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns',None)
plt.style.use('fivethirtyeight')

1.3数据读取

1.3.1 读取训练集
df_train=pd.read_csv('./application_train.csv')
df_train.head()
SK_ID_CURR	TARGET	NAME_CONTRACT_TYPE	CODE_GENDER	FLAG_OWN_CAR	FLAG_OWN_REALTY	CNT_CHILDREN	AMT_INCOME_TOTAL	AMT_CREDIT	AMT_ANNUITY	...	FLAG_DOCUMENT_18	FLAG_DOCUMENT_19	FLAG_DOCUMENT_20	FLAG_DOCUMENT_21	AMT_REQ_CREDIT_BUREAU_HOUR	AMT_REQ_CREDIT_BUREAU_DAY	AMT_REQ_CREDIT_BUREAU_WEEK	AMT_REQ_CREDIT_BUREAU_MON	AMT_REQ_CREDIT_BUREAU_QRT	AMT_REQ_CREDIT_BUREAU_YEAR
0	100002	1	Cash loans	M	N	Y	0	202500.0	406597.5	24700.5	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	1.0
1	100003	0	Cash loans	F	N	N	0	270000.0	1293502.5	35698.5	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	0.0
2	100004	0	Revolving loans	M	Y	Y	0	67500.0	135000.0	6750.0	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	0.0
3	100006	0	Cash loans	F	N	Y	0	135000.0	312682.5	29686.5	...	0	0	0	0	NaN	NaN	NaN	NaN	NaN	NaN
4	100007	0	Cash loans	M	N	Y	0	121500.0	513000.0	21865.5	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	0.0
5 rows × 122 columns
1.3.2 训练集数据整体概览
df_train.info()
# 输出信息
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
1.3.3 读取测试集
df_test=pd.read_csv('./application_test.CSV')
df_test.info()
# 输出信息
SK_ID_CURR	NAME_CONTRACT_TYPE	CODE_GENDER	FLAG_OWN_CAR	FLAG_OWN_REALTY	CNT_CHILDREN	AMT_INCOME_TOTAL	AMT_CREDIT	AMT_ANNUITY	AMT_GOODS_PRICE	...	FLAG_DOCUMENT_18	FLAG_DOCUMENT_19	FLAG_DOCUMENT_20	FLAG_DOCUMENT_21	AMT_REQ_CREDIT_BUREAU_HOUR	AMT_REQ_CREDIT_BUREAU_DAY	AMT_REQ_CREDIT_BUREAU_WEEK	AMT_REQ_CREDIT_BUREAU_MON	AMT_REQ_CREDIT_BUREAU_QRT	AMT_REQ_CREDIT_BUREAU_YEAR
0	100001	Cash loans	F	N	Y	0	135000.0	568800.0	20560.5	450000.0	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	0.0
1	100005	Cash loans	M	N	Y	0	99000.0	222768.0	17370.0	180000.0	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	3.0
2	100013	Cash loans	M	Y	Y	0	202500.0	663264.0	69777.0	630000.0	...	0	0	0	0	0.0	0.0	0.0	0.0	1.0	4.0
3	100028	Cash loans	F	N	Y	2	315000.0	1575000.0	49018.5	1575000.0	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	3.0
4	100038	Cash loans	M	Y	N	1	180000.0	625500.0	32067.0	625500.0	...	0	0	0	0	NaN	NaN	NaN	NaN	NaN	NaN
5 rows × 121 columns
1.3.4 测试集数据整体概览
df_test.info()
# 输出信息
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB

2. 数据探索

2.1 查看标签的分布情况

df_train['TARGET'].value_counts()
# 输出信息
0    282686
1     24825
Name: TARGET, dtype: int64
df_train['TARGET'].plot.hist()

标签分布情况

2.2 查看缺失值

def missing_values(df):
	missing_val=df.isnull().sum()
	missing_val_percent=100*df.isnull().sum()/len(df)
	missing_val_data=pd.concat([missing_val,missing_val_percent],axis=1)
	missing_val_data=missing_val_data.rename(columns={0:'Missing Values',1:'% of Total Values'})
	missing_val_data=missing_val_data[missing_val_data.iloc[:,1]!=0].sort_values('% of Total Values',ascending=False).round(1)
print('数据集有:'+str(df.shape[1])+'列,''其中'+str(missing_value_data.shape[0])+'列存在缺失值')
return missing_val_data
# 也可以实时编写一段:
missing_val=df_train.isnull().sum()
missing_val_percent=100*missing_val/len(df_train)
missing_val_data=pd.concat([missing_val,missing_val_percent],axis=1)
missing_val_data=missing_val_data.rename(columns={0:'Missing Values',1:'% of Total Values'})
missing_val_data=missing_val_data[missing_val_data.iloc[:,1]!=0].sort_values('% of Total Values',ascending=False).round(2)
# 输出结果:
	Missing Values	% of Total Values
COMMONAREA_MEDI	214865	69.87
COMMONAREA_AVG	214865	69.87
COMMONAREA_MODE	214865	69.87
NONLIVINGAPARTMENTS_MEDI	213514	69.43
NONLIVINGAPARTMENTS_MODE	213514	69.43
...	...	...
OBS_60_CNT_SOCIAL_CIRCLE	1021	0.33
DEF_60_CNT_SOCIAL_CIRCLE	1021	0.33
OBS_30_CNT_SOCIAL_CIRCLE	1021	0.33
EXT_SOURCE_2	660	0.21
AMT_GOODS_PRICE	278	0.09
64 rows × 2 columns
# 缺失值有多少列
missing_val_data.shape[0]
64
missing_val_data[missing_val_data['% of Total Values']>20]
# 输出结果
Missing Values	% of Total Values
COMMONAREA_MEDI	214865	69.87
COMMONAREA_AVG	214865	69.87
COMMONAREA_MODE	214865	69.87
NONLIVINGAPARTMENTS_MEDI	213514	69.43
NONLIVINGAPARTMENTS_MODE	213514	69.43
NONLIVINGAPARTMENTS_AVG	213514	69.43
FONDKAPREMONT_MODE	210295	68.39
LIVINGAPARTMENTS_MODE	210199	68.35
LIVINGAPARTMENTS_MEDI	210199	68.35
LIVINGAPARTMENTS_AVG	210199	68.35
FLOORSMIN_MODE	208642	67.85
FLOORSMIN_MEDI	208642	67.85
FLOORSMIN_AVG	208642	67.85
YEARS_BUILD_MODE	204488	66.50
YEARS_BUILD_MEDI	204488	66.50
YEARS_BUILD_AVG	204488	66.50
OWN_CAR_AGE	202929	65.99
LANDAREA_AVG	182590	59.38
LANDAREA_MEDI	182590	59.38
LANDAREA_MODE	182590	59.38
BASEMENTAREA_MEDI	179943	58.52
BASEMENTAREA_AVG	179943	58.52
BASEMENTAREA_MODE	179943	58.52
EXT_SOURCE_1	173378	56.38
NONLIVINGAREA_MEDI	169682	55.18
NONLIVINGAREA_MODE	169682	55.18
NONLIVINGAREA_AVG	169682	55.18
ELEVATORS_MEDI	163891	53.30
ELEVATORS_MODE	163891	53.30
ELEVATORS_AVG	163891	53.30
WALLSMATERIAL_MODE	156341	50.84
APARTMENTS_MODE	156061	50.75
APARTMENTS_MEDI	156061	50.75
APARTMENTS_AVG	156061	50.75
ENTRANCES_MODE	154828	50.35
ENTRANCES_AVG	154828	50.35
ENTRANCES_MEDI	154828	50.35
LIVINGAREA_MEDI	154350	50.19
LIVINGAREA_MODE	154350	50.19
LIVINGAREA_AVG	154350	50.19
HOUSETYPE_MODE	154297	50.18
FLOORSMAX_MEDI	153020	49.76
FLOORSMAX_AVG	153020	49.76
FLOORSMAX_MODE	153020	49.76
YEARS_BEGINEXPLUATATION_AVG	150007	48.78
YEARS_BEGINEXPLUATATION_MEDI	150007	48.78
YEARS_BEGINEXPLUATATION_MODE	150007	48.78
TOTALAREA_MODE	148431	48.27
EMERGENCYSTATE_MODE	145755	47.40
OCCUPATION_TYPE	96391	31.35

2.3 查看特征数据类型

df_train.dtypes
# 输出结果:
SK_ID_CURR                      int64
TARGET                          int64
NAME_CONTRACT_TYPE             object
CODE_GENDER                    object
FLAG_OWN_CAR                   object
                               ...   
AMT_REQ_CREDIT_BUREAU_DAY     float64
AMT_REQ_CREDIT_BUREAU_WEEK    float64
AMT_REQ_CREDIT_BUREAU_MON     float64
AMT_REQ_CREDIT_BUREAU_QRT     float64
AMT_REQ_CREDIT_BUREAU_YEAR    float64
Length: 122, dtype: object
df_train.dtypes.values()
# 输出结果:
float64    65
int64      41
object     16
dtype: int64
df_train.select_dtypes('object')
# 输出结果:
	NAME_CONTRACT_TYPE	CODE_GENDER	FLAG_OWN_CAR	FLAG_OWN_REALTY	NAME_TYPE_SUITE	NAME_INCOME_TYPE	NAME_EDUCATION_TYPE	NAME_FAMILY_STATUS	NAME_HOUSING_TYPE	OCCUPATION_TYPE	WEEKDAY_APPR_PROCESS_START	ORGANIZATION_TYPE	FONDKAPREMONT_MODE	HOUSETYPE_MODE	WALLSMATERIAL_MODE	EMERGENCYSTATE_MODE
0	Cash loans	M	N	Y	Unaccompanied	Working	Secondary / secondary special	Single / not married	House / apartment	Laborers	WEDNESDAY	Business Entity Type 3	reg oper account	block of flats	Stone, brick	No
1	Cash loans	F	N	N	Family	State servant	Higher education	Married	House / apartment	Core staff	MONDAY	School	reg oper account	block of flats	Block	No
2	Revolving loans	M	Y	Y	Unaccompanied	Working	Secondary / secondary special	Single / not married	House / apartment	Laborers	MONDAY	Government	NaN	NaN	NaN	NaN
3	Cash loans	F	N	Y	Unaccompanied	Working	Secondary / secondary special	Civil marriage	House / apartment	Laborers	WEDNESDAY	Business Entity Type 3	NaN	NaN	NaN	NaN
4	Cash loans	M	N	Y	Unaccompanied	Working	Secondary / secondary special
df_train.select_dtypes('object').apply(pd.Series.nunique,axis=0)
# 输出结果:
NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64

2.4 异常点检测

标签:val,旅程,missing,--,0.0,df,MODE,AMT,奥秘
来源: https://blog.csdn.net/weixin_42961082/article/details/113825176