Getting Started with Data Mining from Scratch: Used-Car Price Prediction (Baseline)

2022-12-31 17:40:23 Source: Internet

Data format

The training dataset has the following features:

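name - vehicle code
regDate - vehicle registration date
model - model code
brand - brand
bodyType - body type
fuelType - fuel type
gearbox - gearbox (transmission)
power - engine power
kilometer - kilometers driven
notRepairedDamage - whether the vehicle has unrepaired damage
regionCode - code of the region where the vehicle can be viewed
seller - seller
offerType - offer type
creatDate - date the ad was posted
price - vehicle price (target column)
'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14' - embedding vectors derived from a large amount of information such as reviews and tags of the vehicle (artificially constructed anonymous features)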

The competition uses MAE (mean absolute error) as the evaluation metric.
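As a quick reminder of what the metric measures, here is a minimal sketch of MAE written with NumPy. The helper name `mean_absolute_error` and the prices are made up for illustration and are not taken from the competition data.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of |y_true - y_pred| over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up prices, for illustration only
y_true = [1850, 3600, 6222, 2400]
y_pred = [2000, 3500, 6000, 2500]

mae = mean_absolute_error(y_true, y_pred)
print(mae)
```

Because MAE is expressed in the same unit as the target, a value here reads directly as the average pricing error per car.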

	`import pandas as pd`
	`import numpy as np`
	`import matplotlib.pyplot as plt`
	`import seaborn as sns`
	`import missingno as msno`
	`import scipy.stats as st`
	`import warnings`
	`warnings.filterwarnings('ignore')`
	`# Make Chinese characters render correctly in plots`
	`plt.rcParams['font.sans-serif'] = ['SimHei']`
	`plt.rcParams['axes.unicode_minus'] = False`

First, read in the data:

train_data = pd.read_csv("used_car_train_20200313.csv", sep = " ")

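Opening the file in Excel shows that each row sits in a single cell, with the values separated by spaces. That is why sep must be set to a space here for the data to be read correctly.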
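The effect of the separator can be seen on a tiny in-memory sample. This is only a sketch: the three columns and their values are invented, and `io.StringIO` stands in for the real file.

```python
import io
import pandas as pd

# Tiny made-up sample that mimics a space-separated file
raw = "SaleID name price\n0 736 1850\n1 2262 3600\n"

# With the default comma separator, every row collapses into one column
wrong = pd.read_csv(io.StringIO(raw))

# With an explicit space separator, the columns are split as intended
right = pd.read_csv(io.StringIO(raw), sep=" ")

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

If the shape after loading shows a single column, the separator is the first thing to check.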

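Take a look at the first and last five rows (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):

pd.concat([train_data.head(5), train_data.tail(5)])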

Now let's start analyzing the data, beginning with the column names.

train_data.columns.values

	`array(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType',`
	`'fuelType', 'gearbox', 'power', 'kilometer', 'notRepairedDamage',`
	`'regionCode', 'seller', 'offerType', 'creatDate', 'price', 'v_0',`
	`'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9',`
	`'v_10', 'v_11', 'v_12', 'v_13', 'v_14'], dtype=object)`

These are the columns in the data. As a first step, we can explore the data type and values of each feature.

train_data.info()

	`<class 'pandas.core.frame.DataFrame'>`
	`RangeIndex: 150000 entries, 0 to 149999`
	`Data columns (total 31 columns):`
	`# Column Non-Null Count Dtype`
	`--- ------ -------------- -----`
	`0 SaleID 150000 non-null int64`
	`1 name 150000 non-null int64`
	`2 regDate 150000 non-null int64`
	`3 model 149999 non-null float64`
	`4 brand 150000 non-null int64`
	`5 bodyType 145494 non-null float64`
	`6 fuelType 141320 non-null float64`
	`7 gearbox 144019 non-null float64`
	`8 power 150000 non-null int64`
	`9 kilometer 150000 non-null float64`
	`10 notRepairedDamage 150000 non-null object`
	`11 regionCode 150000 non-null int64`
	`12 seller 150000 non-null int64`
	`13 offerType 150000 non-null int64`
	`14 creatDate 150000 non-null int64`
	`15 price 150000 non-null int64`
	`16 v_0 150000 non-null float64`
	`17 v_1 150000 non-null float64`
	`18 v_2 150000 non-null float64`
	`19 v_3 150000 non-null float64`
	`20 v_4 150000 non-null float64`
	`21 v_5 150000 non-null float64`
	`22 v_6 150000 non-null float64`
	`23 v_7 150000 non-null float64`
	`24 v_8 150000 non-null float64`
	`25 v_9 150000 non-null float64`
	`26 v_10 150000 non-null float64`
	`27 v_11 150000 non-null float64`
	`28 v_12 150000 non-null float64`
	`29 v_13 150000 non-null float64`
	`30 v_14 150000 non-null float64`
	`dtypes: float64(20), int64(10), object(1)`
	`memory usage: 35.5+ MB`
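In the info() output above, notRepairedDamage is the only column with dtype object. A common cause is a non-numeric placeholder mixed in with numeric codes. The sketch below shows one way to inspect and convert such a column on a toy frame. The "-" placeholder and the toy values are assumptions for illustration, so check the real column with value_counts() before applying anything like this.

```python
import pandas as pd

# Hypothetical toy column: numeric codes stored as text plus a "-" placeholder
toy = pd.DataFrame({"notRepairedDamage": ["0.0", "-", "1.0", "0.0"]})

# Inspect the distinct values to spot anything non-numeric
counts = toy["notRepairedDamage"].value_counts()
print(counts)

# Convert to numbers; values that cannot be parsed become NaN
cleaned = pd.to_numeric(toy["notRepairedDamage"], errors="coerce")
print(cleaned.dtype)
print(cleaned.isna().sum())
```

Turning the placeholder into NaN lets the column be treated like the other features with missing values.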

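Every column is int or float except notRepairedDamage, which is object. Some features also have missing values: model, bodyType, fuelType and gearbox all have fewer than 150000 non-null entries, so handling them is an important part of the later processing. Next, check the missing values: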

train_data.isnull().sum()

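Some features have a considerable number of missing values. Going by the non-null counts from info(), fuelType is missing 8680 values, gearbox 5981, bodyType 4506 and model 1. These need to be handled.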
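To make the missing-value check easier to read, the counts can be filtered and sorted, with a ratio alongside. This is a minimal sketch on a made-up frame. With the real data, `toy` would be replaced by `train_data`.

```python
import numpy as np
import pandas as pd

# Made-up frame, for illustration only
toy = pd.DataFrame({
    "model": [1.0, 2.0, np.nan, 4.0],
    "bodyType": [np.nan, 1.0, np.nan, 2.0],
    "power": [60, 75, 110, 90],
})

# Count missing values per column, keep only columns that have any,
# and sort so the worst columns come first
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

# Add the share of rows that are missing in each column
summary = pd.DataFrame({"missing": missing, "ratio": missing / len(toy)})
print(summary)
```

The ratio helps when deciding between filling a column and dropping it, since a handful of missing rows and a mostly empty column call for different treatment.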