
Ensemble Learning: Steam Volume Prediction (DataWhale, Round 2)


Ensemble learning case 2: steam volume prediction

Background

The basic principle of thermal power generation is: burning fuel heats water into steam, the steam pressure drives a turbine, and the turbine drives a generator to produce electricity. In this chain of energy conversions, the key factor in generation efficiency is the boiler's combustion efficiency, i.e., how effectively fuel combustion produces high-temperature, high-pressure steam. Many factors influence combustion efficiency, including the boiler's adjustable parameters (fuel feed rate, primary and secondary air, induced draft, return-feed air, feedwater volume) and its operating conditions (bed temperature and pressure, furnace temperature and pressure, superheater temperature, etc.). The task is to use this information about the boiler's operating conditions to predict the amount of steam produced.

This case therefore uses the industrial indicators above as features to predict steam volume. For information-security reasons, the data are desensitized boiler sensor readings (sampled at minute granularity).

Data

The data are split into a training set (train.txt) and a test set (test.txt). The 38 fields "V0"–"V37" are the feature variables, and "target" is the target variable. We train a model on the training data and use it to predict the target variable of the test data.

Evaluation metric

The final evaluation metric is the mean squared error (MSE):

$$\mathrm{Score} = \frac{1}{n} \sum_{i=1}^{n} (y_i - y_i^{*})^2$$
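In code, this score is simply the mean of the squared residuals; a minimal NumPy sketch (the `mse` helper here is local to this example and is equivalent to sklearn's `mean_squared_error`):

```python
import numpy as np

def mse(y_true, y_pred):
    """Score = (1/n) * sum_i (y_i - y_i*)^2 -- the mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))
```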

import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns

# modelling stack
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score,cross_val_predict,KFold
from sklearn.metrics import make_scorer,mean_squared_error
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.svm import LinearSVR, SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor,AdaBoostRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import PolynomialFeatures,MinMaxScaler,StandardScaler
data_train = pd.read_csv('train.txt',sep = '\t')
data_test = pd.read_csv('test.txt',sep = '\t')
# merge the training and test data
data_train["oringin"]="train"
data_test["oringin"]="test"
data_all=pd.concat([data_train,data_test],axis=0,ignore_index=True)
# show the first 5 rows
data_all.head()
       V0     V1    V10    V11  ...     V8     V9  oringin  target
0   0.566  0.016 -0.940 -0.307  ... -0.436 -2.114    train   0.175
1   0.968  0.437  0.188 -0.455  ...  0.332 -2.114    train   0.676
2   1.013  0.568  0.874 -0.051  ...  0.396 -2.114    train   0.633
3   0.733  0.368  0.011  0.102  ...  0.403 -2.114    train   0.206
4   0.684  0.638 -0.251  0.570  ...  0.314 -2.114    train   0.384

5 rows × 40 columns

data_train.corr()
              V0        V1        V2  ...       V37    target
V0      1.000000  0.908607  0.463643  ... -0.494076  0.873212
V1      0.908607  1.000000  0.506514  ... -0.494043  0.871846
V2      0.463643  0.506514  1.000000  ... -0.734956  0.638878
...          ...       ...       ...  ...       ...       ...
V37    -0.494076 -0.494043 -0.734956  ...  1.000000 -0.565795
target  0.873212  0.871846  0.638878  ... -0.565795  1.000000

39 rows × 39 columns

Correlation coefficient methods

[Figure: summary of correlation coefficient statistics]

sns.heatmap(data_train.corr())
<matplotlib.axes._subplots.AxesSubplot at 0x220a753d888>

[Figure: correlation heatmap of data_train]

Exploring the data distribution

Since these are sensor readings, i.e., continuous variables, we use kdeplot (kernel density estimation plots) for an initial look at the data — i.e., EDA.

for column in data_all.columns[0:-2]:
    # Kernel density estimation (KDE) estimates an unknown probability density
    # non-parametrically; the KDE plot gives an intuitive view of how a sample is distributed.
    g = sns.kdeplot(data_all[column][(data_all["oringin"] == "train")], color="Red", shade = True)
    g = sns.kdeplot(data_all[column][(data_all["oringin"] == "test")], ax =g, color="Blue", shade= True)
    g.set_xlabel(column)
    g.set_ylabel("Frequency")
    g = g.legend(["train","test"])
    plt.show()

[Figures: train vs. test KDE plots for each feature]

The plots show that for features "V5", "V9", "V11", "V17", "V22", and "V28" the training-set and test-set distributions differ markedly, so we drop these features.
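The drop itself is a one-liner on data_all. A minimal sketch, using a small zero-filled stand-in frame since the real data cannot be loaded here:

```python
import pandas as pd

# Features whose train/test distributions diverge, per the KDE plots above
mismatch_cols = ["V5", "V9", "V11", "V17", "V22", "V28"]

# Toy stand-in for data_all; in the notebook this is the merged train/test frame
data_all = pd.DataFrame(0.0, index=range(3),
                        columns=[f"V{i}" for i in range(38)] + ["oringin", "target"])

data_all.drop(columns=mismatch_cols, inplace=True)
```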

Inspecting the correlations between features

data_train1=data_all[data_all["oringin"]=="train"].drop("oringin",axis=1)
plt.figure(figsize=(20, 16))  # set figure width and height
colnm = data_train1.columns.tolist()  # column names
mcorr = data_train1[colnm].corr(method="spearman")  # correlation matrix: pairwise correlations between variables
mask = np.zeros_like(mcorr, dtype=bool)  # boolean matrix with the same shape as mcorr (np.bool was removed in NumPy >= 1.24)
mask[np.triu_indices_from(mask)] = True  # mask the upper triangle (above the diagonal)
cmap = sns.diverging_palette(220, 10, as_cmap=True)  # diverging matplotlib colormap
g = sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt='0.2f')  # heatmap of pairwise correlations
plt.show()

[Figure: Spearman correlation heatmap of the training features]

threshold = 0.1
corr_matrix = data_train1.corr().abs()
drop_col=corr_matrix[corr_matrix["target"]<threshold].index
data_all.drop(drop_col,axis=1,inplace=True)

Dimensionality reduction: drop every feature whose absolute correlation with the target falls below the threshold.

Normalisation

cols_numeric=list(data_all.columns)
cols_numeric.remove("oringin")
def scale_minmax(col):
    return (col-col.min())/(col.max()-col.min())
scale_cols = [col for col in cols_numeric if col!='target']
data_all[scale_cols] = data_all[scale_cols].apply(scale_minmax,axis=0)
data_all[scale_cols].describe()
                V0           V1          V10  ...           V8           V9
count  4813.000000  4813.000000  4813.000000  ...  4813.000000  4813.000000
mean      0.694172     0.721357     0.348518  ...     0.715607     0.879536
std       0.144198     0.131443     0.134882  ...     0.118105     0.068244
min       0.000000     0.000000     0.000000  ...     0.000000     0.000000
25%       0.626676     0.679416     0.284327  ...     0.664934     0.852903
50%       0.729488     0.752497     0.366469  ...     0.742884     0.882377
75%       0.790195     0.799553     0.432965  ...     0.790835     0.941189
max       1.000000     1.000000     1.000000  ...     1.000000     1.000000

8 rows × 31 columns

Model building and ensemble learning

Building the training and test sets
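One way to rebuild the two sets is to filter on the "oringin" flag that was added when the frames were concatenated. A sketch, with a small toy frame standing in for the real data_all:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged frame; in the notebook data_all carries the real features
data_all = pd.DataFrame({
    "V0": np.linspace(0, 1, 6),
    "oringin": ["train"] * 4 + ["test"] * 2,
    "target": [0.1, 0.2, 0.3, 0.4, np.nan, np.nan],
})

# Split back on the 'oringin' flag
df_train = data_all[data_all["oringin"] == "train"]
df_test = data_all[data_all["oringin"] == "test"]

X_train = df_train.drop(["oringin", "target"], axis=1)
y_train = df_train["target"]
X_test = df_test.drop(["oringin", "target"], axis=1)
```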

RMSE/MSE evaluation functions
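A plausible shape for these helpers, wrapping sklearn's `mean_squared_error` and `make_scorer` (both already imported above); `greater_is_better=False` because both metrics are losses:

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

def mse(y_true, y_pred):
    """Mean squared error (the competition score)."""
    return mean_squared_error(y_true, y_pred)

# Scorers usable with GridSearchCV / cross_val_score; lower is better,
# so sklearn negates them internally
rmse_scorer = make_scorer(rmse, greater_is_better=False)
mse_scorer = make_scorer(mse, greater_is_better=False)
```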

Finding and removing outliers
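A common recipe for this step (one possible reading, not necessarily the notebook's exact code): fit a simple model, compute standardized residuals, and drop every point that lies more than `sigma` standard deviations out. Sketch on synthetic data with one injected outlier:

```python
import numpy as np
from sklearn.linear_model import Ridge

def find_outliers(model, X, y, sigma=3):
    """Flag points whose residual from a fitted model exceeds sigma std devs."""
    model.fit(X, y)
    resid = y - model.predict(X)
    z = (resid - resid.mean()) / resid.std()
    return np.abs(z) > sigma

# Toy data with one gross outlier at index 10 (stand-in for X_train/y_train)
rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.01 * rng.randn(50)
y[10] += 100.0  # inject an outlier

mask = find_outliers(Ridge(), X, y, sigma=3)
X_clean, y_clean = X[~mask], y[~mask]
```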

Ridge regression
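Ridge adds an L2 penalty to least squares, with the regularisation strength `alpha` typically tuned by cross-validation. A self-contained sketch on synthetic data (a stand-in for the cleaned training set; the alpha grid is illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the real training set
rng = np.random.RandomState(42)
X = rng.rand(100, 5)
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.randn(100)

# Tune alpha by 5-fold CV on negative MSE (sklearn maximises scores)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(Ridge(), param_grid, cv=5,
                    scoring="neg_mean_squared_error")
grid.fit(X, y)
best_ridge = grid.best_estimator_
```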

Model testing
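A minimal testing loop in the spirit of this section: hold out a validation split, fit a few of the regressors imported earlier, and blend their predictions by simple averaging — one basic ensembling strategy. The data and hyperparameters below are illustrative stand-ins, not the notebook's tuned values:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data standing in for the cleaned training set
rng = np.random.RandomState(7)
X = rng.rand(200, 5)
y = X @ np.array([2.0, -1.0, 0.5, 1.5, 0.0]) + 0.1 * rng.randn(200)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=7)

models = {"ridge": Ridge(alpha=1.0),
          "lasso": Lasso(alpha=0.001),
          "gbdt": GradientBoostingRegressor(random_state=7)}

preds = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    preds[name] = model.predict(X_val)

# Simple ensemble: average the individual predictions
blend = np.mean(list(preds.values()), axis=0)
blend_mse = mean_squared_error(y_val, blend)
```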

Source: https://blog.csdn.net/m0_57446978/article/details/119278701