Predicting Account Prices for the Mobile Game 《率土之滨》 with Random Forest and XGBoost Regression

2021-09-13 20:33:41 Author: Internet

Preface

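The data used in this article are the accounts listed for sale on the 《率土之滨》 藏宝阁 marketplace in August 2020. The data collection process is described in the previous article: 使用python+Selenium动态爬取《率土之滨》藏宝阁账号信息_GreyLZ的博客-CSDN博客 (scraping 《率土之滨》 藏宝阁 account information with Python and Selenium, GreyLZ's blog on CSDN). The collected fields are account price, number of heroes, number of tactics, number of treasures, hero cards, number of collector's-edition (典藏) cards, and the advancement count of each hero card. Account price is the dependent variable and all other fields are independent variables. Number of heroes, number of tactics, number of treasures, and the hero-card advancement counts are discrete variables; each hero card is a 0-1 variable (1 if the account owns the hero, 0 otherwise). There are 5005 samples and 227 variables in total. The prediction methods are random forest and XGBoost.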

1. Data preprocessing

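Hero cards are converted into a 0-1 vector (1 if the account owns the hero, 0 otherwise), and hero advancement counts are converted into a matrix in which each hero's advancement count is capped at 5. (Tactic ownership is encoded in the same way as hero cards.)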

import json
import numpy as np

# Encode hero ownership for each account: owned = 1, not owned = 0
# Load the full hero list
with open('AllWJ.json', 'r') as AllWJ:
    ALLWJ = json.load(AllWJ)
# Hero ownership matrix (accounts x heroes); numAccount is the number of accounts
WJhave = np.zeros((numAccount, len(ALLWJ)))
# Hero advancement matrix
WJJJ = np.zeros((numAccount, len(ALLWJ)))
for i in range(numAccount):
    k = i + 1
    # Load this account's data (account files are numbered from 1)
    with open('data/account%d.json' % k, 'r') as Alist:
        ac = json.load(Alist)
    ac8 = np.array(ac[8])  # advancement count of each card
    wujiang = ac[9]        # hero name of each card owned
    for jj in range(len(ALLWJ)):
        if ALLWJ[jj] in wujiang:
            WJhave[i, jj] = 1
            # Positions of every copy of this hero in the account's card list
            index0 = [i2 for i2, x in enumerate(wujiang) if x == ALLWJ[jj]]
            # Total advancement over all copies, capped at 5
            WJJJ[i, jj] = min(sum(ac8[index0]), 5)
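
The encoding can be checked on a toy account. The sketch below is a pure-Python illustration rather than the author's code: the hero list, the card list, and the helper name `encode_account` are made up for the example. It assumes, as the loop above does, that the advancement counts of duplicate cards of the same hero are summed before being capped at 5.

```python
# Toy illustration of the ownership / advancement encoding (hypothetical data).
MAX_ADVANCE = 5  # cap on the advancement count per hero, as stated in the text


def encode_account(all_heroes, cards, advances):
    """Return (ownership, advancement) lists aligned with all_heroes.

    cards    -- hero name of each card the account holds (names may repeat)
    advances -- advancement count of each card, in the same order as cards
    """
    owned = []
    advanced = []
    for hero in all_heroes:
        # Advancement counts of every copy of this hero held by the account
        counts = [a for name, a in zip(cards, advances) if name == hero]
        owned.append(1 if counts else 0)
        advanced.append(min(sum(counts), MAX_ADVANCE))
    return owned, advanced


all_heroes = ["hero_a", "hero_b", "hero_c"]  # stand-in for ALLWJ
cards = ["hero_c", "hero_a", "hero_c"]       # two copies of hero_c
advances = [4, 0, 3]                         # the hero_c copies sum to 7

owned, advanced = encode_account(all_heroes, cards, advances)
print(owned)     # [1, 0, 1]
print(advanced)  # [0, 0, 5]
```

Each returned list corresponds to one row of the ownership and advancement matrices.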

Plot a histogram of the price distribution.

The histogram shows clearly that the dependent variable, price, is not normally distributed, so a Box-Cox transformation is applied.

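import numpy as np
from scipy import stats
# Kolmogorov-Smirnov normality test on price. kstest compares against N(0, 1)
# by default, so the sample mean and standard deviation are passed as parameters.
Price_ks_test = stats.kstest(Price, 'norm', args=(np.mean(Price), np.std(Price, ddof=1)))
# Box-Cox transform of price (requires strictly positive values); the fitted lambda is discarded
Price, _ = stats.boxcox(Price)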
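
For reference, the Box-Cox transform with parameter lambda maps a positive value x to (x^lambda - 1) / lambda when lambda is nonzero and to ln(x) when lambda is 0. Since the models are fitted on the transformed price, a prediction has to go through the inverse transform before it can be read as a price. Below is a minimal pure-Python sketch of both directions. The function names, the sample prices, and lambda = 0.2 are illustrative assumptions; in practice lambda is the second value returned by `stats.boxcox`, and SciPy provides `scipy.special.inv_boxcox` for the inverse.

```python
import math


def boxcox(x, lam):
    """Box-Cox transform of a single strictly positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def inv_boxcox(y, lam):
    """Inverse of boxcox: recover x from the transformed value y."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)


lam = 0.2                          # illustrative value, not fitted to the data
prices = [300.0, 1500.0, 12000.0]  # hypothetical account prices
transformed = [boxcox(p, lam) for p in prices]
recovered = [inv_boxcox(t, lam) for t in transformed]

for p, t, r in zip(prices, transformed, recovered):
    print(p, "->", round(t, 4), "->", round(r, 2))
```

Keeping the fitted lambda instead of discarding it is what makes the inverse step possible.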

2. Predicting price with random forest and XGBoost

First, split the data into a training set and a test set.

# Split the dataset: 75% for training, 25% for testing
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(Features, Price, test_size=0.25, random_state=123)

Build and train the random forest model.

from sklearn.ensemble import RandomForestRegressor
# Instantiate the random forest model
rf = RandomForestRegressor(n_estimators=200, random_state=648)
# Train the model on training data
rf.fit(train_features, train_labels)

Evaluate the model's goodness of fit and prediction error.

import numpy as np
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

pre_test = rf.predict(train_features)    # predictions on the training set
predictions = rf.predict(test_features)  # predictions on the test set

def TEST(predictions, test_labels):
    errors = abs(predictions - test_labels)
    R2 = r2_score(test_labels, predictions)
    print('R2:', R2)
    print('Mean Absolute Error:', round(np.mean(errors), 2))
    # Mean squared error
    MSE = mean_squared_error(test_labels, predictions)
    print('MSE:', MSE)

TEST(pre_test, train_labels)    # training-set metrics
TEST(predictions, test_labels)  # test-set metrics

Here R2 is the coefficient of determination (R-squared), Mean Absolute Error is the mean absolute error (MAE), and MSE is the mean squared error.
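
All three metrics are simple to compute by hand, which helps when reading the printed output. The sketch below re-implements them in plain Python on made-up numbers. The function names and sample values are illustrative only; the evaluation code above uses `r2_score` and `mean_squared_error` from scikit-learn.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average of (true - predicted)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [100.0, 200.0, 300.0, 400.0]  # hypothetical prices
y_pred = [110.0, 190.0, 320.0, 380.0]  # hypothetical predictions

print("MAE:", mae(y_true, y_pred))  # 15.0
print("MSE:", mse(y_true, y_pred))  # 250.0
print("R2:", r2(y_true, y_pred))    # about 0.98
```

MSE weights large errors more heavily than MAE, so a model with a few large misses can look acceptable on MAE and poor on MSE.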

Next, build, train, and evaluate the XGBoost model. The results of both models are shown in the table below.

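import xgboost as xgb

# XGBoost model. reg:gamma is a gamma-regression objective and expects positive labels.
# verbosity replaces the older silent flag in recent XGBoost releases.
XGBRE = xgb.XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=200, verbosity=1, objective='reg:gamma')
XGBRE.fit(train_features, train_labels)

predictions2 = XGBRE.predict(test_features)  # predictions on the test set
pre_test2 = XGBRE.predict(train_features)    # predictions on the training set
TEST(pre_test2, train_labels)
TEST(predictions2, test_labels)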

Table 1. Model performance of random forest and XGBoost
Model and data set	R-squared	MSE	MAE
Random forest, training set	0.98371	192499.76	210.76
Random forest, test set	0.9023	1163607.211	544.63
XGBoost, training set	0.97365	311369.2249	302.76
XGBoost, test set	0.91767	980529.5865	509.24

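R-squared normally lies between 0 and 1. The larger it is, the more of the variation in the dependent variable is explained by the independent variables, and the better the fit. Table 1 shows that R-squared exceeds 0.9 for both random forest and XGBoost on the training and test sets alike, which indicates a good fit.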
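
Two derived quantities make the table easier to interpret: the root mean squared error (the square root of MSE, in the same units as MAE) and the drop in R-squared from training set to test set, which is a rough sign of overfitting. The snippet below computes both from the values reported in the table above. The dictionary layout and variable names are only for illustration.

```python
import math

# (R-squared, MSE, MAE) as reported in the comparison table
results = {
    ("random_forest", "train"): (0.98371, 192499.76, 210.76),
    ("random_forest", "test"): (0.9023, 1163607.211, 544.63),
    ("xgboost", "train"): (0.97365, 311369.2249, 302.76),
    ("xgboost", "test"): (0.91767, 980529.5865, 509.24),
}

# Root mean squared error for every model / data set pair
rmse = {key: math.sqrt(values[1]) for key, values in results.items()}

# Train-minus-test R-squared for each model
r2_gap = {
    model: results[(model, "train")][0] - results[(model, "test")][0]
    for model in ("random_forest", "xgboost")
}

for model in ("random_forest", "xgboost"):
    print(model,
          "test RMSE:", round(rmse[(model, "test")], 1),
          "R2 gap:", round(r2_gap[model], 5))
```

On these numbers XGBoost has the higher test R-squared and the smaller train-test gap (about 0.056 versus 0.081), so it generalizes slightly better here even though random forest fits the training set more closely.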

Scatter matrix plots for random forest and XGBoost:

Next, compute the feature importances of random forest and XGBoost.

# Rank the importance of the hero-ownership features
def WJFI(feature_importances, ALLWJ):
    FI = list()
    INPindex = 0
    for fi in feature_importances:
        if fi[0] in ALLWJ:
            INPindex = INPindex + 1
            FI.append(list(fi))
            print('Variable%d:' % INPindex, fi[0], 'Importance:', fi[1])
    return FI

# Rank the importance of all variables
def Feature_Importances(fis, start, end):
    importances = list(fis)
    # List of (variable name, importance) tuples
    feature_list = VarName[start:end]
    feature_importances = [(feature, round(importance, 5)) for feature, importance in zip(feature_list, importances)]
    # Sort the feature importances by most important first
    feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)
    wji = WJFI(feature_importances, ALLWJ)
    return feature_importances, wji

# Random forest importances. VarName is presumably the list of column names
# and F_start the index of the first feature column.
importances = list(rf.feature_importances_)
FI0, WJI0 = Feature_Importances(importances, F_start, len(VarName))

The results are summarized in the tables below.

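Table 2. Top 7 variables by feature importance for random forest and XGBoost
	Random forest		XGBoost
Rank	Variable	Importance	Variable	Importance
1	Number of heroes	0.3007	Number of heroes	0.07675
2	Number of collector's-edition (典藏) cards	0.29562	Number of collector's-edition (典藏) cards	0.06218
3	吴吕蒙 (Wu, Lü Meng) advancement	0.03022	蜀马云禄 (Shu, Ma Yunlu) advancement	0.02873
4	吴陆逊 (Wu, Lu Xun) advancement	0.01486	魏夏侯惇 (Wei, Xiahou Dun) advancement	0.02177
5	蜀庞统 (Shu, Pang Tong) advancement	0.01384	群甄洛 (Qun, Zhen Luo) advancement	0.02082
6	蜀马云禄 (Shu, Ma Yunlu) advancement	0.01237	吴周瑜 (Wu, Zhou Yu) advancement	0.01747
7	吴周瑜 (Wu, Zhou Yu) advancement	0.01053	吴吕蒙 (Wu, Lü Meng) advancement	0.01568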

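Table 3. Top 10 hero-ownership features by importance for random forest and XGBoost
	Random forest		XGBoost
Rank	Hero	Importance	Hero	Importance
1	蜀刘备 (Shu, Liu Bei)	0.00606	蜀刘备 (Shu, Liu Bei)	0.01431
2	汉皇甫嵩 (Han, Huangfu Song)	0.00312	群马超 (Qun, Ma Chao)	0.01311
3	蜀关银屏 (Shu, Guan Yinping)	0.00257	吴吕蒙 (Wu, Lü Meng)	0.01261
4	吴吕蒙 (Wu, Lü Meng)	0.00224	蜀关银屏 (Shu, Guan Yinping)	0.01246
5	魏徐晃 (Wei, Xu Huang)	0.0017	汉皇甫嵩 (Han, Huangfu Song)	0.01193
6	魏荀攸 (Wei, Xun You)	0.0013	吴孙权 (Wu, Sun Quan)	0.01178
7	蜀马岱 (Shu, Ma Dai)	0.00117	魏曹操 (Wei, Cao Cao)	0.00972
8	蜀庞统 (Shu, Pang Tong)	0.00114	魏张辽 (Wei, Zhang Liao)	0.00939
9	汉董卓 (Han, Dong Zhuo)	0.00095	汉张机 (Han, Zhang Ji)	0.00872
10	群马超 (Qun, Ma Chao)	0.00089	汉董卓 (Han, Dong Zhuo)	0.00864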

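Hero feature importance can be read as an indication of how much players value different heroes. 蜀刘备 (Shu, Liu Bei) ranks first in both models. It is followed by 汉皇甫嵩 (Han, Huangfu Song), ranked 2nd by random forest and 5th by XGBoost; 蜀关银屏 (Shu, Guan Yinping), ranked 3rd and 4th; and 吴吕蒙 (Wu, Lü Meng), ranked 4th and 3rd.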
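
The rank comparison in the paragraph above can be reproduced directly from the two top-10 lists. The following sketch uses the hero names exactly as they appear in the data. The variable names are illustrative, and a hero outside a model's top 10 simply has no rank in that list.

```python
# Top-10 hero lists from the table above, most important first
rf_top10 = ["蜀刘备", "汉皇甫嵩", "蜀关银屏", "吴吕蒙", "魏徐晃",
            "魏荀攸", "蜀马岱", "蜀庞统", "汉董卓", "群马超"]
xgb_top10 = ["蜀刘备", "群马超", "吴吕蒙", "蜀关银屏", "汉皇甫嵩",
             "吴孙权", "魏曹操", "魏张辽", "汉张机", "汉董卓"]

# 1-based rank of each hero within each list
rf_rank = {name: i + 1 for i, name in enumerate(rf_top10)}
xgb_rank = {name: i + 1 for i, name in enumerate(xgb_top10)}

# Heroes that both models place in their top 10, ordered by combined rank
# (ties broken by the random forest rank)
common = sorted(
    set(rf_rank) & set(xgb_rank),
    key=lambda name: (rf_rank[name] + xgb_rank[name], rf_rank[name]),
)

for name in common:
    print(name, "random forest:", rf_rank[name], "XGBoost:", xgb_rank[name])
```

Six heroes appear in both lists. The three that tie on combined rank behind 蜀刘备 are the ones singled out in the text.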

Source: https://blog.csdn.net/GreyLZ/article/details/120269612