
20210325 Cohort 23 Heartbeat Detection – Task04: Model Tuning


IV. Model Tuning



Source

Datawhale Cohort 23 – Data Mining: Heartbeat Classification:
https://github.com/datawhalechina/team-learning-data-mining/tree/master/HeartbeatClassification
Authors: 鱼佬, 杜晓东, 张晋, 王皓月, 牧小熊, 姚昱君, 杨梦迪

Forum: http://datawhale.club/t/topic/1574

import pandas as pd
import numpy as np
from sklearn.metrics import f1_score

import os
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

1 Models and Parameter Tuning

1.1 Overview of the Models

1.1.1 Logistic Regression (LR)

A detailed article on logistic regression: https://zhuanlan.zhihu.com/p/74874291

Advantages: simple to implement, cheap to train and predict with; outputs class probabilities directly; the learned coefficients are easy to interpret.

Disadvantages: can only fit linear decision boundaries, so it underfits complex nonlinear data unless features are engineered by hand; it is also sensitive to multicollinearity among features.
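
As a quick illustration (not from the original notebook), a minimal scikit-learn sketch of fitting a logistic regression on synthetic data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000)  # linear model with a sigmoid link
clf.fit(X_tr, y_tr)
print('test accuracy:', clf.score(X_te, y_te))
print('P(y=1) for the first test sample:', clf.predict_proba(X_te[:1])[0, 1])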

1.1.2 Decision Tree

A decision tree is a basic method for both classification and regression, and a non-parametric supervised learning approach. Classic algorithms include ID3, C4.5, and CART.

Decision tree learning builds a tree model from the training data under the principle of minimizing a loss function. It usually involves three steps: feature selection, tree generation, and tree pruning, as illustrated in the sketch below.
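
A minimal scikit-learn sketch of the three steps (illustrative, not from the original notebook; scikit-learn's DecisionTreeClassifier implements an optimized CART, and ccp_alpha, available since scikit-learn 0.22, controls cost-complexity pruning):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# criterion chooses the feature-selection measure (gini impurity or entropy);
# ccp_alpha > 0 turns on cost-complexity post-pruning.
tree = DecisionTreeClassifier(criterion='gini', ccp_alpha=0.01, random_state=0)
tree.fit(X_tr, y_tr)
print('test accuracy:', tree.score(X_te, y_te))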


1.1.3 Ensemble Methods

An ensemble method combines multiple base learners into a stronger model; the two main families are bagging and boosting.

The differences between Bagging and Boosting can be summarized as follows:
- Sampling: bagging trains each base learner on an independent bootstrap sample; boosting reuses the whole training set but reweights samples, focusing on previously misclassified ones.
- Training: bagging learners are trained in parallel; boosting learners are trained sequentially, each depending on the previous ones.
- Combination: bagging averages (or votes) with equal weights; boosting takes a weighted combination of the learners.
- Effect: bagging mainly reduces variance, while boosting mainly reduces bias.
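
A minimal scikit-learn sketch contrasting the two families on synthetic data (illustrative only; BaggingClassifier defaults to decision-tree base learners):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: independent trees fit in parallel on bootstrap samples,
# combined by majority vote -> mainly reduces variance.
bag = BaggingClassifier(n_estimators=100, random_state=0)
# Boosting: trees fit sequentially, each correcting the previous
# ensemble's errors -> mainly reduces bias.
boost = GradientBoostingClassifier(n_estimators=100, random_state=0)

print('bagging  5-fold CV accuracy:', cross_val_score(bag, X, y, cv=5).mean())
print('boosting 5-fold CV accuracy:', cross_val_score(boost, X, y, cv=5).mean())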


1.2 Model Evaluation

Two requirements when splitting a dataset:
- The training and test sets should be drawn from the same distribution as the real data.
- The training and test sets must be mutually exclusive.

Three common splitting strategies: the hold-out method, cross-validation, and the bootstrap; a minimal sketch of each follows.
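
A minimal sketch of the three strategies (illustrative, not from the original notebook):

import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

# 1) Hold-out: a single random train/validation split.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# 2) K-fold cross-validation: every sample is used for validation exactly once.
for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    pass  # train on X[tr_idx], validate on X[val_idx]

# 3) Bootstrap: draw n samples with replacement; the out-of-bag (~36.8%)
#    samples that were never drawn serve as the evaluation set.
boot_idx = np.random.default_rng(0).choice(len(X), size=len(X), replace=True)
oob_idx = np.setdiff1d(np.arange(len(X)), boot_idx)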


2 Task 4 Code

The code was run in the AI Studio environment.

# Read the data
data = pd.read_csv("/home/aistudio/data/data76615/train.csv")
# Simple preprocessing: split the comma-separated signal string into float features
data_list = []
for items in data.values:
    data_list.append([items[0]] + [float(i) for i in items[1].split(',')] + [items[2]])

data = pd.DataFrame(np.array(data_list))
data.columns = ['id'] + ['s_'+str(i) for i in range(len(data_list[0])-2)] + ['label']

data = reduce_mem_usage(data)  # shrink the DataFrame's memory footprint
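
reduce_mem_usage is defined in an earlier task of this series and is not shown here. The following is a sketch of the usual Kaggle-style downcasting helper (an assumption about its exact body; it presumes numeric-only columns) so the block above can run standalone — define it before calling it:

def reduce_mem_usage(df):
    """Downcast each numeric column to the smallest dtype that holds its range.
    (Sketch of the helper from earlier tasks; assumes numeric-only columns.)"""
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        c_min, c_max = df[col].min(), df[col].max()
        if str(df[col].dtype)[:3] == 'int':
            for t in (np.int8, np.int16, np.int32, np.int64):
                if np.iinfo(t).min <= c_min and c_max <= np.iinfo(t).max:
                    df[col] = df[col].astype(t)
                    break
        else:
            for t in (np.float16, np.float32, np.float64):
                if np.finfo(t).min <= c_min and c_max <= np.finfo(t).max:
                    df[col] = df[col].astype(t)
                    break
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage reduced from {:.2f} MB to {:.2f} MB'.format(start_mem, end_mem))
    return df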

from sklearn.model_selection import KFold
# Separate features and labels for cross-validation
X_train = data.drop(['id','label'], axis=1)
y_train = data['label']

# 5-fold cross-validation
folds = 5
seed = 2021
kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

def f1_score_vali(preds, data_vali):
    """Custom macro-F1 feval for lgb.train: LightGBM passes multiclass
    predictions as a flat, class-major array, hence the reshape."""
    labels = data_vali.get_label()
    preds = np.argmax(preds.reshape(4, -1), axis=0)  # (num_class, num_data) -> predicted class per sample
    score_vali = f1_score(y_true=labels, y_pred=preds, average='macro')
    return 'f1_score', score_vali, True  # True: higher is better

"""对训练集数据进行划分,分成训练集和验证集,并进行相应的操作"""
from sklearn.model_selection import train_test_split
import lightgbm as lgb
# 数据集划分
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2)
train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
valid_matrix = lgb.Dataset(X_val, label=y_val)

params = {
    "learning_rate": 0.1,
    "boosting": 'gbdt',  
    "lambda_l2": 0.1,
    "max_depth": -1,
    "num_leaves": 128,
    "bagging_fraction": 0.8,
    "feature_fraction": 0.8,
    "metric": None,
    "objective": "multiclass",
    "num_class": 4,
    "nthread": 10,
    "verbose": -1,
}

"""使用训练集数据进行模型训练"""
model = lgb.train(params, 
                  train_set=train_matrix, 
                  valid_sets=valid_matrix, 
                  num_boost_round=2000, 
                  verbose_eval=50, 
                  early_stopping_rounds=200,
                  feval=f1_score_vali)


val_pre_lgb = model.predict(X_val, num_iteration=model.best_iteration)
preds = np.argmax(val_pre_lgb, axis=1)
score = f1_score(y_true=y_val, y_pred=preds, average='macro')
print('F1 of the untuned LightGBM model on the validation set: {}'.format(score))
F1 of the untuned LightGBM model on the validation set: 0.9680804558208395

"""使用lightgbm 5折交叉验证进行建模预测"""
cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
    print('************************************ {} ************************************'.format(str(i+1)))
    X_train_split, y_train_split, X_val, y_val = X_train.iloc[train_index], y_train[train_index], X_train.iloc[valid_index], y_train[valid_index]
    
    train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
    valid_matrix = lgb.Dataset(X_val, label=y_val)

    params = {
                "learning_rate": 0.1,
                "boosting": 'gbdt',  
                "lambda_l2": 0.1,
                "max_depth": -1,
                "num_leaves": 128,
                "bagging_fraction": 0.8,
                "feature_fraction": 0.8,
                "metric": None,
                "objective": "multiclass",
                "num_class": 4,
                "nthread": 10,
                "verbose": -1,
            }
    
    model = lgb.train(params, 
                      train_set=train_matrix, 
                      valid_sets=valid_matrix, 
                      num_boost_round=2000, 
                      verbose_eval=100, 
                      early_stopping_rounds=200,
                      feval=f1_score_vali)
    
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)
    
    val_pred = np.argmax(val_pred, axis=1)
    cv_scores.append(f1_score(y_true=y_val, y_pred=val_pred, average='macro'))
    print(cv_scores)

print("lgb_scotrainre_list:{}".format(cv_scores))
print("lgb_score_mean:{}".format(np.mean(cv_scores)))
print("lgb_score_std:{}".format(np.std(cv_scores)))
*********************************** 2 ************************************
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0472933	valid_0's f1_score: 0.965828
[200]	valid_0's multi_logloss: 0.0514952	valid_0's f1_score: 0.968138
Early stopping, best iteration is:
[87]	valid_0's multi_logloss: 0.0467472	valid_0's f1_score: 0.96567
[0.9674515729721614, 0.9656700872844327]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0378154	valid_0's f1_score: 0.971004
[200]	valid_0's multi_logloss: 0.0405053	valid_0's f1_score: 0.973736
Early stopping, best iteration is:
[93]	valid_0's multi_logloss: 0.037734	valid_0's f1_score: 0.970004
[0.9674515729721614, 0.9656700872844327, 0.9700043639844769]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0495142	valid_0's f1_score: 0.967106
[200]	valid_0's multi_logloss: 0.0542324	valid_0's f1_score: 0.969746
Early stopping, best iteration is:
[84]	valid_0's multi_logloss: 0.0490886	valid_0's f1_score: 0.965566
[0.9674515729721614, 0.9656700872844327, 0.9700043639844769, 0.9655663272378014]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0412544	valid_0's f1_score: 0.964054
[200]	valid_0's multi_logloss: 0.0443025	valid_0's f1_score: 0.965507
Early stopping, best iteration is:
[96]	valid_0's multi_logloss: 0.0411855	valid_0's f1_score: 0.963114
[0.9674515729721614, 0.9656700872844327, 0.9700043639844769, 0.9655663272378014, 0.9631137190307674]
lgb_score_list:[0.9674515729721614, 0.9656700872844327, 0.9700043639844769, 0.9655663272378014, 0.9631137190307674]
lgb_score_mean:0.9663612141019279
lgb_score_std:0.0022854824074775683


from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def rf_cv_lgb(num_leaves, max_depth, bagging_fraction, feature_fraction, bagging_freq, min_data_in_leaf, 
              min_child_weight, min_split_gain, reg_lambda, reg_alpha):
    # Build the model; Bayesian optimization proposes floats, so integer
    # parameters are cast back before they reach LightGBM
    model_lgb = lgb.LGBMClassifier(boosting_type='gbdt', objective='multiclass', num_class=4,
                                   learning_rate=0.1, n_estimators=5000,
                                   num_leaves=int(num_leaves), max_depth=int(max_depth),
                                   bagging_fraction=round(bagging_fraction, 2), feature_fraction=round(feature_fraction, 2),
                                   bagging_freq=int(bagging_freq), min_data_in_leaf=int(min_data_in_leaf),
                                   min_child_weight=min_child_weight, min_split_gain=min_split_gain,
                                   reg_lambda=reg_lambda, reg_alpha=reg_alpha,
                                   n_jobs=8
                                  )
    f1 = make_scorer(f1_score, average='micro')
    val = cross_val_score(model_lgb, X_train_split, y_train_split, cv=5, scoring=f1).mean()

    return val

from bayes_opt import BayesianOptimization
"""Define the search space"""
bayes_lgb = BayesianOptimization(
    rf_cv_lgb, 
    {
        'num_leaves': (10, 200),
        'max_depth': (3, 20),
        'bagging_fraction': (0.5, 1.0),
        'feature_fraction': (0.5, 1.0),
        'bagging_freq': (0, 100),
        'min_data_in_leaf': (10, 100),
        'min_child_weight': (0, 10),
        'min_split_gain': (0.0, 1.0),
        'reg_alpha': (0.0, 10),
        'reg_lambda': (0.0, 10),
    }
)

"""Run the optimization (with the default init_points=5, this gives the 15 evaluations shown below)"""
bayes_lgb.maximize(n_iter=10)

Results:

  |   iter    |  target   | baggin... | baggin... | featur... | max_depth | min_ch... | min_da... | min_sp... | num_le... | reg_alpha | reg_la... |
  |  1        |  0.9785   |  0.5174   |  10.78    |  0.8746   |  10.15    |  4.288    |  48.97    |  0.2337   |  42.83    |  6.551    |  9.015    |
  |  2        |  0.9778   |  0.6777   |  41.77    |  0.5291   |  12.15    |  4.16     |  26.39    |  0.2461   |  55.78    |  6.528    |  0.6003   |
  |  3        |  0.9745   |  0.5825   |  68.77    |  0.5932   |  8.36     |  9.296    |  77.74    |  0.7946   |  79.12    |  3.045    |  5.593    |
  |  4        |  0.9802   |  0.9669   |  78.34    |  0.77     |  19.68    |  9.886    |  66.34    |  0.255    |  161.1    |  4.727    |  8.18     |
  |  5        |  0.9836   |  0.9897   |  51.9     |  0.9737   |  16.82    |  2.001    |  42.1     |  0.03563  |  134.2    |  3.437    |  1.368    |
  |  6        |  0.9749   |  0.5575   |  46.2     |  0.6518   |  15.9     |  7.817    |  34.12    |  0.341    |  153.2    |  7.144    |  7.899    |
  |  7        |  0.9793   |  0.9644   |  55.08    |  0.9795   |  18.5     |  2.085    |  41.22    |  0.7031   |  129.9    |  3.369    |  2.717    |
  |  8        |  0.9819   |  0.5926   |  58.23    |  0.6149   |  16.81    |  2.911    |  39.91    |  0.1699   |  137.3    |  2.685    |  2.891    |
  |  9        |  0.983    |  0.7796   |  50.38    |  0.7261   |  17.87    |  3.499    |  37.59    |  0.1404   |  136.1    |  2.442    |  6.621    |
  |  10       |  0.9843   |  0.638    |  49.32    |  0.9282   |  11.33    |  6.504    |  43.21    |  0.288    |  137.7    |  0.2083   |  6.966    |
  |  11       |  0.9798   |  0.8196   |  47.05    |  0.5845   |  9.075    |  2.965    |  46.16    |  0.3984   |  131.6    |  3.634    |  2.601    |
  |  12       |  0.9726   |  0.7688   |  37.57    |  0.9811   |  10.26    |  1.239    |  17.54    |  0.9651   |  46.5     |  8.834    |  6.276    |
  |  13       |  0.9836   |  0.5214   |  48.3     |  0.8203   |  19.13    |  3.129    |  35.47    |  0.08455  |  138.2    |  2.345    |  9.691    |
  |  14       |  0.9738   |  0.5617   |  45.75    |  0.8648   |  18.88    |  4.383    |  46.88    |  0.9315   |  141.8    |  4.968    |  5.563    |
  |  15       |  0.9807   |  0.8046   |  47.05    |  0.6449   |  12.38    |  0.3744   |  41.13    |  0.6808   |  138.7    |  0.8521   |  9.461    |
  =================================================================================================================================================
  """显示优化结果"""
  bayes_lgb.max
{'target': 0.9842625,
 'params': {'bagging_fraction': 0.6379596054685973,
  'bagging_freq': 49.319589248277715,
  'feature_fraction': 0.9282486828608231,
  'max_depth': 11.32826513626976,
  'min_child_weight': 6.5044214037514845,
  'min_data_in_leaf': 43.211716584925405,
  'min_split_gain': 0.28802399981965143,
  'num_leaves': 137.7332804262704,
  'reg_alpha': 0.2082701560002398,
  'reg_lambda': 6.966270735649479}}
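
Since the optimizer proposes continuous values, integer-typed parameters should be cast back before retraining. The original log stops at the search result; the following is a hedged sketch of the natural next step, reusing train_matrix and valid_matrix from above:

# Cast the best suggestions back to the types LightGBM expects (illustrative sketch).
best = bayes_lgb.max['params']
params_tuned = {
    "learning_rate": 0.1,
    "boosting": 'gbdt',
    "objective": "multiclass",
    "num_class": 4,
    "metric": None,
    "num_leaves": int(best['num_leaves']),
    "max_depth": int(best['max_depth']),
    "bagging_fraction": round(best['bagging_fraction'], 2),
    "feature_fraction": round(best['feature_fraction'], 2),
    "bagging_freq": int(best['bagging_freq']),
    "min_data_in_leaf": int(best['min_data_in_leaf']),
    "min_child_weight": best['min_child_weight'],
    "min_split_gain": best['min_split_gain'],
    "reg_lambda": best['reg_lambda'],
    "reg_alpha": best['reg_alpha'],
    "nthread": 10,
    "verbose": -1,
}
model_tuned = lgb.train(params_tuned,
                        train_set=train_matrix,
                        valid_sets=valid_matrix,
                        num_boost_round=2000,
                        verbose_eval=100,
                        early_stopping_rounds=200,
                        feval=f1_score_vali)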

References

1. [Machine Learning] Logistic Regression (very detailed): https://zhuanlan.zhihu.com/p/74874291
2. Logistic Regression: https://zhuanlan.zhihu.com/p/28408516
3. Decision Tree Models: https://zhuanlan.zhihu.com/p/335254687
4. Ensemble Models: https://zhuanlan.zhihu.com/p/65888174
5. Dataset Splitting: https://jishuin.proginn.com/p/763bfbd37630

Source: https://blog.csdn.net/yuliuchenyin/article/details/115196993