
Random Forest


Building on some familiarity with decision trees and ensemble learning, we can go one step further and understand the strategy adopted by random forests: sample both the training data and the features, train multiple decision trees, and combine them into an ensemble.

Basic concepts

(From Baidu Baike)

Each tree is built by bootstrap-sampling the training data and, at each node, splitting on the best feature among a randomly chosen subset of the candidate features, as detailed in the sections below.

Related concepts

Random selection of data

First, draw samples with replacement from the original dataset to build a sub-dataset whose size equals that of the original dataset. Elements may repeat across different sub-datasets, and may also repeat within the same sub-dataset. Second, use each sub-dataset to build a sub-decision-tree: the data are fed into every sub-decision-tree, and each sub-decision-tree outputs a result. Finally, when new data need to be classified by the random forest, the forest's output is obtained by voting over the predictions of the sub-decision-trees. As shown in the figure below, if the random forest contains 3 sub-decision-trees, 2 of which predict class A and 1 predicts class B, then the random forest predicts class A.

(Figure rf1: majority voting over the predictions of three sub-decision-trees)
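
As a toy illustration of the voting step (a sketch added here, not code from the original post, with hypothetical class labels 'A' and 'B'):

from collections import Counter

# Hypothetical predictions from three sub-trees for one new sample
subtree_predictions = ['A', 'A', 'B']

# Majority vote: the class predicted by most sub-trees wins
votes = Counter(subtree_predictions)
forest_prediction = votes.most_common(1)[0][0]
print(forest_prediction)  # -> 'A'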

If the training set D has n samples and we draw n samples with replacement, the resulting subset D1 also contains n samples, but with duplicates. Since $(1-\frac{1}{n})^n$ is the probability that a particular sample is never drawn in n draws with replacement, and

$\lim_{n \to +\infty} \left(1-\frac{1}{n}\right)^{n} = \frac{1}{e} \approx 0.368$

sampling with replacement means that roughly 63.2% of the samples in D appear in D1, while the remaining 36.8% do not. Those 36.8% are usually used as a validation set for an "out-of-bag estimate" of the generalization performance.
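
A quick numerical check of the 63.2% figure (a sketch added here, not part of the original post): draw n indices with replacement and measure what fraction of the original samples appears in the bootstrap subset; the remainder is the out-of-bag set.

import numpy as np

rng = np.random.default_rng(0)
n = 100000
# Sample n indices with replacement (one bootstrap subset D1)
bootstrap_idx = rng.integers(0, n, size=n)
in_bag = np.unique(bootstrap_idx)
print(len(in_bag) / n)        # ~0.632: fraction of D that appears in D1
print(1 - len(in_bag) / n)    # ~0.368: fraction left out-of-bag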

Random selection of features

Similar to the random selection of data, each split of a sub-tree in a random forest does not use all candidate features; instead, a subset of the candidate features is drawn at random, and the best feature is then chosen from this random subset. This makes the decision trees in the forest differ from one another, increases the diversity of the ensemble, and thus improves classification performance.
In the figure below, the blue squares represent all features that can be chosen, i.e. the candidate features, and the yellow square is the splitting feature. The left side shows the feature-selection process of an ordinary decision tree, which picks the optimal splitting feature among all candidate features (recall the ID3, C4.5 and CART algorithms mentioned earlier) to perform the split. Below it is the feature-selection process of a sub-tree in a random forest.

If the total number of features is m and k features are drawn for each split, the commonly recommended value is $k = \log_2 m$. In short, k of the m features are randomly selected and used to find the optimal split.

(Figure rf2: feature selection in an ordinary decision tree vs. a random-forest sub-tree)
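
To make the feature sub-sampling concrete, here is a small sketch (an illustration, not the post's code): at each split, k = log2(m) of the m candidate features are drawn at random, and only those k are searched for the best split.

import numpy as np

m = 64                                  # total number of candidate features
k = int(np.log2(m))                     # commonly recommended subset size, k = log2(m)
rng = np.random.default_rng(42)

# Features considered at one particular split of one tree
candidate_features = rng.choice(m, size=k, replace=False)
print(k, sorted(candidate_features))    # e.g. 6 features drawn out of 64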

For more on the random selection of samples and of features, see: 【机器学习-西瓜书】八、Bagging;随机森林(RF)

import warnings
import pandas as pd
import numpy as np
import scipy
import seaborn as sns
import xgboost as xgb
import tensorflow as tf
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn import metrics

# from sklearn.utils import class_weight
# from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
# from sklearn.metrics import roc_auc_score,precision_recall_curve

pd.set_option('display.max_columns',1000)
pd.set_option('display.max_rows',1000)
pd.set_option('display.width',100)

# Suppress version-incompatibility and other warnings
warnings.filterwarnings("ignore")
clf = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, 
                          min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', 
                          max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, 
                          bootstrap=True, oob_score=True, n_jobs=1, random_state=None, verbose=0,
                          warm_start=False, class_weight=None)
# Note: min_impurity_split and max_features='auto' have been removed in newer scikit-learn
# releases; drop them (or use max_features='sqrt') if the call above raises a TypeError.

Reposted from: sklearn随机森林分类类RandomForestClassifier

RandomForestClassifier parameters

The number of features to consider when looking for the best split (max_features):

  • If int, consider max_features features at each split.
  • If float, max_features is a fraction, and int(max_features * n_features) features are considered at each split.
  • If "auto", then max_features = sqrt(n_features), the square root of n_features.
  • If "log2", then max_features = log2(n_features).
  • If None, then max_features = n_features.
  • Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if this requires inspecting more than max_features features.

The minimum number of samples required to split an internal node (min_samples_split):

  • If int, min_samples_split is that minimum number.
  • If float, min_samples_split is a fraction, and ceil(min_samples_split * n_samples) is the minimum number of samples required for each split.

The minimum number of samples required to be at a leaf node (min_samples_leaf):

  • If int, min_samples_leaf is that minimum number.
  • If float, min_samples_leaf is a fraction, and ceil(min_samples_leaf * n_samples) is the minimum number of samples required at each node.
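
The sketch below (illustrative values, not from the original post) shows how these parameters can be passed in their different forms; an int is an absolute count while a float is interpreted as a fraction.

from sklearn.ensemble import RandomForestClassifier

# Absolute counts: consider 4 features per split, require at least 10 samples to split a node
rf_int = RandomForestClassifier(max_features=4, min_samples_split=10, min_samples_leaf=5)

# Fractions: per split consider int(0.5 * n_features) features,
# require ceil(0.1 * n_samples) samples to split and ceil(0.05 * n_samples) per leaf
rf_frac = RandomForestClassifier(max_features=0.5, min_samples_split=0.1, min_samples_leaf=0.05)

# String options for max_features
rf_sqrt = RandomForestClassifier(max_features='sqrt')   # sqrt(n_features)
rf_log2 = RandomForestClassifier(max_features='log2')   # log2(n_features)
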
traindata_path = u'D:/01_Project/99_test/ML/titanic/train.csv'
testdata_path = u'D:/01_Project/99_test/ML/titanic/test.csv'
testresult_path = u'D:/01_Project/99_test/ML/titanic/gender_submission.csv'
df_train = pd.read_csv(traindata_path)
df_test = pd.read_csv(testdata_path)
df_test['Survived'] = pd.read_csv(testresult_path)['Survived']
data_original = pd.concat([df_train,df_test],sort=False)
# df_test = df_test[df_train.columns]
# display (df_train.head(5))
# data_original.drop('Name',axis=1,inplace=True)
# data_original.dropna(inplace=True)
display (data_original.head(5))
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
data_original.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
PassengerId    1309 non-null int64
Survived       1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 132.9+ KB
data_original['Sex'].replace('male',0,inplace=True)   # inplace=True replaces in place
data_original['Sex'].replace('female',1,inplace=True)
data_original['Embarked'] = data_original['Embarked'].fillna(method='bfill').fillna(method='ffill')
dummies = pd.get_dummies(data_original['Embarked'],prefix='Embarked')
dummies.head()
   Embarked_C  Embarked_Q  Embarked_S
0           0           0           1
1           1           0           0
2           0           0           1
3           0           0           1
4           0           0           1
data_original = data_original.join(dummies)
data_original.head()
   PassengerId  Survived  Pclass                                               Name  Sex   Age  SibSp  Parch     Ticket     Fare Cabin Embarked  Embarked_C  Embarked_Q  Embarked_S
0            1         0       3                            Braund, Mr. Owen Harris    0  22.0      1      0  A/5 21171   7.2500   NaN        S           0           0           1
0            1         0       3                            Braund, Mr. Owen Harris    0  22.0      1      0  A/5 21171   7.2500   NaN        S           0           1           0
0          892         0       3                                   Kelly, Mr. James    0  34.5      0      0     330911   7.8292   NaN        Q           0           0           1
0          892         0       3                                   Kelly, Mr. James    0  34.5      0      0     330911   7.8292   NaN        Q           0           1           0
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...    1  38.0      1      0   PC 17599  71.2833   C85        C           1           0           0
print (data_original['Sex'].value_counts())
data_original['Embarked'].value_counts()
0    1367
1     778
Name: Sex, dtype: int64

S    1482
C     453
Q     210
Name: Embarked, dtype: int64
print (data_original['Embarked'].unique())
feature = ['Pclass','Age','SibSp','Parch','Sex']+['Embarked_'+i for i in data_original['Embarked'].unique()]
print (feature)
for column in feature:
    data_original[column].fillna(data_original[column].mean(), inplace=True)
x_train, x_test, y_train, y_test = train_test_split(data_original[feature], data_original['Survived'], random_state=1, train_size=0.7)
# x_train, x_test, y_train, y_test = train_test_split(data_original, data_original['Survived'], random_state=1, train_size=0.7)
display(x_train.shape)
display(x_test.shape)
display(y_train.shape)
display(y_test.shape)
['S' 'Q' 'C']
['Pclass', 'Age', 'SibSp', 'Parch', 'Sex', 'Embarked_S', 'Embarked_Q', 'Embarked_C']
(1501, 8)
(644, 8)
(1501,)
(644,)
clf.fit(x_train, y_train)
RandomForestClassifier(n_estimators=10, n_jobs=1, oob_score=True)

RandomForestClassifier attributes

clf.estimators_
[DecisionTreeClassifier(max_features='auto', random_state=1495518603),
 DecisionTreeClassifier(max_features='auto', random_state=170010899),
 DecisionTreeClassifier(max_features='auto', random_state=2053236039),
 DecisionTreeClassifier(max_features='auto', random_state=1004379910),
 DecisionTreeClassifier(max_features='auto', random_state=2052542410),
 DecisionTreeClassifier(max_features='auto', random_state=834032305),
 DecisionTreeClassifier(max_features='auto', random_state=413200844),
 DecisionTreeClassifier(max_features='auto', random_state=801999364),
 DecisionTreeClassifier(max_features='auto', random_state=1345507579),
 DecisionTreeClassifier(max_features='auto', random_state=1667197337)]
clf.classes_
array([0, 1], dtype=int64)
clf.n_classes_
2
clf.n_features_
8
clf.n_outputs_
1
clf.feature_importances_
dict_importance = dict(zip(feature,clf.feature_importances_))
# display (dict_importance)
df_feature_importance = pd.DataFrame()
df_feature_importance['features'] = feature
df_feature_importance['importances'] = clf.feature_importances_
df_feature_importance = df_feature_importance.sort_values('importances',ascending=False)
df_feature_importance
     features  importances
4         Sex     0.561840
1         Age     0.249461
0      Pclass     0.061210
2       SibSp     0.054826
3       Parch     0.047580
5  Embarked_S     0.008941
6  Embarked_Q     0.008869
7  Embarked_C     0.007272
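
The importances can also be visualized; a minimal sketch (not in the original post) using the df_feature_importance frame built above and the matplotlib import from the top of the notebook:

plt.figure(figsize=(8, 5))
plt.barh(df_feature_importance['features'], df_feature_importance['importances'])
plt.gca().invert_yaxis()                 # most important feature on top
plt.xlabel('importance')
plt.title('Random forest feature importances')
plt.show()
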
clf.oob_score_
0.8500999333777481
clf.oob_decision_function_
array([[0.89891015, 0.10108985],
       [0.75      , 0.25      ],
       [0.        , 1.        ],
       ...,
       [0.17672414, 0.82327586],
       [1.        , 0.        ],
       [1.        , 0.        ]])
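
As a sanity check (a sketch added here, not from the original post), the OOB class probabilities can be turned back into hard predictions and compared with y_train; the resulting accuracy should be close to clf.oob_score_ above. Rows for samples that never fell out-of-bag may contain NaN, depending on the scikit-learn version, so they are filtered out.

import numpy as np

oob_proba = clf.oob_decision_function_
valid = ~np.isnan(oob_proba).any(axis=1)          # samples that received OOB votes
oob_pred = clf.classes_[np.argmax(oob_proba[valid], axis=1)]
oob_acc = np.mean(oob_pred == y_train.values[valid])
print(oob_acc)                                    # should roughly match clf.oob_score_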

Confusion matrix

pred_y_test = clf.predict(x_test)
# m = metrics.confusion_matrix(y_test, pred_y_test)
# display (m)
tn, fp, fn, tp = metrics.confusion_matrix(y_test, pred_y_test).ravel()
print ('matrix    label1   label0')
print ('predict1  {:<6d}   {:<6d}'.format(int(tp), int(fp)))
print ('predict0  {:<6d}   {:<6d}'.format(int(fn), int(tn)))
matrix    label1   label0
predict1  179      34    
predict0  51       380   
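
For reference, the usual scores can be computed directly from these four counts (a small sketch, not part of the original post); the same values are also available from sklearn.metrics applied to y_test and pred_y_test.

# Derive accuracy / precision / recall / F1 from the confusion-matrix counts
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print('accuracy  {:.3f}'.format(accuracy))
print('precision {:.3f}'.format(precision))
print('recall    {:.3f}'.format(recall))
print('f1        {:.3f}'.format(f1))

# Equivalent one-liners with sklearn
print(metrics.accuracy_score(y_test, pred_y_test))
print(metrics.f1_score(y_test, pred_y_test))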

Cross-validation

Evaluate model scores

score_x = x_train
score_y = y_train
# Accuracy
scores = cross_val_score(clf, score_x, score_y, cv=5, scoring='accuracy')
print('Cross-validated accuracy: '+str(scores.mean()))  
Cross-validated accuracy: 0.8640974529346623
# Precision
scores = cross_val_score(clf, score_x, score_y, cv=5, scoring='precision')
print('Cross-validated precision: '+str(scores.mean()))  
Cross-validated precision: 0.837064112977926
# Recall
scores = cross_val_score(clf, score_x, score_y, cv=5, scoring='recall')
print('Cross-validated recall: '+str(scores.mean()))  
Cross-validated recall: 0.7876161919040481
# f1_score
scores = cross_val_score(clf, score_x, score_y, cv=5, scoring='f1')
print('Cross-validated f1_score: '+str(scores.mean()))  
Cross-validated f1_score: 0.8135730532581953

Grid search for the best parameters

param_grid = [
{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
{'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
clf = RandomForestClassifier()
# clf = xgb.XGBClassifier()
grid_search = GridSearchCV(clf, param_grid, cv=5,scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)
print (grid_search.best_params_)
print (grid_search.best_estimator_)
{'max_features': 8, 'n_estimators': 30}
RandomForestClassifier(max_features=8, n_estimators=30)
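
A natural next step (a sketch, not from the original post) is to evaluate the tuned model on the held-out split from earlier; note that for a classification task a classification metric such as scoring='accuracy' or 'f1' is usually a more natural choice than 'neg_mean_squared_error'.

best_clf = grid_search.best_estimator_          # already refit on x_train by GridSearchCV
test_accuracy = best_clf.score(x_test, y_test)  # mean accuracy on the held-out split
print(test_accuracy)

# grid_search.cv_results_ holds the per-candidate cross-validation scores if a closer look is needed
# pd.DataFrame(grid_search.cv_results_).head()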

Distribution of positive and negative samples for each feature

def KdePlot(df,label,factor,flag=None,positive=1):
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    # Set up the kernel-density plot
    plt.figure(figsize=(20,10))
    sns.set(style='white')
    if positive==0:
        df[factor] = np.abs(df[factor])
    else:
        pass
    if flag == 'log':
        x0 = np.log(df[df[label]==0][factor]+1)
        x1 = np.log(df[df[label]==1][factor]+1)
    else:
        x0 = df[df[label]==0][factor]
        x1 = df[df[label]==1][factor]
        
    sns.distplot(x0,
               color = 'blue',
               kde = True, # draw the density curve
               hist = True, # draw the histogram
               #rug = True, # rug plot
               kde_kws = {'shade':True,'color':'green','facecolor':'green','label':'label_0'},
               rug_kws = {'color':'green','height':0.1,'alpha':0.1})
    plt.xlabel('%s'%factor,fontsize=40)
    plt.ylabel('label_0',fontsize = 30)
    plt.xticks(fontsize = 30)
    plt.yticks(fontsize = 30)
    plt.legend(loc='upper left',fontsize=30)
    
    plt.twinx()
    
    sns.distplot(x1,
               color = 'orange',
               kde = True, # draw the density curve
               hist = True, # draw the histogram
               #rug = True, # rug plot
               kde_kws = {'shade':True,'color':'red','facecolor':'red','label':'label_1'},
               rug_kws = {'color':'red','height':0.1,'alpha':0.2})
#     plt.xlabel('%s'%factor,fontsize=40)
    plt.ylabel('label_1',fontsize = 30)
    plt.xticks(fontsize = 30)
    plt.yticks(fontsize = 30)
    plt.legend(loc='upper right',fontsize=30)
    plt.show()
    
for factor in df_feature_importance['features'].values:
    KdePlot(data_original,'Survived',factor)

(KDE plots: distribution of each of the eight features for label 0 vs. label 1, one figure per feature)

Source: https://blog.csdn.net/weixin_42432468/article/details/114942218