首页 > 编程语言> > 机器学习入门（七）：分类算法——决策树算法

机器学习入门（七）：分类算法——决策树算法

2021-03-19 19:00:30 作者：互联网

学习目录：

决策树内容目录：

一.决策树作用：

在这里插入图片描述
这是我们判断这是个好瓜还是坏瓜的决策流程，决策树的作用：
1.帮助我们选择用哪个特征先做if，用哪个特征后做if，能最快的判断出这是好瓜还是坏瓜
2.帮助我们确定特征中作为划分标准的数值

二.原理推导

三.代码预测：

在这里插入图片描述

案例对比：比较决策树算法和KNN算法在鸢尾花数据集上的分类准确率

使用决策树算法对鸢尾花数据集分类：

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
def dicision_iris():
   """使用决策树对鸢尾花数据集进行分类
   ：return:"""
   # 获取数据
   iris = load_iris()
   # 划分数据集
   x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)
   #决策树估计器
   estimator=DecisionTreeClassifier(criterion="entropy")
   estimator.fit(x_train, y_train)  # 把训练数据放进去
   #模型评估
   #方法一：直接比对真实值和预测值
   y_predict=estimator.predict(x_test)
   print('y_predict:\n',y_predict)
   print('直接比对真实值和预测值:\n', y_test==y_predict)
   # 方法二：计算准确率
   score = estimator.score(x_test,y_test)
   print('准确率:\n', score)
   
if __name__=='__main__':
    dicision_iris()

在这里插入图片描述
使用KNN算法对鸢尾花数据集分类：

结论：KNN算法的准确率更高，因为KNN算法本来就适合小数据集（鸢尾花数据集只有150个样本），他去一个一个计算距离。而决策树更适合大数据集。--------不同算法适合不同场景

四.决策树可视化

在这里插入图片描述

生成的dot文件：

digraph Tree {
node [shape=box] ;
0 [label="petal width (cm) <= 0.75\nentropy = 1.584\nsamples = 112\nvalue = [39, 37, 36]"] ;
1 [label="entropy = 0.0\nsamples = 39\nvalue = [39, 0, 0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="petal width (cm) <= 1.75\nentropy = 1.0\nsamples = 73\nvalue = [0, 37, 36]"] ;
0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
3 [label="petal length (cm) <= 5.05\nentropy = 0.391\nsamples = 39\nvalue = [0, 36, 3]"] ;
2 -> 3 ;
4 [label="sepal length (cm) <= 4.95\nentropy = 0.183\nsamples = 36\nvalue = [0, 35, 1]"] ;
3 -> 4 ;
5 [label="petal length (cm) <= 3.9\nentropy = 1.0\nsamples = 2\nvalue = [0, 1, 1]"] ;
4 -> 5 ;
6 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
5 -> 6 ;
7 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 0, 1]"] ;
5 -> 7 ;
8 [label="entropy = 0.0\nsamples = 34\nvalue = [0, 34, 0]"] ;
4 -> 8 ;
9 [label="petal width (cm) <= 1.55\nentropy = 0.918\nsamples = 3\nvalue = [0, 1, 2]"] ;
3 -> 9 ;
10 [label="entropy = 0.0\nsamples = 2\nvalue = [0, 0, 2]"] ;
9 -> 10 ;
11 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
9 -> 11 ;
12 [label="petal length (cm) <= 4.85\nentropy = 0.191\nsamples = 34\nvalue = [0, 1, 33]"] ;
2 -> 12 ;
13 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
12 -> 13 ;
14 [label="entropy = 0.0\nsamples = 33\nvalue = [0, 0, 33]"] ;
12 -> 14 ;
}

在这里插入图片描述
把dot文件内容复制放进这个网站：

五.决策树算法总结：

优点：
          可视化—可解释能力强
缺点：
          当数据量非常大时，决策树会过于复杂，容易导致对训练样本好，测试样本差（过拟合）
如何改进？
          减枝cart算法（决策树API中已实现）
          随机森林算法（下一节讲）

六.案例：泰坦尼克号乘客生存预测

在这里插入图片描述

提取出来仓位、年龄、性别，其他的不影响他是否生存：

对年龄里的缺失值进行处理，填入平均值：

转换成字典格式：
划分数据集、字典特征抽取、决策树预估器、评估：

结果：

import pandas as pd
import graphviz
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

# 获取数据
path = "http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt"
titanic = pd.read_csv(path)

# 获取特征值与目标值
x = titanic[['pclass', 'age', 'sex']]
y = titanic['survived']

# 处理数据
x['age'].fillna(x['age'].mean(), inplace=True)
# 转换成字典
x = x.to_dict(orient='recorda')
# 划分数据
x_train, x_test, y_train, y_test = train_test_split(x, y)

# 字典特征提取
transfer = DictVectorizer()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# 构建模型
estimator = DecisionTreeClassifier(criterion='entropy')
estimator.fit(x_train, y_train)
y_predict = estimator.predict(x_test)
print('预测值与真实值比对：\n', y_predict == y_test)
# 求准确率
score = model.score(x_test, y_test)
print('准确率：\n', score)
# 可视化决策树
image = export_graphviz(
      estimator,
      out_file="C:/Users/Admin/Desktop/iris_tree.dot",
      feature_names=transfer.get_feature_names(),
)

标签：入门,label,算法,train,entropy,test,import,决策树
来源： https://blog.csdn.net/qq_45234219/article/details/114998217