Category: Decision Trees
Author: Internet
Basic Concepts

Information Entropy

Let $p_k$ be the proportion of samples of class $k$ in the current sample set $D$. The information entropy of $D$ is defined as:

$$Ent(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$

Information entropy is the most commonly used measure of the purity of a sample set: the lower the entropy, the higher the purity of the samples.
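As a minimal sketch of the definition above (the helper name `entropy` is my own), the entropy of a list of class labels can be computed directly from the class proportions:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# A 50/50 split between two classes is maximally impure: entropy = 1 bit
print(entropy(["yes", "yes", "no", "no"]))  # 1.0
```

A pure set (all labels identical) has entropy 0, the minimum possible value.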
Conditional Entropy

Suppose the discrete attribute $a$ has $V$ possible values $\{a^1, a^2, \dots, a^V\}$, and let $D^v$ denote the subset of samples in $D$ that take value $a^v$ on attribute $a$. The conditional entropy of $D$ given $a$ is defined as:

$$Ent(D \mid a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)$$

Conditional entropy is likewise a measure of the purity of a sample set.
Information Gain

Information gain = information entropy − conditional entropy. The information gain obtained by partitioning the sample set $D$ on attribute $a$ is:

$$Gain(D, a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)$$

Information gain measures how much the uncertainty of the sample set is reduced by knowing the value of attribute $a$.
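A compact sketch of this quantity (the helper names are my own): the gain is the entropy of the labels minus the weighted entropy of the label subsets induced by one attribute's values.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(D, a): entropy minus the weighted entropy of each value's subset."""
    n = len(labels)
    cond = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# An attribute that perfectly separates the classes recovers the full entropy
print(info_gain(["a", "a", "b", "b"], ["yes", "yes", "no", "no"]))  # 1.0
```

An attribute that carries no information about the labels would instead yield a gain of 0.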
Gain Ratio

Gain ratio = information gain / IV(a); that is, the gain ratio is the information gain divided by the intrinsic value $IV(a)$ of attribute $a$:

$$Gain\_ratio(D, a) = \frac{Gain(D, a)}{IV(a)}, \quad IV(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$
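The intrinsic value grows with the number of distinct attribute values, so dividing by it counteracts information gain's bias toward many-valued attributes. A sketch under the formula above (helper names are my own):

```python
import math
from collections import Counter

def intrinsic_value(values):
    """IV(a): entropy of the distribution of attribute a's own values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(gain, values):
    """Gain_ratio(D, a) = Gain(D, a) / IV(a)."""
    return gain / intrinsic_value(values)

# Two equally frequent values give IV(a) = 1 bit
print(intrinsic_value(["a", "a", "b", "b"]))  # 1.0
```

Note that C4.5 does not simply pick the attribute with the highest gain ratio; it first filters for attributes with above-average information gain, then chooses the one with the highest gain ratio among those.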
Gini Value

$$Gini(D) = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2$$

$Gini(D)$ reflects the probability that two samples drawn at random from the data set $D$ have different class labels. Hence the smaller $Gini(D)$ is, the higher the purity of $D$.
Gini Index

$$Gini\_index(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} Gini(D^v)$$

The smaller the Gini value and the Gini index, the higher the purity of the sample set.
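Both quantities can be sketched in a few lines (helper names are my own); the Gini index is the attribute-weighted average of the per-subset Gini values:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(values, labels):
    """Weighted Gini value over the subsets induced by one attribute's values."""
    n = len(values)
    total = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        total += len(subset) / n * gini(subset)
    return total

# A 50/50 split of two classes has Gini value 0.5
print(gini(["yes", "yes", "no", "no"]))  # 0.5
```

An attribute whose values perfectly separate the classes yields a Gini index of 0, which is why CART picks the attribute minimizing this quantity.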
Test Code

# Decision tree algorithms
# Three main variants: ID3, C4.5, CART
# Core idea:
# ID3 selects the best feature by information gain; C4.5 selects by gain ratio;
# CART selects by the Gini index.
import numpy as np
import math
# Compute the information entropy of a list of labels
def calc_entropy(data):
    entropy = 0.0
    label_counts = {}
    for i in range(len(data)):
        current_label = data[i]
        if current_label not in label_counts:
            label_counts[current_label] = 0
        label_counts[current_label] += 1
    for key in label_counts:
        p = label_counts[key] / len(data)
        entropy -= p * math.log(p, 2)
    return entropy
# Compute the information gain of every feature in the dataset
def calc_gain(dataset, labels):
    # Entropy of the class labels
    label_entropy = calc_entropy(labels)
    n, m = dataset.shape
    entropy_matrix, gain_matrix = np.zeros(m), np.zeros(m)
    # Compute the information gain of each feature in turn
    for i in range(m):
        # Count how often each value of feature i occurs
        feature_counts = {}
        for j in range(n):
            current_value = dataset[:, i][j]
            if current_value not in feature_counts:
                feature_counts[current_value] = 0
            feature_counts[current_value] += 1
        # Record the sample indices for each value of feature i
        f_index = {}
        for key in feature_counts:
            f_index[key] = [j for j, k in enumerate(dataset[:, i]) if k == key]
        # Entropy of the class labels within each feature-value subset
        ent = []
        for value in f_index.values():
            # Labels of the samples that take this feature value
            data = []
            for k in range(len(value)):
                data.append(labels[value[k]])
            ent.append(calc_entropy(data))
        # Weighted sum of subset entropies: the conditional entropy of feature i
        for v in range(len(ent)):
            entropy_matrix[i] += list(feature_counts.values())[v] / n * ent[v]
        gain_matrix[i] = label_entropy - entropy_matrix[i]
    return gain_matrix
Test Data
dataset = np.array([["青绿", "蜷缩", "浊响", "清晰", "凹陷", "硬滑"],
["乌黑", "蜷缩", "沉闷", "清晰", "凹陷", "硬滑"],
["乌黑", "蜷缩", "浊响", "清晰", "凹陷", "硬滑"],
["青绿", "蜷缩", "沉闷", "清晰", "凹陷", "硬滑"],
["浅白", "蜷缩", "浊响", "清晰", "凹陷", "硬滑"],
["青绿", "稍蜷", "浊响", "清晰", "稍凹", "软粘"],
["乌黑", "稍蜷", "浊响", "稍糊", "稍凹", "软粘"],
["乌黑", "稍蜷", "浊响", "清晰", "稍凹", "硬滑"],
["乌黑", "稍蜷", "沉闷", "稍糊", "稍凹", "硬滑"],
["青绿", "硬挺", "清脆", "模糊", "平坦", "软粘"],
["浅白", "硬挺", "清脆", "模糊", "平坦", "硬滑"],
["浅白", "蜷缩", "浊响", "清晰", "平坦", "软粘"],
["青绿", "稍蜷", "浊响", "稍糊", "凹陷", "硬滑"],
["浅白", "稍蜷", "沉闷", "稍糊", "凹陷", "硬滑"],
["乌黑", "稍蜷", "浊响", "清晰", "稍凹", "软粘"],
["浅白", "蜷缩", "浊响", "模糊", "平坦", "硬滑"],
["青绿", "蜷缩", "沉闷", "稍糊", "稍凹", "硬滑"]])
labels = ["是", "是", "是", "是", "是", "是", "是", "是", "否", "否", "否", "否", "否", "否", "否", "否", "否"]
feature = ["色泽", "根蒂", "敲声", "纹理", "脐部", "触感"]
data = ["青绿", "蜷缩", "浊响", "清晰", "凹陷", "硬滑"]
Test Result
是
Source: https://blog.csdn.net/weixin_49346755/article/details/121739226