Personal Credit Risk Assessment with Naive Bayes
Author: Internet
Naive Bayes
Naive Bayes methods are a family of supervised learning algorithms based on Bayes' theorem, with the "naive" assumption of independence between every pair of features. Given a class variable $y$ and a dependent feature vector $x_1$ through $x_n$, Bayes' theorem states the following relationship:
P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
Applying the naive assumption that every pair of features is conditionally independent:
P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y)
for all $i$, this relationship simplifies to:
P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}
Since $P(x_1, \dots, x_n)$ is constant given the input, we can use the following classification rule:
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y) \ \Rightarrow\ \hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)
and we can use Maximum A Posteriori (MAP) estimation to estimate $P(y)$ and $P(x_i \mid y)$; the former is the relative frequency of class $y$ in the training set.
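This MAP rule can be sketched directly in NumPy; the priors and likelihoods below are made-up numbers for illustration only:

```python
import numpy as np

# Hypothetical priors P(y) for two classes and likelihoods P(x_i | y)
# for three observed features -- all numbers are made up for illustration.
prior = np.array([0.6, 0.4])              # P(y=0), P(y=1)
likelihood = np.array([[0.5, 0.2, 0.9],   # P(x_i | y=0) for i = 1..3
                       [0.1, 0.7, 0.3]])  # P(x_i | y=1)

# P(y) * prod_i P(x_i | y), then argmax over classes (the MAP estimate).
posterior_unnorm = prior * likelihood.prod(axis=1)
y_hat = posterior_unnorm.argmax()
print(y_hat)  # 0, since 0.6*0.5*0.2*0.9 = 0.054 > 0.4*0.1*0.7*0.3 = 0.0084
```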
The various naive Bayes classifiers differ mainly in the assumptions they make about the distribution of $P(x_i \mid y)$.
Despite their oversimplified assumptions, naive Bayes classifiers work quite well in many real-world situations, notably document classification and spam filtering, and they require only a small training set to estimate the necessary parameters.
Naive Bayes learners and classifiers can be extremely fast compared with more sophisticated methods. The decoupling of the class-conditional feature distributions means that each feature can be estimated independently as a one-dimensional distribution, which in turn helps alleviate problems caused by the curse of dimensionality.
On the other hand, although naive Bayes is known as a decent classifier, it is a poor estimator, so the probabilities output by predict_proba should not be taken too seriously.
Gaussian Naive Bayes
GaussianNB implements the Gaussian naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)
The parameters $\sigma_y$ and $\mu_y$ are estimated using maximum likelihood.
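As a minimal sketch on toy one-dimensional data (not the credit dataset used later), scikit-learn's GaussianNB exposes the fitted per-class means via the theta_ attribute:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy one-dimensional data: two well-separated clusters.
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB()
clf.fit(X, y)
print(clf.theta_)                   # fitted per-class means mu_y
print(clf.predict([[1.1], [5.1]]))  # -> [0 1]
```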
Multinomial Naive Bayes
MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word count vectors, although tf-idf vectors are also known to work well in practice). The distribution is parametrized for each class $y$ by a vector $\theta_y = (\theta_{y1}, \ldots, \theta_{yn})$, where $n$ is the number of features (in text classification, the size of the vocabulary) and $\theta_{yi}$ is the probability $P(x_i \mid y)$ of feature $i$ appearing in a sample belonging to class $y$.
The parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:
\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}
where $N_{yi} = \sum_{x \in T} x_i$ is the number of times feature $i$ appears in samples of class $y$ in the training set $T$, and $N_y = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.
The smoothing prior $\alpha \ge 0$ accounts for features not present in the learning samples and prevents zero probabilities in further computations.
Setting $\alpha = 1$ is called Laplace smoothing, while $\alpha < 1$ is called Lidstone smoothing.
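A quick sanity check of the smoothed estimate, using hypothetical counts for a single class with $\alpha = 1$ (Laplace smoothing):

```python
import numpy as np

# Hypothetical counts N_yi for one class over n = 3 features;
# the third feature never occurs in this class.
N_yi = np.array([4.0, 6.0, 0.0])
N_y = N_yi.sum()             # total feature count for the class
alpha, n = 1.0, len(N_yi)    # alpha = 1: Laplace smoothing

theta_y = (N_yi + alpha) / (N_y + alpha * n)
print(theta_y)  # [5/13, 7/13, 1/13]; the unseen feature keeps a nonzero probability
```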
Bernoulli Naive Bayes
BernoulliNB implements the naive Bayes training and classification algorithms for data distributed according to multivariate Bernoulli distributions; i.e. there may be multiple features, but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. This class therefore requires samples to be represented as binary-valued feature vectors; if handed any other kind of data, a BernoulliNB instance will binarize its input (depending on the binarize parameter).
The decision rule for Bernoulli naive Bayes is based on:
P(x_i \mid y) = P(i \mid y) x_i + (1 - P(i \mid y)) (1 - x_i)
which differs from the multinomial NB rule in that it explicitly penalizes the non-occurrence of a feature $i$ that is an indicator for class $y$, whereas the multinomial variant simply ignores a non-occurring feature.
In the case of text classification, word occurrence vectors (rather than word count vectors) may be used to train and use this classifier. BernoulliNB might perform better on some datasets, especially those with shorter documents.
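A minimal sketch of BernoulliNB on a made-up word-occurrence matrix (all data invented for illustration):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy word-occurrence matrix: 4 documents, 3 vocabulary terms (0/1 only).
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([0, 0, 1, 1])

# binarize=0.0 (the default) would also threshold count-valued input at 0.
clf = BernoulliNB(binarize=0.0)
clf.fit(X, y)
print(clf.predict([[1, 1, 0], [0, 0, 1]]))  # -> [0 1]
```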
Personal Credit Risk Assessment with Naive Bayes
Data Source and First Look at the Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
credit = pd.read_csv("./input/credit.csv")
credit.head(5)
checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_length | installment_rate | personal_status | other_debtors | ... | property | age | installment_plan | housing | existing_credits | job | dependents | telephone | foreign_worker | default | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | < 0 DM | 6 | critical | radio/tv | 1169 | unknown | > 7 yrs | 4 | single male | none | ... | real estate | 67 | none | own | 2 | skilled employee | 1 | yes | yes | 1 |
1 | 1 - 200 DM | 48 | repaid | radio/tv | 5951 | < 100 DM | 1 - 4 yrs | 2 | female | none | ... | real estate | 22 | none | own | 1 | skilled employee | 1 | none | yes | 2 |
2 | unknown | 12 | critical | education | 2096 | < 100 DM | 4 - 7 yrs | 2 | single male | none | ... | real estate | 49 | none | own | 1 | unskilled resident | 2 | none | yes | 1 |
3 | < 0 DM | 42 | repaid | furniture | 7882 | < 100 DM | 4 - 7 yrs | 2 | single male | guarantor | ... | building society savings | 45 | none | for free | 1 | skilled employee | 2 | none | yes | 1 |
4 | < 0 DM | 24 | delayed | car (new) | 4870 | < 100 DM | 1 - 4 yrs | 3 | single male | none | ... | unknown/none | 53 | none | for free | 2 | skilled employee | 2 | none | yes | 2 |
5 rows × 21 columns
Data Preprocessing
The columns checking_balance, credit_history, purpose, savings_balance, employment_length, personal_status, other_debtors, property, installment_plan, housing, job, telephone, and foreign_worker are string-valued and need to be encoded as integers before modeling.
cols = ['checking_balance','credit_history', 'purpose', 'savings_balance', 'employment_length', 'personal_status',
'other_debtors','property','installment_plan','housing','job','telephone','foreign_worker']
col_dicts = {'checking_balance': {'unknown': 0,
'< 0 DM': 1,
'1 - 200 DM': 2,
'> 200 DM': 3
},
'credit_history': {'critical': 0,
'repaid': 1,
'delayed': 2,
'fully repaid': 3,
'fully repaid this bank': 4
},
'employment_length': {'unemployed': 0,
'0 - 1 yrs': 1,
'1 - 4 yrs': 2,
'4 - 7 yrs': 3,
'> 7 yrs': 4
},
'foreign_worker': {'yes': 0 ,'no': 1},
'housing': {'own': 0, 'for free': 1, 'rent': 2},
'installment_plan': {'none': 0, 'bank': 1, 'stores': 2},
'job': {'unemployed non-resident': 0,
'unskilled resident': 1,
'skilled employee': 2,
'mangement self-employed': 3
},
'other_debtors': {'none': 0,
'guarantor': 1,
'co-applicant': 2 },
'personal_status': {'single male': 0,
'female': 1,
'divorced male': 2,
'married male': 3
},
'property': {'real estate': 0,
'building society savings': 1,
'unknown/none': 2,
'other': 3
},
'purpose': {'radio/tv': 0,
'education': 1,
'furniture': 2,
'car (new)': 3,
'car (used)': 4,
'business': 5,
'domestic appliances': 6,
'repairs': 7,
'others': 8,
'retraining': 9},
'savings_balance': {'unknown': 0,
'< 100 DM': 1,
'101 - 500 DM': 2,
'501 - 1000 DM': 3,
'> 1000 DM': 4
},
'telephone': {'none': 1, 'yes': 0}}
for col in cols:
credit[col] = credit[col].map(col_dicts[col])
credit.head(5)
checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_length | installment_rate | personal_status | other_debtors | ... | property | age | installment_plan | housing | existing_credits | job | dependents | telephone | foreign_worker | default | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 6 | 0 | 0 | 1169 | 0 | 4 | 4 | 0 | 0 | ... | 0 | 67 | 0 | 0 | 2 | 2 | 1 | 0 | 0 | 1 |
1 | 2 | 48 | 1 | 0 | 5951 | 1 | 2 | 2 | 1 | 0 | ... | 0 | 22 | 0 | 0 | 1 | 2 | 1 | 1 | 0 | 2 |
2 | 0 | 12 | 0 | 1 | 2096 | 1 | 3 | 2 | 0 | 0 | ... | 0 | 49 | 0 | 0 | 1 | 1 | 2 | 1 | 0 | 1 |
3 | 1 | 42 | 1 | 2 | 7882 | 1 | 3 | 2 | 0 | 1 | ... | 1 | 45 | 0 | 1 | 1 | 2 | 2 | 1 | 0 | 1 |
4 | 1 | 24 | 2 | 3 | 4870 | 1 | 2 | 3 | 0 | 0 | ... | 2 | 53 | 0 | 1 | 2 | 2 | 2 | 1 | 0 | 2 |
5 rows × 21 columns
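As an aside, a similar integer encoding could be sketched with scikit-learn's OrdinalEncoder (shown here on a made-up two-column frame, not the credit data); the hand-written dictionaries above have the advantage of fixing a meaningful category order rather than an alphabetical one:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical frame for illustration (not the credit data).
df = pd.DataFrame({'housing': ['own', 'rent', 'own', 'for free'],
                   'telephone': ['yes', 'none', 'none', 'yes']})

# By default, categories are ordered alphabetically per column.
enc = OrdinalEncoder()
encoded = enc.fit_transform(df)
print(enc.categories_)  # learned category order per column
print(encoded)          # float codes, e.g. 'own' -> 1.0 in this frame
```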
Feature Analysis
Computing the feature correlation matrix lets us inspect the dependencies between variables.
corrmat = credit.corr()  # correlation matrix
corrmat
checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_length | installment_rate | personal_status | other_debtors | ... | property | age | installment_plan | housing | existing_credits | job | dependents | telephone | foreign_worker | default | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
checking_balance | 1.000000 | 0.035050 | 0.138210 | 0.017272 | 0.024561 | -0.005614 | -0.108536 | -0.057942 | 0.069946 | 0.041970 | ... | -0.005623 | -0.049058 | 0.033566 | 0.032925 | -0.093081 | -0.054255 | -0.040889 | 0.039209 | -0.000205 | 0.197788 |
months_loan_duration | 0.035050 | 1.000000 | 0.142631 | 0.105305 | 0.624984 | -0.064526 | 0.057381 | 0.074749 | -0.116029 | 0.006711 | ... | 0.245655 | -0.036136 | 0.076992 | 0.011950 | -0.011284 | 0.210910 | -0.023834 | -0.164718 | -0.138196 | 0.214927 |
credit_history | 0.138210 | 0.142631 | 1.000000 | 0.143938 | 0.113776 | 0.019657 | -0.097325 | -0.024740 | -0.005519 | -0.008955 | ... | 0.071606 | -0.070046 | 0.239431 | 0.077417 | -0.207960 | 0.001718 | 0.051849 | 0.018283 | -0.041784 | 0.232157 |
purpose | 0.017272 | 0.105305 | 0.143938 | 1.000000 | 0.203234 | 0.005263 | -0.052126 | -0.092747 | -0.035918 | -0.020423 | ... | 0.027161 | 0.066020 | 0.049489 | 0.028464 | 0.071995 | 0.025409 | 0.077245 | -0.116031 | 0.035655 | 0.051311 |
amount | 0.024561 | 0.624984 | 0.113776 | 0.203234 | 1.000000 | -0.107538 | -0.008367 | -0.271316 | -0.159434 | 0.037921 | ... | 0.224550 | 0.032716 | 0.045815 | 0.056119 | 0.020795 | 0.285385 | 0.017142 | -0.276995 | -0.050050 | 0.154739 |
savings_balance | -0.005614 | -0.064526 | 0.019657 | 0.005263 | -0.107538 | 1.000000 | 0.014600 | -0.000805 | 0.062953 | -0.047575 | ... | -0.004121 | -0.017997 | 0.009373 | 0.003268 | -0.004176 | -0.040803 | -0.021302 | 0.037452 | 0.005318 | -0.033871 |
employment_length | -0.108536 | 0.057381 | -0.097325 | -0.052126 | -0.008367 | 0.014600 | 1.000000 | 0.126161 | -0.181745 | -0.028758 | ... | 0.065533 | 0.256227 | -0.008676 | -0.044583 | 0.125791 | 0.101225 | 0.097192 | -0.060518 | -0.027232 | -0.116002 |
installment_rate | -0.057942 | 0.074749 | -0.024740 | -0.092747 | -0.271316 | -0.000805 | 0.126161 | 1.000000 | -0.081121 | -0.014835 | ... | 0.039353 | 0.058266 | 0.034750 | -0.073955 | 0.021669 | 0.097755 | -0.071207 | -0.014413 | -0.090024 | 0.072404 |
personal_status | 0.069946 | -0.116029 | -0.005519 | -0.035918 | -0.159434 | 0.062953 | -0.181745 | -0.081121 | 1.000000 | -0.011880 | ... | -0.099575 | -0.186563 | -0.065461 | 0.083146 | -0.089640 | -0.064335 | -0.238327 | 0.057207 | 0.009204 | 0.042643 |
other_debtors | 0.041970 | 0.006711 | -0.008955 | -0.020423 | 0.037921 | -0.047575 | -0.028758 | -0.014835 | -0.011880 | 1.000000 | ... | -0.101378 | -0.028294 | -0.000955 | 0.036219 | -0.017662 | -0.021106 | -0.010990 | 0.050996 | 0.107639 | 0.028441 |
residence_history | -0.059555 | 0.034067 | -0.027989 | 0.073651 | 0.028926 | -0.011772 | 0.245081 | 0.049302 | -0.106742 | -0.012690 | ... | 0.055260 | 0.266419 | -0.034517 | 0.255106 | 0.089625 | 0.012655 | 0.042643 | -0.095359 | -0.054097 | 0.002967 |
property | -0.005623 | 0.245655 | 0.071606 | 0.027161 | 0.224550 | -0.004121 | 0.065533 | 0.039353 | -0.099575 | -0.101378 | ... | 1.000000 | -0.054186 | 0.041147 | 0.022420 | 0.001209 | 0.244946 | -0.041111 | -0.155051 | -0.138772 | 0.090146 |
age | -0.049058 | -0.036136 | -0.070046 | 0.066020 | 0.032716 | -0.017997 | 0.256227 | 0.058266 | -0.186563 | -0.028294 | ... | -0.054186 | 1.000000 | 0.021858 | -0.108437 | 0.149254 | 0.015673 | 0.118201 | -0.145259 | -0.006151 | -0.091127 |
installment_plan | 0.033566 | 0.076992 | 0.239431 | 0.049489 | 0.045815 | 0.009373 | -0.008676 | 0.034750 | -0.065461 | -0.000955 | ... | 0.041147 | 0.021858 | 1.000000 | -0.077624 | 0.046993 | 0.009872 | 0.057595 | -0.030704 | -0.036734 | 0.104885 |
housing | 0.032925 | 0.011950 | 0.077417 | 0.028464 | 0.056119 | 0.003268 | -0.044583 | -0.073955 | 0.083146 | 0.036219 | ... | 0.022420 | -0.108437 | -0.077624 | 1.000000 | -0.052609 | 0.015201 | -0.015004 | 0.003307 | 0.005155 | 0.123815 |
existing_credits | -0.093081 | -0.011284 | -0.207960 | 0.071995 | 0.020795 | -0.004176 | 0.125791 | 0.021669 | -0.089640 | -0.017662 | ... | 0.001209 | 0.149254 | 0.046993 | -0.052609 | 1.000000 | -0.026321 | 0.109667 | -0.065553 | -0.009717 | -0.045732 |
job | -0.054255 | 0.210910 | 0.001718 | 0.025409 | 0.285385 | -0.040803 | 0.101225 | 0.097755 | -0.064335 | -0.021106 | ... | 0.244946 | 0.015673 | 0.009872 | 0.015201 | -0.026321 | 1.000000 | -0.093559 | -0.383022 | -0.100944 | 0.032735 |
dependents | -0.040889 | -0.023834 | 0.051849 | 0.077245 | 0.017142 | -0.021302 | 0.097192 | -0.071207 | -0.238327 | -0.010990 | ... | -0.041111 | 0.118201 | 0.057595 | -0.015004 | 0.109667 | -0.093559 | 1.000000 | 0.014753 | 0.077071 | -0.003015 |
telephone | 0.039209 | -0.164718 | 0.018283 | -0.116031 | -0.276995 | 0.037452 | -0.060518 | -0.014413 | 0.057207 | 0.050996 | ... | -0.155051 | -0.145259 | -0.030704 | 0.003307 | -0.065553 | -0.383022 | 0.014753 | 1.000000 | 0.107401 | 0.036466 |
foreign_worker | -0.000205 | -0.138196 | -0.041784 | 0.035655 | -0.050050 | 0.005318 | -0.027232 | -0.090024 | 0.009204 | 0.107639 | ... | -0.138772 | -0.006151 | -0.036734 | 0.005155 | -0.009717 | -0.100944 | 0.077071 | 0.107401 | 1.000000 | -0.082079 |
default | 0.197788 | 0.214927 | 0.232157 | 0.051311 | 0.154739 | -0.033871 | -0.116002 | 0.072404 | 0.042643 | 0.028441 | ... | 0.090146 | -0.091127 | 0.104885 | 0.123815 | -0.045732 | 0.032735 | -0.003015 | 0.036466 | -0.082079 | 1.000000 |
21 rows × 21 columns
Plotting the correlation matrix as a heatmap with the seaborn library shows that the correlations between variables are not high, so we can "naively" assume that each pair of features is independent.
import seaborn as sns
sns.set(font_scale=1.5)  # font size
plt.figure(figsize=(15, 15))
hm = sns.heatmap(corrmat, cbar=True, square=True, yticklabels=credit.columns, xticklabels=credit.columns, cmap="YlGnBu")
plt.show()
Model Selection
First we split the data into a training set, used to build the naive Bayes model, and a test set, used to evaluate model performance.
from sklearn import model_selection
from sklearn import metrics
y = credit['default']
#del credit['default']
X = credit.loc[:,'checking_balance':'foreign_worker']
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state=1)
We fit a multinomial naive Bayes model on the training set. predict_proba returns, for each row of X_test, the posterior probability of each of the two outcomes; each sample is assigned to the class with the larger posterior.
from sklearn.naive_bayes import MultinomialNB
clf_multi = MultinomialNB()
clf_multi.fit(X_train,y_train)
y_pred = clf_multi.predict(X_test)
print(clf_multi.predict_proba(X_test))
[[7.08355206e-09 9.99999993e-01]
[4.05129188e-26 1.00000000e+00]
[9.79691264e-01 2.03087364e-02]
[9.82619335e-01 1.73806654e-02]
[4.94859998e-04 9.99505140e-01]
[9.99998409e-01 1.59110968e-06]
[8.85299264e-04 9.99114701e-01]
[9.99912434e-01 8.75657002e-05]
[9.99998245e-01 1.75510442e-06]
[9.87150312e-01 1.28496881e-02]
[9.99791955e-01 2.08045167e-04]
[7.17413046e-02 9.28258695e-01]
[1.42769922e-09 9.99999999e-01]
[9.99997304e-01 2.69595250e-06]
[5.61504493e-08 9.99999944e-01]
[9.99995078e-01 4.92187029e-06]
[8.54170929e-01 1.45829071e-01]
[9.99988770e-01 1.12298436e-05]
[9.99290453e-01 7.09546685e-04]
[5.59657372e-14 1.00000000e+00]
[9.99999766e-01 2.33633118e-07]
[9.99991893e-01 8.10675856e-06]
[9.27312798e-01 7.26872020e-02]
[9.99999972e-01 2.81816265e-08]
[9.99572578e-01 4.27421788e-04]
[9.99554836e-01 4.45164490e-04]
[7.54262725e-11 1.00000000e+00]
[9.99583183e-01 4.16817416e-04]
[9.99999892e-01 1.07557333e-07]
[2.03671867e-20 1.00000000e+00]
[9.30936726e-03 9.90690633e-01]
[4.22137734e-01 5.77862266e-01]
[3.07988013e-24 1.00000000e+00]
[9.99892134e-01 1.07866386e-04]
[9.98896579e-01 1.10342119e-03]
[1.27516939e-19 1.00000000e+00]
[3.21326686e-13 1.00000000e+00]
[9.99978102e-01 2.18978094e-05]
[9.99543666e-01 4.56334493e-04]
[9.75120153e-01 2.48798467e-02]
[9.99783249e-01 2.16751016e-04]
[7.53225441e-01 2.46774559e-01]
[1.00000000e+00 1.52390295e-10]
[9.23327518e-01 7.66724815e-02]
[9.99999996e-01 3.96981750e-09]
[9.99201910e-01 7.98090310e-04]
[7.42536445e-01 2.57463555e-01]
[9.99779916e-01 2.20084349e-04]
[9.97465543e-01 2.53445731e-03]
[9.99522680e-01 4.77319741e-04]
[9.99729186e-01 2.70814335e-04]
[9.99736721e-01 2.63278865e-04]
[5.64344942e-04 9.99435655e-01]
[4.00026914e-05 9.99959997e-01]
[9.53117144e-06 9.99990469e-01]
[8.69420434e-01 1.30579566e-01]
[9.99999997e-01 3.07306827e-09]
[1.57057604e-06 9.99998429e-01]
[9.99530449e-01 4.69551156e-04]
[4.44235731e-07 9.99999556e-01]
[9.99997479e-01 2.52079810e-06]
[9.99985346e-01 1.46542234e-05]
[3.09048564e-19 1.00000000e+00]
[7.49952563e-01 2.50047437e-01]
[9.98645456e-01 1.35454403e-03]
[5.61016777e-15 1.00000000e+00]
[9.99599379e-01 4.00620678e-04]
[1.00000000e+00 4.94688483e-10]
[9.93921444e-01 6.07855588e-03]
[2.55830477e-12 1.00000000e+00]
[9.98128166e-01 1.87183359e-03]
[9.40583266e-01 5.94167339e-02]
[9.99999664e-01 3.35665419e-07]
[9.99965174e-01 3.48257351e-05]
[7.96495634e-01 2.03504366e-01]
[2.30045586e-01 7.69954414e-01]
[3.43845989e-01 6.56154011e-01]
[9.99999399e-01 6.00685647e-07]
[9.99092122e-01 9.07878038e-04]
[9.99932474e-01 6.75263730e-05]
[3.18979875e-11 1.00000000e+00]
[9.79179080e-01 2.08209204e-02]
[1.00000000e+00 1.36448477e-11]
[9.99993882e-01 6.11801758e-06]
[9.65786631e-01 3.42133689e-02]
[9.76111368e-01 2.38886319e-02]
[9.99999941e-01 5.85847263e-08]
[9.99999819e-01 1.80804466e-07]
[9.99922821e-01 7.71793093e-05]
[9.99993133e-01 6.86748630e-06]
[9.99268149e-01 7.31851195e-04]
[5.22545593e-01 4.77454407e-01]
[1.97973112e-10 1.00000000e+00]
[9.88164915e-01 1.18350852e-02]
[9.99969019e-01 3.09809787e-05]
[1.20452595e-05 9.99987955e-01]
[9.95753421e-01 4.24657901e-03]
[1.50879629e-01 8.49120371e-01]
[3.69758990e-03 9.96302410e-01]
[1.32728342e-07 9.99999867e-01]
[9.67105940e-01 3.28940603e-02]
[1.00000000e+00 2.51889396e-10]
[4.20607438e-05 9.99957939e-01]
[9.99999995e-01 4.50441505e-09]
[9.98196982e-01 1.80301823e-03]
[2.60895496e-01 7.39104504e-01]
[9.99964355e-01 3.56449651e-05]
[3.40718013e-08 9.99999966e-01]
[6.68017653e-28 1.00000000e+00]
[9.99982158e-01 1.78417815e-05]
[9.99999723e-01 2.77421602e-07]
[9.99797624e-01 2.02375590e-04]
[8.59527536e-01 1.40472464e-01]
[1.81821662e-08 9.99999982e-01]
[1.89849679e-08 9.99999981e-01]
[1.90552896e-07 9.99999809e-01]
[9.99991452e-01 8.54833987e-06]
[9.99999999e-01 8.90283237e-10]
[9.86238948e-01 1.37610516e-02]
[9.19056892e-01 8.09431077e-02]
[5.20189940e-15 1.00000000e+00]
[1.07176072e-06 9.99998928e-01]
[1.96524734e-08 9.99999980e-01]
[9.22312119e-02 9.07768788e-01]
[9.99999561e-01 4.38736439e-07]
[3.14262425e-02 9.68573758e-01]
[9.99656508e-01 3.43491997e-04]
[4.49764172e-01 5.50235828e-01]
[9.99551027e-01 4.48973406e-04]
[4.67593429e-01 5.32406571e-01]
[9.99995983e-01 4.01738206e-06]
[9.79193431e-01 2.08065690e-02]
[4.33184050e-06 9.99995668e-01]
[9.99993077e-01 6.92318920e-06]
[9.99999812e-01 1.88485229e-07]
[3.02205341e-06 9.99996978e-01]
[9.07568653e-01 9.24313468e-02]
[1.22717545e-01 8.77282455e-01]
[9.99982544e-01 1.74558917e-05]
[9.99999996e-01 3.78412084e-09]
[5.61424124e-05 9.99943858e-01]
[9.82620365e-01 1.73796352e-02]
[1.94171276e-05 9.99980583e-01]
[9.83179110e-01 1.68208896e-02]
[9.99971324e-01 2.86761851e-05]
[1.86978428e-09 9.99999998e-01]
[7.06519691e-01 2.93480309e-01]
[8.80296829e-01 1.19703171e-01]
[9.87132431e-01 1.28675686e-02]
[1.65934444e-08 9.99999983e-01]
[9.99999995e-01 5.22521465e-09]
[9.97170829e-01 2.82917130e-03]
[9.99995505e-01 4.49460473e-06]
[9.97536742e-01 2.46325758e-03]
[1.17003871e-05 9.99988300e-01]
[2.75965897e-01 7.24034103e-01]
[4.72459215e-04 9.99527541e-01]
[9.99603650e-01 3.96350493e-04]
[9.99993266e-01 6.73357777e-06]
[9.95930147e-01 4.06985314e-03]
[9.98430108e-01 1.56989215e-03]
[1.02950719e-14 1.00000000e+00]
[9.95504721e-01 4.49527932e-03]
[9.88755899e-01 1.12441013e-02]
[5.76970096e-29 1.00000000e+00]
[9.99994030e-01 5.96964214e-06]
[8.04594587e-01 1.95405413e-01]
[2.73498848e-02 9.72650115e-01]
[9.98062495e-01 1.93750460e-03]
[9.99976044e-01 2.39560643e-05]
[5.51307112e-05 9.99944869e-01]
[9.99999921e-01 7.91559779e-08]
[9.99999870e-01 1.30319104e-07]
[9.99999957e-01 4.27677987e-08]
[8.68652943e-01 1.31347057e-01]
[9.99878314e-01 1.21685501e-04]
[9.97220154e-01 2.77984582e-03]
[9.99998005e-01 1.99475475e-06]
[1.12048195e-01 8.87951805e-01]
[9.98556552e-01 1.44344822e-03]
[4.74835052e-01 5.25164948e-01]
[8.85006321e-01 1.14993679e-01]
[9.99791624e-01 2.08376168e-04]
[7.14924653e-14 1.00000000e+00]
[9.11613644e-01 8.83863562e-02]
[9.99168495e-01 8.31504595e-04]
[9.99999999e-01 1.40036211e-09]
[8.95294053e-01 1.04705947e-01]
[9.94302903e-01 5.69709688e-03]
[2.58934387e-12 1.00000000e+00]
[9.99968031e-01 3.19694335e-05]
[1.00483240e-01 8.99516760e-01]
[9.99883869e-01 1.16131177e-04]
[9.99998999e-01 1.00050972e-06]
[9.41217448e-01 5.87825518e-02]
[9.99999144e-01 8.56187393e-07]
[3.04245716e-02 9.69575428e-01]
[9.99950646e-01 4.93539798e-05]
[9.94764447e-01 5.23555342e-03]
[9.99725874e-01 2.74126170e-04]
[9.99675935e-01 3.24064676e-04]
[9.99988898e-01 1.11021676e-05]
[4.96753226e-01 5.03246774e-01]
[9.92510677e-01 7.48932261e-03]
[3.71443861e-02 9.62855614e-01]
[9.99880010e-01 1.19989622e-04]
[5.43341069e-01 4.56658931e-01]
[2.23740846e-02 9.77625915e-01]
[2.16734441e-07 9.99999783e-01]
[8.28053468e-03 9.91719465e-01]
[6.42953055e-04 9.99357047e-01]
[2.31551620e-05 9.99976845e-01]
[9.99999954e-01 4.59849329e-08]
[9.99999995e-01 4.68993640e-09]
[9.99999996e-01 4.44118198e-09]
[1.95727744e-13 1.00000000e+00]
[9.97044910e-01 2.95509029e-03]
[1.00000000e+00 4.70059652e-11]
[4.91843429e-12 1.00000000e+00]
[9.99151819e-01 8.48180737e-04]
[9.99990247e-01 9.75330022e-06]
[9.99900851e-01 9.91486749e-05]
[5.19925707e-01 4.80074293e-01]
[9.98595280e-01 1.40472032e-03]
[9.99998228e-01 1.77218388e-06]
[9.70866147e-01 2.91338528e-02]
[7.48246743e-07 9.99999252e-01]
[9.99973845e-01 2.61547528e-05]
[1.20010195e-08 9.99999988e-01]
[9.99989583e-01 1.04166523e-05]
[3.28748185e-08 9.99999967e-01]
[9.99999966e-01 3.38654087e-08]
[5.05940992e-03 9.94940590e-01]
[9.99980420e-01 1.95803984e-05]
[9.99748556e-01 2.51444300e-04]
[9.99882532e-01 1.17467621e-04]
[9.98448749e-01 1.55125101e-03]
[9.99999884e-01 1.16129977e-07]
[1.53291657e-03 9.98467083e-01]
[9.76220766e-01 2.37792338e-02]
[8.44717999e-03 9.91552820e-01]
[9.99886188e-01 1.13812358e-04]
[3.90368809e-03 9.96096312e-01]
[9.99999858e-01 1.42492789e-07]
[9.99999698e-01 3.01588597e-07]
[9.30141719e-01 6.98582809e-02]
[9.99985365e-01 1.46346301e-05]
[9.99920174e-01 7.98259279e-05]
[9.99996587e-01 3.41321375e-06]
[9.96987845e-01 3.01215503e-03]
[9.99483133e-01 5.16866706e-04]
[9.91463705e-01 8.53629509e-03]
[3.15926974e-10 1.00000000e+00]
[2.14783690e-03 9.97852163e-01]
[7.50823010e-01 2.49176990e-01]
[9.99999137e-01 8.62819165e-07]
[6.99934104e-03 9.93000659e-01]
[5.46894966e-15 1.00000000e+00]
[2.13290238e-05 9.99978671e-01]
[9.99793159e-01 2.06840609e-04]
[8.74970112e-01 1.25029888e-01]
[9.99867579e-01 1.32420724e-04]
[1.00000000e+00 2.04284020e-10]
[9.99997195e-01 2.80502361e-06]
[9.99999157e-01 8.43225198e-07]
[9.99999948e-01 5.17726713e-08]
[2.40872121e-02 9.75912788e-01]
[9.47413892e-01 5.25861076e-02]
[6.54889629e-09 9.99999993e-01]
[3.40621192e-06 9.99996594e-01]
[4.30630753e-01 5.69369247e-01]
[9.99993574e-01 6.42617921e-06]
[9.98944910e-01 1.05508978e-03]
[9.16489524e-01 8.35104757e-02]
[9.99418072e-01 5.81928176e-04]
[5.40842574e-04 9.99459157e-01]
[9.99998061e-01 1.93903762e-06]
[3.47609654e-02 9.65239035e-01]
[9.99812899e-01 1.87101404e-04]
[9.99746283e-01 2.53716520e-04]
[2.81628794e-04 9.99718371e-01]
[9.99828951e-01 1.71048982e-04]
[9.99969589e-01 3.04105630e-05]
[8.55977708e-02 9.14402229e-01]
[3.44376719e-11 1.00000000e+00]
[9.99999699e-01 3.01499794e-07]
[9.99998707e-01 1.29347853e-06]
[9.99958928e-01 4.10716828e-05]
[9.93314781e-01 6.68521864e-03]
[3.21042931e-09 9.99999997e-01]
[9.86472042e-01 1.35279576e-02]
[9.99904973e-01 9.50269676e-05]
[2.83671375e-04 9.99716329e-01]
[9.98868733e-01 1.13126730e-03]
[2.44074089e-01 7.55925911e-01]
[4.94339246e-04 9.99505661e-01]
[3.39748678e-05 9.99966025e-01]
[9.99999809e-01 1.91303777e-07]
[1.16822919e-03 9.98831771e-01]
[9.99997637e-01 2.36342666e-06]]
Model Evaluation
print('Multinomial naive Bayes results:')
print('Test set accuracy:')
print(clf_multi.score(X_test,y_test))
print('Precision:')
print(metrics.precision_score(y_test,y_pred))
print('Confusion matrix:')
print(metrics.confusion_matrix(y_true=y_test,y_pred=y_pred,labels=list(set(y))))
print(metrics.classification_report(y_test,y_pred))
Multinomial naive Bayes results:
Test set accuracy:
0.6233333333333333
Precision:
0.7537688442211056
Confusion matrix:
[[150  64]
 [ 49  37]]
              precision    recall  f1-score   support
           1       0.75      0.70      0.73       214
           2       0.37      0.43      0.40        86
    accuracy                           0.62       300
   macro avg       0.56      0.57      0.56       300
weighted avg       0.64      0.62      0.63       300
The model's precision is 75.38% and its test set accuracy is 62.33%, but recall for the default class is only 43%. The multinomial naive Bayes predictions are therefore not yet precise enough.
Model Tuning
Next we fit a Gaussian naive Bayes model on the same training set.
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print('Gaussian naive Bayes results:')
print('Test set accuracy:')
print(clf.score(X_test,y_test))
print('Precision:')
print(metrics.precision_score(y_test,y_pred))
print('Confusion matrix:')
print(metrics.confusion_matrix(y_true=y_test,y_pred=y_pred,labels=list(set(y))))
print(metrics.classification_report(y_test,y_pred))
Gaussian naive Bayes results:
Test set accuracy:
0.7233333333333334
Precision:
0.8046511627906977
Confusion matrix:
[[173  41]
 [ 42  44]]
              precision    recall  f1-score   support
           1       0.80      0.81      0.81       214
           2       0.52      0.51      0.51        86
    accuracy                           0.72       300
   macro avg       0.66      0.66      0.66       300
weighted avg       0.72      0.72      0.72       300
Precision improves to 80.47%, test set accuracy rises to 72.33%, and recall for the default class improves to 51%.
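A single train/test split can be noisy, so as a further check one might compare the two models with 5-fold cross-validation. The block below uses randomly generated stand-in data so it runs on its own; the same calls would apply to the X and y defined above:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Randomly generated stand-in for the credit features: non-negative
# integers, since MultinomialNB requires non-negative input.
rng = np.random.RandomState(1)
X_demo = rng.randint(0, 5, size=(300, 10))
y_demo = (X_demo[:, 0] + rng.randint(0, 3, size=300) > 3).astype(int)

for model in (MultinomialNB(), GaussianNB()):
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))
```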
Source: https://blog.csdn.net/Ashmore/article/details/116332482