
python – Predict classes or class probabilities?

2019-09-29 15:56:19 Author: Internet

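I am currently using H2O on a dataset for a classification problem. I am testing it with H2ORandomForestEstimator in a Python 3.6 environment. I noticed that the predict method returns values between 0 and 1 (I am assuming these are probabilities).

In my dataset, the target attribute is numeric, i.e. True values are 1 and False values are 0. I made sure to convert the type of the target attribute to category, but I still got the same result.

I then modified the code to convert the target column to a factor using the asfactor() method on the H2OFrame, and the result still did not change.

But when I changed the values in the target attribute from 1 and 0 to True and False respectively, I got the expected result, i.e. the output was the classification rather than the probability.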

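>What is the right way to get the classified prediction result?
>If probabilities are the outcomes for numerical target values, then how do I handle it in case of a multiclass classification?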

Solution:

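In principle and in theory, hard and soft classification (i.e. returning classes and probabilities respectively) are different approaches, each with its own merits and downsides. Consider for example the following, from the paper Hard or Soft Classification? Large-margin Unified Machines: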

Margin-based classifiers have been popular in both machine learning and statistics for classification problems. Among numerous classifiers, some are hard classifiers while some are soft ones. Soft classifiers explicitly estimate the class conditional probabilities and then perform classification based on estimated probabilities. In contrast, hard classifiers directly target on the classification decision boundary without producing the probability estimation. These two types of classifiers are based on different philosophies and each has its own merits.

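That said, in practice, most of the classifiers used today, including Random Forest (the only exception I can think of is the SVM family), are in fact soft classifiers: what they actually produce underneath is a probability-like measure which, combined with an implicit threshold (usually 0.5 by default in the binary case), gives a hard class membership such as 0/1 or True/False.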

What is the right way to get the classified prediction result?

For starters, it is always possible to go from probabilities to hard classes, but the opposite is not true.
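To make the one-way nature of this conversion concrete, here is a minimal sketch in plain Python. The function name `to_hard_labels`, the `>=` comparison, and the probability values are all assumptions chosen for illustration; they are not taken from H2O or any other library.

```python
def to_hard_labels(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels using an explicit threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical positive-class probabilities from some soft classifier.
probs = [0.51, 0.99, 0.49, 0.02]

labels = to_hard_labels(probs)
print(labels)  # [1, 1, 0, 0]

# The mapping is many-to-one: 0.51 and 0.99 both collapse to label 1,
# so the original probabilities cannot be recovered from the labels.
stricter = to_hard_labels(probs, threshold=0.9)
print(stricter)  # [0, 1, 0, 0]
```

Keeping the probabilities around means the threshold can be changed later without retraining or re-scoring; keeping only the labels fixes that choice for good.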

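Generally speaking, and given that your classifier is in fact a soft one, getting only the final hard classifications (True/False) gives a "black box" flavor to the process, which in principle should be undesirable; handling the produced probabilities directly, and (important!) controlling the decision threshold explicitly, should be the preferable way here. In my experience, these are subtleties that are often lost on new practitioners; consider for example the following, from the Cross Validated thread Classification probability threshold: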

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.

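Apart from "soft" arguments (pun unintended) like the above, there are cases where you need to handle the underlying probabilities and thresholds directly, i.e. cases where the default threshold of 0.5 in binary classification will lead you astray, most notably when your classes are imbalanced; see my answer in High AUC but bad predictions with imbalanced data (and the links therein) for a concrete example of such a case.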
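The following sketch illustrates the imbalanced case with a toy example. The labels and scores are made up for illustration (they do not come from the linked answer), and `recall_and_precision` is a hypothetical helper. The scores rank both positives above every negative, yet none of them reaches 0.5.

```python
def recall_and_precision(y_true, probabilities, threshold):
    """Compute recall and precision of the positive class at a given threshold."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for t, y in zip(y_true, predicted) if t == 1 and y == 1)
    fp = sum(1 for t, y in zip(y_true, predicted) if t == 0 and y == 1)
    fn = sum(1 for t, y in zip(y_true, predicted) if t == 1 and y == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Made-up toy data: 2 positives among 10 samples (imbalanced).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.30, 0.35, 0.45]

# With the default threshold, no sample is predicted positive.
default_recall, default_precision = recall_and_precision(y_true, probs, 0.5)
print(default_recall, default_precision)  # 0.0 0.0

# Lowering the threshold recovers both positives at the cost of one false positive.
lowered_recall, lowered_precision = recall_and_precision(y_true, probs, 0.25)
print(lowered_recall, lowered_precision)
```

The threshold of 0.25 is arbitrary here; in a real problem it would be chosen from the costs of false positives and false negatives, which is the "decision component" mentioned in the quote above.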

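To be honest, I am rather surprised by the behavior of H2O that you report (I have not used it personally), i.e. that the kind of output is affected by the representation of the input; this should not be the case, and if it is, we may have an issue of bad design. Compare for example the Random Forest classifier in scikit-learn, which includes two different methods, predict and predict_proba, to get the hard classifications and the underlying probabilities respectively (and checking the docs, it is apparent that the output of predict is based on the probability estimates, which have already been computed).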
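As a small illustration of the scikit-learn behavior described above, the sketch below fits a Random Forest on synthetic data and compares the two methods. The dataset, the hyperparameters, and the variable names are assumptions chosen only to keep the example short; it says nothing about how H2O behaves.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:5])  # shape (5, 2): one column per class in clf.classes_
hard = clf.predict(X[:5])         # shape (5,): hard class labels

# predict() returns the class with the highest mean probability estimate,
# so it should match an argmax over the columns of predict_proba().
from_proba = clf.classes_[np.argmax(proba, axis=1)]

print(proba)
print(hard, from_proba)
```

The columns of `proba` follow the order of `clf.classes_`, which is why the argmax index is mapped back through that attribute rather than used as a label directly.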

If probabilities are the outcomes for numerical target values, then how do I handle it in case of a multiclass classification?

Nothing new here in principle, apart from the fact that a simple threshold is no longer meaningful; again, from the Random Forest predict docs in scikit-learn:

the predicted class is the one with highest mean probability estimate

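That is, for 3 classes (0, 1, 2), you get an estimate of [p0, p1, p2] (with elements summing to one, as per the rules of probability), and the predicted class is the one with the highest probability, e.g. class #1 for the case of [0.12, 0.60, 0.28]. Here is a reproducible example with the 3-class iris dataset (it is for the GBM algorithm and in R, but the rationale is the same).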
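The multiclass rule from the paragraph above can be sketched in a few lines of plain Python, using the same [0.12, 0.60, 0.28] vector. The helper name `predict_class` is hypothetical.

```python
def predict_class(class_probabilities):
    """Return the index of the class with the highest estimated probability."""
    return max(range(len(class_probabilities)), key=lambda i: class_probabilities[i])


probs = [0.12, 0.60, 0.28]  # [p0, p1, p2] for classes 0, 1, 2

# The estimates form a probability distribution over the classes.
total = sum(probs)

predicted = predict_class(probs)
print(predicted)  # 1
```

With more than two classes there is no single cut-off to tune, so the argmax takes the place of the binary threshold.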

Tags: random-forest,python,machine-learning,classification,h2o
Source: https://codeday.me/bug/20190929/1832106.html