python – Predict classes or class probabilities?


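I am currently using H2O on a dataset for a classification problem. I am testing it with H2ORandomForestEstimator in a Python 3.6 environment. I noticed that the predict method returns values between 0 and 1 (I assume these are probabilities).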






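In principle and in theory, hard and soft classification (i.e. returning classes and probabilities, respectively) are different approaches, each with its own merits and downsides. Consider, for example, the following from the paper Hard or Soft Classification? Large-margin Unified Machines: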

Margin-based classifiers have been popular in both machine learning and statistics for classification problems. Among numerous classifiers, some are hard classifiers while some are soft ones. Soft classifiers explicitly estimate the class conditional probabilities and then perform classification based on estimated probabilities. In contrast, hard classifiers directly target on the classification decision boundary without producing the probability estimation. These two types of classifiers are based on different philosophies and each has its own merits.

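That said, in practice, most of the classifiers used today, including Random Forest (the only exception I can think of is the SVM family), are in fact soft classifiers: what they actually produce underneath is a probability-like measure which, combined with an implicit threshold (usually 0.5 by default in the binary case), then gives a hard class membership such as 0/1 or True/False.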
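To make the implicit step concrete, here is a minimal sketch in plain Python of how positive-class probabilities become hard labels. The function name to_hard_labels and the sample values are made up for illustration, and treating a score exactly at the threshold as positive is an assumption; this is not H2O's internal code.

```python
def to_hard_labels(probabilities, threshold=0.5):
    """Turn positive-class probabilities into hard 0/1 labels.

    Assumption: a score equal to the threshold counts as positive.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical soft outputs for four samples
probs = [0.10, 0.45, 0.50, 0.93]
labels = to_hard_labels(probs)
```

The probabilities carry strictly more information than the labels: 0.50 and 0.93 both end up as 1, although the classifier is far less certain about the former.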

What is the right way to get the classified prediction result?


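Generally speaking, given that your classifier is in fact a soft one, getting just the final hard classifications (True/False) gives a "black box" flavor to the process, which in principle should be undesirable; handling the produced probabilities directly, and (important!) explicitly controlling the decision threshold, should be the preferable way here. In my experience, these are subtleties that are often lost on new practitioners; consider, for example, the following from the Cross Validated thread Classification probability threshold: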

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.

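Apart from "soft" arguments (pun unintended) like the above, there are cases where you need to handle the underlying probabilities and thresholds directly, i.e. cases where the default threshold of 0.5 in binary classification will lead you astray, most notably when your classes are imbalanced; see my answer in High AUC but bad predictions with imbalanced data (and the links therein) for a concrete example of such a case.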
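As a rough illustration of the imbalanced case, the sketch below uses invented labels and probabilities (2 positives out of 10 samples) and compares the default threshold of 0.5 with a lower one. The helper names classify and recall, the data, and the alternative threshold of 0.25 are all assumptions chosen for the example, not a recommendation.

```python
def classify(probs, threshold):
    """Apply an explicit decision threshold to positive-class probabilities."""
    return [1 if p >= threshold else 0 for p in probs]


def recall(y_true, y_pred):
    """Fraction of the actual positives that were predicted as positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Made-up imbalanced data: only the last two samples are positive,
# and the model never assigns any sample a probability above 0.5.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
p_pos = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.35, 0.30, 0.45]

recall_default = recall(y_true, classify(p_pos, 0.5))   # no positive is found
recall_lowered = recall(y_true, classify(p_pos, 0.25))  # both positives are found
```

Lowering the threshold recovers both positives here at the price of one false positive (the sample scored 0.35); which trade-off is acceptable belongs to the decision component, not to the model.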


If probabilities are the outcomes for numerical target values, then how do I handle it in case of a multiclass classification?


the predicted class is the one with highest mean probability estimate

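That is, for 3 classes (0, 1, 2), you get an estimate of [p0, p1, p2] (with the elements summing to one, as per the rules of probability), and the predicted class is the one with the highest probability, e.g. class #1 for the case of [0.12, 0.60, 0.28]. Here is a reproducible example with the 3-class iris dataset (it is for the GBM algorithm and in R, but the rationale is the same).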
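The argmax rule for the numbers above can be sketched in a few lines of plain Python. The function name predict_class is hypothetical, and breaking ties in favor of the lowest class index is an assumption of this sketch.

```python
def predict_class(probabilities):
    """Return the index of the class with the highest probability.

    Assumption: on ties, the lowest class index wins.
    """
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


probs = [0.12, 0.60, 0.28]          # [p0, p1, p2] from the example above
predicted = predict_class(probs)    # class #1
```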

Source: https://codeday.me/bug/20190929/1832106.html