python - Training a sklearn LogisticRegression classifier without all possible labels

Question

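I am trying to use scikit-learn 0.12.1 to:

Train a LogisticRegression classifier
Evaluate the classifier on held-out validation data
Feed new data to this classifier and retrieve the 5 most probable labels for each observation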

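Sklearn makes all of this very easy except for one peculiarity. There is no guarantee that every possible label will occur in the data used to fit my classifier. There are hundreds of possible labels, and some of them do not occur in the available training data.

This results in two problems:

1. Label vectorizers do not recognize previously unseen labels when they occur in the validation data. This is easily fixed by fitting the labeler to the full set of possible labels, but it exacerbates problem 2.
2. The output of the LogisticRegression classifier's predict_proba method is an [n_samples, n_classes] array, where n_classes covers only the classes seen in the training data. This means that running argsort on the predict_proba array no longer yields values that map directly to the label vectorizer's vocabulary.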
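
The index mismatch in the second problem can be illustrated without fitting a model. The following is a minimal sketch with made-up values: `classes_seen` stands in for `clf.classes_`, `prob` for the output of `predict_proba`, and `all_labels` for the full label vocabulary. All three names and their contents are assumptions for illustration only.

```python
import numpy as np

# Hypothetical full vocabulary of labels, in sorted order.
all_labels = np.array(["cat", "dog", "fish"])
# Stand-in for clf.classes_: "dog" never appeared in the training data.
classes_seen = np.array(["cat", "fish"])
# Stand-in for clf.predict_proba(X): one column per *seen* class.
prob = np.array([[0.2, 0.8],
                 [0.9, 0.1]])

# Column index of the most probable class in each row.
best_col = prob.argmax(axis=1)

# Misreads the column indices as positions in the full vocabulary.
wrong = all_labels[best_col]
# Correct lookup: the columns are ordered like classes_seen.
right = classes_seen[best_col]

print(wrong)
print(right)
```

In the first row the most probable column is index 1, which is "fish" among the seen classes but "dog" in the full vocabulary.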

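My question is: what is the best way to force the classifier to recognize the full set of possible classes, even when some of them do not occur in the training data? Obviously it will have trouble learning labels it has never seen data for, but probabilities of 0 are perfectly usable in my situation.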

score 8 · Accepted Answer

Here is a workaround. Make sure you have a list of all classes called all_classes. Then, if clf is your LogisticRegression classifier,

from itertools import repeat

# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(all_classes).difference(clf.classes_)

# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
for row in prob:
    # pair each trained class with its probability in this row, then
    # append the untrained classes with probability 0.
    # (zip returns an iterator in Python 3, so convert to lists first.)
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))

For each row, this produces a list of (cls, prob) pairs.
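
Since the original goal was the five most probable labels per observation, the (cls, prob) pairs for a row can be ranked directly. A minimal sketch, assuming the pairs have already been built; `top_k_labels` and the sample values are hypothetical, chosen only for illustration.

```python
import heapq

def top_k_labels(prob_per_class, k=5):
    """Return the k labels with the highest probability from (cls, prob) pairs."""
    best = heapq.nlargest(k, prob_per_class, key=lambda pair: pair[1])
    return [cls for cls, _ in best]

# Hypothetical pairs for one row; "d" was never seen in training, so it gets 0.
pairs = [("a", 0.1), ("b", 0.6), ("c", 0.3), ("d", 0.0)]
top_two = top_k_labels(pairs, k=2)
print(top_two)
```

If k exceeds the number of classes, `heapq.nlargest` simply returns all pairs in descending order of probability.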

score 3 · Accepted Answer

If what you want is an array like the one returned by predict_proba, but with columns corresponding to the sorted all_classes, then:

import numpy

all_classes = numpy.array(sorted(all_classes))
# Get the probabilities for learnt classes
prob = clf.predict_proba(test_samples)
# Create the result matrix, where all values are initially zero
new_prob = numpy.zeros((prob.shape[0], all_classes.size))
# Set the columns corresponding to clf.classes_
# (assumes every entry of clf.classes_ is present in all_classes)
new_prob[:, all_classes.searchsorted(clf.classes_)] = prob
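
To see what the column assignment does, here is a small sketch with made-up values in place of a fitted classifier: `classes_seen` stands in for `clf.classes_` and `prob` for the `predict_proba` output (both are assumptions for illustration). It also checks that every seen class is in `all_classes`, because `searchsorted` returns an insertion position rather than raising an error when a value is missing.

```python
import numpy as np

all_classes = np.array(sorted(["fish", "cat", "dog"]))  # cat, dog, fish
classes_seen = np.array(["cat", "fish"])                # stand-in for clf.classes_
prob = np.array([[0.2, 0.8],
                 [0.9, 0.1]])                           # stand-in for predict_proba

# searchsorted only gives meaningful positions for values that are present.
assert np.isin(classes_seen, all_classes).all()

# Position of each seen class within the sorted full class list.
cols = all_classes.searchsorted(classes_seen)

# Start from zeros, then copy the known probabilities into their columns.
new_prob = np.zeros((prob.shape[0], all_classes.size))
new_prob[:, cols] = prob
print(new_prob)
```

The "dog" column stays at zero because no probability was ever assigned to it.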

score 2 · Accepted Answer

Building on larsman's excellent answer, I ended up with the following:

from itertools import repeat
import numpy as np

# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(all_classes).difference(clf.classes_)

# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
new_prob = []
for row in prob:
    # pair trained classes with this row's probabilities and pad the
    # untrained classes with 0 (zip is an iterator in Python 3)
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))
    # put the probabilities in class order
    prob_per_class = sorted(prob_per_class)
    # append a list, not a generator, so np.asarray builds a 2-D array
    new_prob.append([p for _, p in prob_per_class])
new_prob = np.asarray(new_prob)

new_prob is an [n_samples, n_classes] array, just like the output of predict_proba, except that it now includes 0 probabilities for the previously unseen classes.
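
With the columns of new_prob aligned to the sorted class list, the most probable labels per observation can be read off with argsort. A sketch under that assumption, using hypothetical values; `top_k` is an illustrative helper name and not part of scikit-learn.

```python
import numpy as np

def top_k(new_prob, sorted_classes, k=5):
    """Return, per row, the k class labels with the highest probability."""
    sorted_classes = np.asarray(sorted_classes)
    # argsort is ascending, so negate to rank the largest probabilities first.
    order = np.argsort(-new_prob, axis=1, kind="stable")[:, :k]
    return sorted_classes[order]

# Hypothetical data: columns follow the sorted class order cat, dog, fish.
sorted_classes = ["cat", "dog", "fish"]
new_prob = np.array([[0.2, 0.0, 0.8],
                     [0.9, 0.0, 0.1]])
labels = top_k(new_prob, sorted_classes, k=2)
print(labels)
```

Because the column order now matches the full class list, the argsort indices can be used to index that list directly, which is what the question was after.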
