I have a classification problem where my labels are ratings from 0 to 100 in increments of 1 (e.g. 1, 2, 3, 4, ...).

I have a data set where each row has a name, text corpus, and a rating (0 - 100).

From the text corpus I am trying to extract features that I can feed into my classifier, which will output a corresponding rating per row (0 - 100).

For feature selection, I am thinking of starting with a basic bag of words. My question is about the classification algorithm, however. Is there a classification algorithm in scikit-learn that supports this kind of problem?

I was reading http://scikit-learn.org/stable/modules/multiclass.html, but the algorithms described seem to support labels that are completely discrete, whereas I have a set of continuous labels.

EDIT: What about the case where I bin my ratings? For example, I could have 10 labels, numbered 1 to 10.

2 Answers

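You can use OneHotEncoder to preprocess your data, converting your single 1-to-100 feature into 100 binary features, one for each value in the interval [1..100]. You will then have 100 labels and can train a multiclass classifier.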
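
A minimal sketch of the multiclass route, under some assumptions of mine: a bag-of-words representation built with CountVectorizer, LogisticRegression as the classifier, and a tiny made-up data set. Note that scikit-learn classifiers accept integer class labels directly, so this sketch passes the ratings as they are rather than one-hot encoding them first.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: each text has an integer rating that is
# treated as an unordered class label.
texts = [
    "terrible awful bad",
    "bad poor weak",
    "great excellent superb",
    "excellent wonderful great",
]
ratings = [5, 5, 95, 95]

# Bag-of-words features feeding a multiclass-capable classifier.
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, ratings)

# Predictions are always one of the ratings seen during training.
pred = clf.predict(["awful bad", "great superb"])
print(pred)
```

With this setup the classifier can only output ratings it saw during training, and it has no notion that 94 is closer to 95 than to 5.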

However, I would suggest using regression instead.

Answered 2014-11-04T08:25:46.843

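You can use multiple regression instead of classification. You can cluster the n-gram features from the text corpus into a dictionary and use it to form a feature set. Using this feature set, train a regression model, whose output can be a continuous value. You can then round the real-valued output to get a discrete label from 1 to 100.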
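
A minimal sketch of the regression route, with some assumptions of mine: the n-gram dictionary is simply the vocabulary that CountVectorizer builds (no clustering step), the regressor is Ridge, the data is made up, and predictions are clipped to the 0-100 range from the question after rounding.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: texts with numeric ratings.
texts = [
    "terrible awful bad",
    "bad poor weak",
    "great excellent superb",
    "excellent wonderful great",
]
ratings = [10, 20, 80, 90]

# Unigram and bigram counts as features, linear regression on top.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(texts, ratings)

# The raw output is continuous; round it and clip it to the valid range
# to recover an integer rating.
raw = model.predict(["awful bad", "great superb"])
labels = np.clip(np.rint(raw), 0, 100).astype(int)
print(raw, labels)
```

Unlike the multiclass setup, the regressor can produce ratings that never appeared in the training data, and its errors respect the ordering of the ratings.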

Answered 2014-11-04T13:00:45.120