python - Incorporating emoticons into a scikit model

Question

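I am using scikit to train an SVM classifier on a text dataset. The documentation covers using a count vectorizer to construct feature vectors from n-grams. For example, for unigrams and bigrams I can do the following:

   CountVectorizer(ngram_range=(1, 2))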

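However, I am not sure how you would go about building emoticons into the feature vector. There seem to be two options available: either use a regex that matches emoticons and feed it into the token_pattern argument of CountVectorizer, or build a custom vocabulary that includes emoticons and feed it into the vocabulary argument. Any advice, or in particular a simple example, would be great! Also, let me know if I have missed any other important information.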
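
As a rough illustration of the first option, here is a minimal sketch that passes a custom regex as token_pattern. The pattern itself is hypothetical and only covers a handful of simple emoticons; lowercase=False is my assumption so that ':D' is not folded into ':d'. Only non-capturing groups are used, so each whole match becomes a token.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pattern: a small emoticon (eyes, optional nose, mouth)
# OR an ordinary word of two or more characters.
EMOTICON_OR_WORD = r"(?:[:;=][-']?[()DPp/\\|])|(?:\b\w\w+\b)"

vect = CountVectorizer(
    token_pattern=EMOTICON_OR_WORD,
    lowercase=False,  # keep ':D' distinct from ':d'
)

samples = ["great day :) :)", "missed the bus :-("]
X = vect.fit_transform(samples)

# Inspect which tokens were learned and how often each occurs per sample.
print(sorted(vect.vocabulary_))
print(X.toarray())
```

Because words and emoticons come out of the same tokenizer here, setting ngram_range=(1, 2) would also produce bigrams that mix the two.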

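Edit: my solution

After some experimentation with the above problem, this is the code that worked for me. It assumes that you have split your data into arrays, e.g.: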

training_data, training_labels, test_data, test_labels

We use a CountVectorizer, so first import that:

from sklearn.feature_extraction.text import CountVectorizer
c_vect = CountVectorizer()

Then build your list of emoticons as an array. (I got my list from a text dump online):

emoticon_list = [':)', ':-)', ':(']  # .... etc. - put your long list of emoticons here

Next, fit the CountVectorizer to the array of emoticons. It is crucial to use fit rather than fit_transform:

X = c_vect.fit(emoticon_list)

Then use the transform method to construct a feature vector by counting the number of emoticons in the training data (in my case, an array of tweets):

emoticon_training_features = c_vect.transform(training_data)
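
One caveat worth checking: CountVectorizer's default token_pattern keeps only runs of two or more word characters, so punctuation-only emoticons such as ':)' may be dropped before they are counted. The sketch below is one possible workaround, not necessarily what the code above did: it passes the emoticon list as a fixed vocabulary and tokenizes on whitespace. The token_pattern, lowercase=False and the toy tweets are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder list and tweets, for illustration only.
emoticon_list = [":)", ":-)", ":("]
training_data = ["so happy :) :)", "train cancelled :(", "no emoticon here"]

c_vect = CountVectorizer(
    vocabulary=emoticon_list,  # fixed columns, one per emoticon, in list order
    token_pattern=r"\S+",      # every whitespace-separated chunk is a token
    lowercase=False,
)

# With a fixed vocabulary, transform can count occurrences directly.
emoticon_training_features = c_vect.transform(training_data)
print(emoticon_training_features.toarray())
```

Whitespace tokenization only counts emoticons that stand alone, so one glued to a word (as in "happy:)") would be missed.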

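Now we can train our classifier, clf, using the labels and our new emoticon feature vector (bear in mind that for some classifiers, such as SVC, you will first need to convert your string labels to appropriate numbers):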

clf.fit(emoticon_training_features, training_labels)
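
For the label conversion mentioned above, one common route is scikit-learn's LabelEncoder. This is only a sketch with made-up labels; whether your classifier actually needs numeric labels depends on the classifier and library version.

```python
from sklearn.preprocessing import LabelEncoder

# Placeholder string labels, for illustration only.
training_labels = ["positive", "negative", "positive"]

encoder = LabelEncoder()
# Classes are sorted, so 'negative' maps to 0 and 'positive' to 1.
numeric_labels = encoder.fit_transform(training_labels)
print(numeric_labels)

# Map numeric predictions back to the original strings when reporting.
print(encoder.inverse_transform(numeric_labels))
```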

Then, to evaluate the performance of the classifier, we must transform our test data so that it uses the same emoticon features:

emoticon_test_features = c_vect.transform(test_data)

Finally, we can perform our prediction:

predicted = clf.predict(emoticon_test_features)

Done. A fairly standard way to evaluate performance at this point is to use:

from sklearn.metrics import classification_report
print(classification_report(test_labels, predicted))
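
Putting the steps together, a toy end-to-end run might look like the sketch below. The data, the two-emoticon vocabulary, the whitespace token_pattern and the choice of a linear-kernel SVC are all assumptions made for illustration; real tweets and a full emoticon list would go in their place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Toy data for illustration only; 1 = positive, 0 = negative.
training_data = ["love it :)", "so good :) :)", "awful :(", "never again :( :("]
training_labels = [1, 1, 0, 0]
test_data = ["nice one :)", "what a mess :("]
test_labels = [1, 0]

# Count only the listed emoticons, splitting tweets on whitespace.
c_vect = CountVectorizer(vocabulary=[":)", ":("], token_pattern=r"\S+", lowercase=False)

clf = SVC(kernel="linear")
clf.fit(c_vect.transform(training_data), training_labels)

# The test data must go through the same vectorizer as the training data.
predicted = clf.predict(c_vect.transform(test_data))
print(classification_report(test_labels, predicted))
```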

Phew. Hope that helps.

score 2 · Accepted Answer

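Both options should work.

There is also a third option, which is to tokenize your samples manually and feed them to a DictVectorizer instead of a CountVectorizer. An example using the simplest tokenizer there is, str.split: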

>>> from collections import Counter
>>> from sklearn.feature_extraction import DictVectorizer
>>> vect = DictVectorizer()
>>> samples = [":) :) :)", "I have to push the pram a lot"]
>>> X = vect.fit_transform(Counter(s.split()) for s in samples)
>>> X
<2x9 sparse matrix of type '<type 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse Row format>
>>> vect.vocabulary_
{'a': 2, ':)': 0, 'I': 1, 'to': 8, 'have': 3, 'lot': 4, 'push': 6, 'the': 7, 'pram': 5}
>>> vect.inverse_transform(X[0])  # just for inspection
[{':)': 3.0}]

However, with DictVectorizer you will have to build your own bigrams.
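
One way to build those bigrams by hand is to pair each token with its successor before counting. The helper name below is hypothetical, and joining the pair with a space is just one possible convention for naming bigram features.

```python
from collections import Counter
from sklearn.feature_extraction import DictVectorizer

def unigram_and_bigram_counts(text):
    """Count whitespace-separated tokens and adjacent token pairs."""
    tokens = text.split()
    # zip(tokens, tokens[1:]) yields each token paired with the next one.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

samples = [":) :) :)", "I have to push the pram a lot"]
vect = DictVectorizer()
X = vect.fit_transform([unigram_and_bigram_counts(s) for s in samples])

# The first sample has three ':)' unigrams and two ':) :)' bigrams.
print(vect.inverse_transform(X[0]))
```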
