python - Machine learning text classification

Question

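I am writing a small Python project to classify text.
The idea is simple: we have a corpus of sentences, each labeled as belonging to J. Chirac or Mitterrand (two former presidents of the French Republic).
The goal is to build a model that predicts which of the two a given sentence belongs to. The class labels are "M" for Mitterrand and "C" for Chirac; in my program I map M ==> -1 and C ==> 1.
Finally, I trained a classification algorithm called naive Bayes on my dataset and made predictions on new (test) data.
The problem is that when I evaluated my system I got a very low score, even though I tried several methods to raise it (stop words, bigrams, smoothing...).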

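If anyone has other ideas or suggestions for improving my system's performance, I would be very grateful.

I attach some of my code below.

In the code below, I define my stop list (to remove words that are not very important) and the splitters used to build my corpus, and I use bigrams: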

stoplist = set('le la les de des à un une en au ne ce d l c s je tu il que qui mais quand'.split())
stoplist.add('')
splitters = u'; |, |\*|\. | |\'|'
liste = (re.split(splitters, doc.lower()) for doc in alltxts) # generator: takes no extra memory
dictionary = corpora.Dictionary([u"{0}_{1}".format(l[i],l[i+1]) for i in xrange(len(l)-1)] for l in liste) # bigrams
print len(dictionary)
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist   if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.iteritems() if docfreq < 10 ]
dictionary.filter_tokens(stop_ids + once_ids) # remove stop words and tokens that appear in fewer than 10 documents
dictionary.compactify() # remove gaps in id sequence after words that were removed
print len(dictionary)
liste = (re.split(splitters, doc.lower()) for doc in alltxts) # NOTE: once a generator has been consumed it does not rewind, so re-create it to be safe
alltxtsBig = ([u"{0}_{1}".format(l[i],l[i+1]) for i in xrange(len(l)-1)] for l in liste)
corpusBig = [dictionary.doc2bow(text) for text in alltxtsBig]
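One thing worth checking in the snippet above: the dictionary holds bigram tokens of the form `word1_word2`, so looking up single stop words in `dictionary.token2id` probably matches nothing and `stop_ids` stays empty. A minimal sketch of one alternative, assuming the intent was to drop stop words before pairing tokens; it uses Python 3 syntax, and the names `STOPLIST`, `SPLITTERS`, `tokenize` and `bigrams` are hypothetical helpers, not part of the original code:

```python
import re

# Same stop words as in the question.
STOPLIST = set(
    "le la les de des à un une en au ne ce d l c s je tu il que qui mais quand".split()
)

# Same separators as in the question, minus the trailing empty alternative:
# in Python 3.7+ re.split also splits on empty matches, which would cut
# the text between every character.
SPLITTERS = r"; |, |\*|\. | |'"


def tokenize(doc):
    """Lowercase, split, and drop empty strings and stop words."""
    tokens = re.split(SPLITTERS, doc.lower())
    return [t for t in tokens if t and t not in STOPLIST]


def bigrams(tokens):
    """Join each pair of consecutive tokens as 'first_second'."""
    return ["{0}_{1}".format(a, b) for a, b in zip(tokens, tokens[1:])]


example = bigrams(tokenize("Le chat de la voisine dort"))
```

Filtering first means a bigram can span words that were not adjacent in the original sentence. If that is not wanted, another option is to build the bigrams first and discard those in which both halves are stop words.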

Here I build the corpus for my test dataset:

liste_test = (re.split(splitters, doc.lower()) for doc in alltxts_test)
alltxtsBig_test = ([u"{0}_{1}".format(l[i],l[i+1]) for i in xrange(len(l)-1)] for l in liste_test)
corpusBig_test = [dictionary.doc2bow(text) for text in alltxtsBig_test]
Here I convert the data to a sparse matrix, fit the algorithm on the training data, and make predictions on the test data:


dataSparse = gensim.matutils.corpus2csc(corpusBig)
dataSparse_test = gensim.matutils.corpus2csc(corpusBig_test)
import sklearn.feature_extraction.text as txtTools #.TfidfTransformer
t = txtTools.TfidfTransformer()
t.fit(dataSparse.T)
data2 = t.transform(dataSparse.T)
data_test = t.transform(dataSparse_test.T)
nb_classifier = MultinomialNB().fit(data2, labs)
y_nb_predicted = nb_classifier.predict(data_test)

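Edit:
My system's performance score is 0.28. Normally, a system that works should give more than 0.6.
I am working on a file of thousands of sentences. I did declare gensim; I did not paste all of the code here because it is very long. My question is whether there are other ways to improve the system's performance. I have used bigrams and smoothing, and that is all.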
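The post does not say which metric produced the 0.28. A minimal sketch, assuming plain accuracy, of how a score can be compared with a majority-class baseline and broken down per class; the function names and the toy labels are made up for illustration and are not results from the asker's data:

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each label, the fraction of its examples that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels only (1 = Chirac, -1 = Mitterrand, as in the question).
y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, -1, 1, 1, -1]

print(accuracy(y_true, y_pred))
print(majority_baseline(y_true))
print(per_class_recall(y_true, y_pred))
```

With two classes, an accuracy well below the majority baseline usually points to a problem in the pipeline (for example, labels and documents out of alignment) rather than to weak features.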

score 0 · Accepted Answer

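Welcome to Stack Overflow. First, are you sure your performance is bad? You don't even say what performance you got, but if (as you seem to be saying) you are trying to identify the author from a single sentence, I wouldn't expect it to be reliable at all. Author identification is usually done on much longer texts.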

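I'm afraid your code is both incomplete (where is gensim defined? what do all these library functions do?) and too long to follow. But are you using the presence of every (non-stop-word) bigram in the text as a classifier feature? That is a lot of features, and they are all of the same kind (bigrams). You could try adding some features of different kinds, and/or using the bigram features more selectively to avoid overtraining. You should read up on which approaches are likely to work; author identification is not a new task.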
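As a rough illustration of that suggestion, here is a minimal sketch assuming scikit-learn is available. It combines two kinds of features (word n-grams and character n-grams) and keeps only the highest-scoring ones by a chi-squared test before fitting naive Bayes. The n-gram ranges, `k_best=20`, the `build_model` helper and the toy sentences are arbitrary placeholders, not tuned values or real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline


def build_model(k_best=20):
    """Word and character n-gram features, chi-squared selection, naive Bayes."""
    features = FeatureUnion([
        # Unigrams and bigrams over words.
        ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        # Character n-grams within word boundaries: a different kind of feature.
        ("char_ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ])
    return Pipeline([
        ("features", features),
        # Keep only the k features most associated with the labels.
        ("select", SelectKBest(chi2, k=k_best)),
        ("nb", MultinomialNB(alpha=1.0)),  # alpha is the smoothing parameter
    ])


# Placeholder sentences, invented for the example (1 = Chirac, -1 = Mitterrand).
train_texts = [
    "la france doit avancer ensemble",
    "nous devons avancer ensemble pour la france",
    "le peuple veut la justice sociale",
    "la justice sociale est notre combat",
]
train_labels = [1, 1, -1, -1]

model = build_model()
model.fit(train_texts, train_labels)
predictions = model.predict(["avancer ensemble", "justice sociale"])
print(predictions)
```

On a real corpus, `k_best` and the n-gram ranges would be chosen by cross-validation on the training set rather than fixed by hand.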

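Your question is a bit too broad to answer effectively, because there are too many possible answers. But stick with it, and ask more specific questions as you work on this further. Good luck!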
