
I have been reading more of the modern work on sentiment classification (analysis), for example this.

Using the IMDB dataset as an example, I get similar accuracy with Doc2Vec (88%), but a simple tfidf vectorizer with trigrams for feature extraction does noticeably better (91%). I think this mirrors Table 2 in Mikolov's 2015 paper.

I assumed this would change with a larger dataset, so I re-ran my experiment using a 1 million training / 1 million test split from here. Unfortunately, in that case my tfidf vectorizer feature-extraction approach rose to 93% while doc2vec dropped to 85%.

I am wondering whether this is to be expected, and whether others have also found tfidf outperforming doc2vec even on large corpora?

My data cleaning is simple:

from bs4 import BeautifulSoup

def clean_review(review):
    # strip HTML markup from the raw review text
    temp = BeautifulSoup(review, "lxml").get_text()
    # pad punctuation with spaces so it survives as separate tokens
    punctuation = """.,?!:;(){}[]"""
    for char in punctuation:
        temp = temp.replace(char, ' ' + char + ' ')
    # lowercase and collapse whitespace to single spaces
    words = " ".join(temp.lower().split()) + "\n"
    return words

I have tried 400 and 1200 features for the Doc2Vec model:

from gensim.models import Doc2Vec
model = Doc2Vec(min_count=2, window=10, size=model_feat_size, sample=1e-4, negative=5, workers=cores)

whereas my tfidf vectorizer has 40,000 max features:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features = 40000, ngram_range = (1, 3), sublinear_tf = True)

For classification I tried a few linear methods, but found that a simple logistic regression does fine...
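To be concrete, the classification step looks roughly like this (a sketch only; train_texts/train_labels and test_texts/test_labels are placeholder names for the cleaned reviews and their sentiment labels):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Sketch of the tfidf + logistic regression baseline; the train_*/test_*
# names are placeholders for the cleaned reviews and their sentiment labels.
clf = make_pipeline(vectorizer, LogisticRegression())
clf.fit(train_texts, train_labels)
print("tfidf + logreg accuracy:", clf.score(test_texts, test_labels))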


1 Answer


The example code Mikolov once posted (https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ) used options -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1 – which in gensim would be similar to dm=0, dbow_words=1, size=100, window=10, hs=0, negative=5, sample=1e-4, iter=20, min_count=1, workers=cores.
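In gensim code those settings work out to roughly the following (a sketch, not the asker's configuration; cores stands for the CPU count passed as workers):

from gensim.models import Doc2Vec

# Rough gensim equivalent of the word2vec.c flags above: pure DBOW (dm=0)
# with simultaneous skip-gram word training (dbow_words=1).
model = Doc2Vec(dm=0, dbow_words=1, size=100, window=10, hs=0, negative=5,
                sample=1e-4, iter=20, min_count=1, workers=cores)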

My hunch is that optimal values might involve a smaller window and higher min_count, and maybe a size somewhere between 100 and 400, but it's been a while since I've run those experiments.

It can also sometimes help a little to re-infer vectors on the final model, using a larger-than-the-default passes parameter, rather than re-using the bulk-trained vectors. Still, these may just converge on similar performance to Tfidf – they're all dependent on the same word-features, and not very much data.
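The re-inference mentioned above might look something like this (a sketch; tagged_docs is a hypothetical list of TaggedDocument objects, and the iteration argument to infer_vector is called steps in older gensim releases and epochs in newer ones):

# Re-infer each document's vector from the frozen model with more inference
# passes than the default, instead of reading back the bulk-trained vectors.
inferred_vecs = [model.infer_vector(doc.words, steps=20) for doc in tagged_docs]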

Going to a semi-supervised approach, where some of the document-tags represent known sentiments, can also sometimes help.
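One way to set that up is to give documents with known labels an extra shared tag, for example (a sketch; the labels mapping from document index to known sentiment is hypothetical):

from gensim.models.doc2vec import TaggedDocument

# Every document gets a unique id tag; documents whose sentiment is known
# also get a shared 'POS'/'NEG' tag, so those tags are trained alongside
# the per-document vectors while unlabeled documents still contribute.
def build_tagged_docs(token_lists, labels):
    for i, words in enumerate(token_lists):
        tags = ['DOC_%d' % i]
        if i in labels:
            tags.append('POS' if labels[i] == 1 else 'NEG')
        yield TaggedDocument(words=words, tags=tags)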

Answered on 2016-07-29T02:33:21.340