
Consider this runnable example:

# coding: utf-8
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = ['öåa hej ho', 'åter aba na', 'äs äp äl']
x = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names()

for name in feature_names:
    print name

The output will be

aba
hej
ho
na
ter

Why are the åäö characters removed? Note that strip_accents=None is the vectorizer's default. I would be really grateful if you could help me with this.


1 Answer


This is an intentional way of reducing the dimensionality while making the vectorizer tolerant to inputs whose authors are not always consistent in their use of accented characters.
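
For example, here is a small sketch of that effect: with strip_accents='unicode', the accented and the unaccented spelling of the same word are mapped to a single token, so the vocabulary does not grow with every spelling variant.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> analyzer = CountVectorizer(strip_accents='unicode').build_analyzer()
>>> analyzer(u'\xe9t\xe9')
[u'ete']
>>> analyzer(u'ete')
[u'ete']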

If you want to disable that feature, just pass strip_accents=None to CountVectorizer, as explained in the documentation of that class:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> CountVectorizer(strip_accents='ascii').build_analyzer()(u'\xe9t\xe9')
[u'ete']
>>> CountVectorizer(strip_accents=False).build_analyzer()(u'\xe9t\xe9')
[u'\xe9t\xe9']
>>> CountVectorizer(strip_accents=None).build_analyzer()(u'\xe9t\xe9')
[u'\xe9t\xe9']
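
Applied to the corpus from the question, a minimal sketch (assuming the documents are passed in as unicode strings rather than raw byte strings, and the same scikit-learn version as in the question) keeps the Swedish characters when strip_accents is left at its default of None:

# coding: utf-8
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical rework of the question's snippet: the documents are unicode
# objects, and strip_accents=None (the default) leaves accents untouched.
corpus = [u'öåa hej ho', u'åter aba na', u'äs äp äl']
vectorizer = CountVectorizer(strip_accents=None)
x = vectorizer.fit_transform(corpus)

for name in vectorizer.get_feature_names():
    print name  # tokens such as u'åter' and u'öåa' keep their accents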
answered 2013-04-18T12:37:32.667