Consider this runnable example:
#coding: utf-8
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = ['öåa hej ho', 'åter aba na', 'äs äp äl']
x = vectorizer.fit_transform(corpus)
l = vectorizer.get_feature_names()
for u in l:
    print u
The output will be
aba
hej
ho
na
ter
Why are the characters å, ä and ö removed? Note that strip_accents=None is the vectorizer's default, so accent stripping should not be involved. I would be really grateful if you could help me with this.
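In case it helps narrow things down, the symptom can be imitated outside scikit-learn with plain `re`. This is only a sketch under an assumption I have not confirmed: that the tokenizer ends up treating only ASCII letters, digits and underscores as word characters. `ASCII_WORD` and `UNICODE_WORD` are hypothetical names of mine, not taken from scikit-learn's source; the second pattern resembles the Unicode-aware `token_pattern` documented as the default in recent scikit-learn releases.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical stand-in: tokens of 2+ characters where only ASCII letters,
# digits and underscore count as word characters (an assumption, see above).
ASCII_WORD = re.compile(r"(?<![A-Za-z0-9_])[A-Za-z0-9_]{2,}(?![A-Za-z0-9_])")

# Unicode-aware variant: tokens of 2+ word characters, with \w matching å, ä, ö.
UNICODE_WORD = re.compile(r"(?u)\b\w\w+\b")

# All three documents joined into one string, as unicode text.
text = u"öåa hej ho åter aba na äs äp äl"

ascii_tokens = sorted(set(ASCII_WORD.findall(text)))
unicode_tokens = sorted(set(UNICODE_WORD.findall(text)))

# The ASCII-only tokens contain no non-ASCII characters, so printing is safe.
for token in ascii_tokens:
    print(token)

# Only the count is printed here to avoid console encoding issues.
print(len(unicode_tokens))
```

The ASCII-only variant yields exactly the five tokens shown above (aba, hej, ho, na, ter), while the Unicode-aware variant keeps all nine words intact, including öåa and åter. That would be consistent with å, ä and ö being treated as token separators rather than being stripped as accents, but it does not prove that this is what the vectorizer does.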