
I am using

singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6),
                              stop_words=my_stop_words, max_features=50).fit([text])

and I am wondering why my features contain spaces, for example "chaft".

How can I avoid this? Do I need to tokenize and preprocess the text myself?


1 Answer


Use analyzer='word'.

When we use char_wb, the vectorizer pads with spaces because it does not tokenize the text into words; it builds character n-grams only from text inside word boundaries and pads the edges of each word with a space.

According to the documentation:

analyzer : string, {'word', 'char', 'char_wb'} or callable

Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

See the example below for usage:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6))
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print([(len(w), w) for w in vectorizer.get_feature_names_out()])

Output (truncated):

...ent?'), (4, 'firs'), (5, 'first'), (6, 'first '), (4, 'hird'), (5, 'hird '), (4, 'his '), (4, 'ird '), (4, 'irst'), (5, 'irst '), (4, 'ment'), (5, 'ment'), (5, 'ment.'), (6, 'ment.'), (5, 'ment?'), (6, 'ment?'), (4, 'ne. '), (4, 'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), (6, 'ocumen'), (4, 'ond '), (4, 'one.'), (5, 'one. '), (4, 'rst '), (4, 'seco'), (5, 'secon'), (6, 'second'), (4, 'the '), (4, 'thir'), (5, 'third'), (6, 'third '), (4, 'this'), (5, 'this '), (4, 'umen'), (5, 'ument'), (6, 'ument'), (6, 'ument.'), (6,

Answered on 2019-01-22T13:13:12.903