python - Why does the Tf-idf vectorizer produce features containing spaces with char_wb?

Question

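I am using

singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6),
                              stop_words=my_stop_words, max_features=50).fit([text])

and I am wondering why some of my features contain spaces, for example "chaft".

How can I avoid this? Do I need to tokenize and preprocess the text myself?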

score 0 · Accepted Answer

Use analyzer='word'.

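With char_wb, the vectorizer does not tokenize the text into word features. It splits the text on whitespace, pads each word with a space on both sides, and then extracts character n-grams from the padded word, so the n-grams at the edges of a word contain a space.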

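According to the documentation:

analyzer : string, {'word', 'char', 'char_wb'} or callable

Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.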
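The padding described in the documentation can be imitated in a few lines of plain Python. This is only a simplified sketch, not scikit-learn's actual code: the function name char_wb_ngrams_sketch is made up for illustration, and the handling of words shorter than n may differ from the library.

```python
import re


def char_wb_ngrams_sketch(text, min_n, max_n):
    """Simplified imitation of the 'char_wb' analyzer (hypothetical helper)."""
    # Lowercase and collapse runs of whitespace, then split into words.
    text = re.sub(r"\s+", " ", text.lower())
    ngrams = []
    for word in text.split():
        # Each word is padded with one space on both sides.
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            # Slide a window of width n over the padded word.
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


print(char_wb_ngrams_sketch("first", 4, 6))
# [' fir', 'firs', 'irst', 'rst ', ' firs', 'first', 'irst ', ' first', 'first ']
```

Every n-gram that touches the start or end of the padded word carries a space, which is where features such as 'first ' come from.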

The following example shows the behavior:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6))
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print([(len(w), w) for w in vectorizer.get_feature_names_out()])

Output (excerpt):

...ent? '), (4, 'firs'), (5, 'first'), (6, 'first '), (4, 'hird'), (5, 'hird '), (4, 'his '), (4, 'ird '), (4, 'irst'), (5, 'irst '), (4, 'ment'), (5, 'ment '), (5, 'ment.'), (6, 'ment. '), (5, 'ment?'), (6, 'ment? '), (4, 'ne. '), (4, 'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), (6, 'ocumen'), (4, 'ond '), (4, 'one.'), (5, 'one. '), (4, 'rst '), (4, 'seco'), (5, 'secon'), (6, 'second'), (4, 'the '), (4, 'thir'), (5, 'third'), (6, 'third '), (4, 'this'), (5, 'this '), (4, 'umen'), (5, 'ument'), (6, 'ument '), (6, 'ument.'), (6, ...
