python - sklearn: extending CountVectorizer to fuzzy-match against the vocabulary

Question

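I plan to try fuzzywuzzy with a tuned minimum acceptable score. Basically, it would check whether a word is in the vocabulary and, if it is not, ask fuzzywuzzy for the best fuzzy match and accept that match as the token if it reaches at least a certain score.

If this isn't the best way to handle a large number of typos and words with slightly different but similar spellings, I'm open to suggestions.

The problem is that the subclass keeps complaining that it has an empty vocabulary, which makes no sense, because the regular CountVectorizer works fine in the same part of my code.

It spits out many errors like this: ValueError: empty vocabulary; perhaps the documents only contain stop words

What am I missing? I haven't made it do anything special yet. It should work just like the original: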

import numpy
from sklearn.feature_extraction.text import CountVectorizer


class FuzzyCountVectorizer(CountVectorizer):
    def __init__(self, input='content', encoding='utf-8', decode_error='strict',
                 strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None,
                 token_pattern="(?u)\b\w\w+\b", ngram_range=(1, 1), analyzer='word',
                 max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False,
                 dtype=numpy.int64, min_fuzzy_score=80):
        super().__init__(
            input=input, encoding=encoding, decode_error=decode_error, strip_accents=strip_accents,
            lowercase=lowercase, preprocessor=preprocessor, tokenizer=tokenizer, stop_words=stop_words,
            token_pattern=token_pattern, ngram_range=ngram_range, analyzer=analyzer, max_df=max_df,
            min_df=min_df, max_features=max_features, vocabulary=vocabulary, binary=binary, dtype=dtype)
        # self._trained = False
        self.min_fuzzy_score = min_fuzzy_score

    @staticmethod
    def remove_non_alphanumeric_chars(s: str) -> 'str':
        pass

    @staticmethod
    def tokenize_text(s: str) -> 'List[str]':
        pass

    def fuzzy_repair(self, sl: 'List[str]') -> 'List[str]':
        pass

    def fit(self, raw_documents, y=None):
        print('Running FuzzyTokenizer Fit')
        #TODO clean up input
        super().fit(raw_documents=raw_documents, y=y)
        self._trained = True
        return self

    def transform(self, raw_documents):
        print('Running Transform')
        #TODO clean up input
        #TODO fuzzyrepair
        return super().transform(raw_documents=raw_documents)

score 3 · Accepted Answer

The original scikit-learn definition of CountVectorizer has

token_pattern=r"(?u)\b\w\w+\b"

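whereas in your subclass you leave out the r raw-string prefix, and that causes the problem: in an ordinary string literal, \b is a backspace character rather than a regex word boundary, so the pattern matches no tokens and the vocabulary ends up empty. Also, rather than copying all the __init__ parameters, it may be easier to use,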

def __init__(self, *args, **kwargs):
    self.min_fuzzy_score = kwargs.pop('min_fuzzy_score', 80)
    super().__init__(*args, **kwargs)
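
To see the effect of the missing r prefix in isolation, here is a minimal sketch that uses only the standard re module. The sample text is made up for illustration. Without the prefix, the pattern asks for literal backspace characters around each token, so nothing in ordinary text matches.

```python
import re

# Without the r prefix, "\b" is the backspace character (\x08), not a regex
# word boundary. "\\w" is written with a doubled backslash only to avoid
# Python's invalid-escape warning; the resulting string is identical to the
# "(?u)\b\w\w+\b" literal used in the question.
broken_pattern = "(?u)\b\\w\\w+\b"
raw_pattern = r"(?u)\b\w\w+\b"

sample = "fuzzy matching of words"  # made-up sample text

broken_tokens = re.findall(broken_pattern, sample)
raw_tokens = re.findall(raw_pattern, sample)

print("\b" == "\x08")   # True
print(broken_tokens)    # []
print(raw_tokens)       # ['fuzzy', 'matching', 'of', 'words']
```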

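As for whether this is the best approach, that depends on the size of the dataset. For a document set with N_words words in total and a vocabulary of size N_vocab_size, this approach requires O(N_words*N_vocab_size) fuzzy word comparisons. If you instead vectorize the dataset with a standard CountVectorizer and then reduce the computed vocabulary (and the bag-of-words matrix) by fuzzy matching, it requires "only" O(N_vocab_size**2) comparisons.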
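
The second strategy could look roughly like the sketch below. It is an illustration under several assumptions, not a definitive implementation: the standard-library difflib.SequenceMatcher stands in for fuzzywuzzy (so the score runs from 0 to 1 rather than 0 to 100), a Counter of total term counts stands in for the fitted vocabulary, and merge_similar_terms and the sample counts are hypothetical names and data.

```python
from collections import Counter
from difflib import SequenceMatcher


def merge_similar_terms(term_counts, min_score=0.8):
    """Map each term to a canonical spelling and merge the counts.

    Terms are visited from most to least frequent, so the most common
    spelling of a group becomes its canonical form. In the worst case
    every term is compared with every other term: O(N_vocab_size**2).
    """
    canonical = []  # canonical spellings found so far
    mapping = {}    # term -> canonical spelling
    ordered = sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for term, _ in ordered:
        match = next(
            (rep for rep in canonical
             if SequenceMatcher(None, term, rep).ratio() >= min_score),
            None,
        )
        if match is None:
            canonical.append(term)
            mapping[term] = term
        else:
            mapping[term] = match

    merged = Counter()
    for term, count in term_counts.items():
        merged[mapping[term]] += count
    return mapping, merged


# Hypothetical term counts, as a plain vectorizer might produce them.
counts = Counter({"color": 5, "colour": 2, "colr": 1, "banana": 3})
mapping, merged = merge_similar_terms(counts)
print(mapping)
print(merged)
```

With a real bag-of-words matrix, the same mapping would be used to sum the columns of terms that share a canonical spelling.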

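This may still not scale well for vocabularies of more than 10,000 words. If you intend to apply a machine learning algorithm to the resulting sparse array, you might also want to try character n-grams, which are somewhat robust to typographical errors as well.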
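
The following standard-library sketch illustrates why character n-grams tolerate spelling variation: two spellings that count as entirely different word tokens still share most of their trigrams. The helper names and example words are made up for illustration, and the space padding is an assumption meant to mimic word-boundary n-grams.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word padded with spaces."""
    padded = f" {word} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


grams_z = char_ngrams("vectorizer")
grams_s = char_ngrams("vectoriser")   # spelling variant
grams_other = char_ngrams("banana")   # unrelated word

shared = grams_z & grams_s
print(len(shared), len(grams_z))      # 7 of the 10 trigrams are shared
print(jaccard(grams_z, grams_s))      # 7 / 13, roughly 0.54
print(jaccard(grams_z, grams_other))  # 0.0
```

In scikit-learn itself, CountVectorizer can produce such features directly through its analyzer='char' or analyzer='char_wb' option together with ngram_range, for example ngram_range=(3, 3).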
