python - Is it possible to create an equivalent "restrict" method for CountVectorizer, like the one available for DictVectorizer in Scikit-learn?

Question

For DictVectorizer, you can subset the object with the restrict() method. Here is an example in which I use a boolean array to explicitly list the features to keep.

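import numpy as np
from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer()
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
X = v.fit_transform(D)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
v.get_feature_names()
>>['bar', 'baz', 'foo']
user_list = np.array([False, False, True], dtype=bool)
v.restrict(user_list)
v.get_feature_names()
>>['foo']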
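
As a small sketch of the same idea, restrict() also accepts integer indices when called with indices=True, and the matrix that was already computed can be column-sliced to match the restricted vectorizer. The variable names below (support, X_restricted, v2) are illustrative and not part of the original post.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]

v = DictVectorizer()
X = v.fit_transform(D)  # columns are ordered bar, baz, foo

# Boolean mask of the features to keep
support = np.array([False, False, True])

# Slice the matrix that was already computed, then restrict the vectorizer to match
X_restricted = X[:, np.flatnonzero(support)]
v.restrict(support)
names_after = list(v.feature_names_)

# The same restriction expressed with integer indices on a second vectorizer
v2 = DictVectorizer().fit(D)
v2.restrict([2], indices=True)
names_after_idx = list(v2.feature_names_)
```

After the restriction, transforming D again with v should give the same columns as the sliced matrix.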

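I would like to have the same ability with an un-normalized CountVectorizer object. I have not found a way to slice the NumPy objects that come out of a CountVectorizer, because there are a number of dependent attributes. The reason I am interested is that it would remove the need to fit and transform the text data again when all I want to do is drop features after the first fit and transform. Is there an equivalent method that I am missing, or could a custom method be easily created for CountVectorizer?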

Update based on @Vivek's reply

This approach seems to work. Here is my code implementing it directly in a Python session.

import copy as cp
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
v = CountVectorizer()
D = ['Data science is about the data', 'The science is amazing', 'Predictive modeling is part of data science']
v.fit_transform(D)
print(v.get_feature_names())
print(len(v.get_feature_names()))
>> ['about', 'amazing', 'data', 'is', 'modeling', 'of', 'part', 'predictive', 'science', 'the']
>> 10
user_list = np.array([False, False, True, False, False, True, False, False, True, False], dtype=bool)

new_vocab = {}
for i in np.where(user_list)[0]:
    print(v.get_feature_names()[i])
    new_vocab[v.get_feature_names()[i]] = len(new_vocab)
new_vocab

>> data
>> of
>> science
>> {'data': 0, 'of': 1, 'science': 2}

v_copy = cp.deepcopy(v)
v_copy.vocabulary_ = new_vocab
print(v_copy.vocabulary_)
print(v_copy.get_feature_names())
v_copy

>> {'data': 0, 'of': 1, 'science': 2}
>> ['data', 'of', 'science']
>> CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

v_copy.transform(D).toarray()
>> array([[2, 0, 1],
       [0, 0, 1],
       [1, 1, 1]], dtype=int64)

Thanks @Vivek! This appears to work as expected for an un-normalized CountVectorizer object.

score 1 · Accepted Answer

Posting this as an answer that implements @Vivek's suggestion, which was given as a comment on the original question:

import copy as cp
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
v = CountVectorizer()
D = ['Data science is about the data', 'The science is amazing', 'Predictive modeling is part of data science']
v.fit_transform(D)
print(v.get_feature_names())
print(len(v.get_feature_names()))
>> ['about', 'amazing', 'data', 'is', 'modeling', 'of', 'part', 'predictive', 'science', 'the']
>> 10
user_list = np.array([False, False, True, False, False, True, False, False, True, False], dtype=bool)

new_vocab = {}
for i in np.where(user_list)[0]:
    print(v.get_feature_names()[i])
    new_vocab[v.get_feature_names()[i]] = len(new_vocab)
new_vocab

>> data
>> of
>> science
>> {'data': 0, 'of': 1, 'science': 2}

v_copy = cp.deepcopy(v)
v_copy.vocabulary_ = new_vocab
print(v_copy.vocabulary_)
print(v_copy.get_feature_names())
v_copy

>> {'data': 0, 'of': 1, 'science': 2}
>> ['data', 'of', 'science']
>> CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

v_copy.transform(D).toarray()
>> array([[2, 0, 1],
       [0, 0, 1],
       [1, 1, 1]], dtype=int64)
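
The steps above could be wrapped in a small helper. The following is only a sketch under the same assumptions as the session (an un-normalized CountVectorizer whose only state that matters for transform() is vocabulary_); the function name restrict_count_vectorizer is hypothetical and not part of scikit-learn. It also slices the already-computed matrix, which avoids transforming the text a second time.

```python
import copy

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def restrict_count_vectorizer(vectorizer, support):
    """Return a copy of a fitted CountVectorizer that keeps only masked features.

    Hypothetical helper, not a scikit-learn API.
    """
    support = np.asarray(support, dtype=bool)
    # Terms ordered by their current column index
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    kept = [term for term, keep in zip(terms, support) if keep]
    restricted = copy.deepcopy(vectorizer)
    # Re-number the kept terms so that column indices are contiguous
    restricted.vocabulary_ = {term: i for i, term in enumerate(kept)}
    return restricted


D = ['Data science is about the data',
     'The science is amazing',
     'Predictive modeling is part of data science']
v = CountVectorizer()
X = v.fit_transform(D)

mask = np.array([False, False, True, False, False,
                 True, False, False, True, False])
v_small = restrict_count_vectorizer(v, mask)

# Slice the existing matrix instead of transforming the documents again
X_small = X[:, np.flatnonzero(mask)]
```

Here X_small should match v_small.transform(D), so the restricted vectorizer is only needed for documents that have not been transformed yet.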

score 0 · Accepted Answer

You can assign the vocabulary of one vectorizer to another, or restrict it, as follows:

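from sklearn.feature_extraction.text import CountVectorizer

# list_of_strings1 and list_of_strings2 are placeholders for your own corpora
count_vect1 = CountVectorizer()
count_vect1.fit(list_of_strings1)

# Reuse the fitted vocabulary; with a fixed vocabulary, fit() does not learn a new one
count_vect2 = CountVectorizer(vocabulary=count_vect1.vocabulary_)
count_vect2.fit(list_of_strings2)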

Answer adapted from: ValueError: Dimension mismatch
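
A runnable sketch of this idea applied to the restriction case might look like the following. The corpora and the kept_terms list are made-up illustrations; the vocabulary parameter of CountVectorizer accepts either a mapping or an iterable of terms.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpora, not taken from the original answer
list_of_strings1 = ['Data science is about the data',
                    'The science is amazing',
                    'Predictive modeling is part of data science']
list_of_strings2 = ['Science of data', 'Nothing relevant here']

count_vect1 = CountVectorizer()
count_vect1.fit(list_of_strings1)

# Keep only a subset of the terms learned by the first vectorizer
kept_terms = [t for t in ('data', 'of', 'science') if t in count_vect1.vocabulary_]

# With a fixed vocabulary, fitting does not learn any new terms
count_vect2 = CountVectorizer(vocabulary=kept_terms)
X2 = count_vect2.fit_transform(list_of_strings2)
counts = X2.toarray().tolist()
```

Unlike copying the fitted object, this builds a fresh vectorizer, so any non-default constructor settings of the first vectorizer would need to be passed again.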
