OK, I got it. As I said, the idea is to compute the similarity only between the new batch of documents and the existing ones, since the similarities among the existing documents do not change. The problem is updating the TfidfVectorizer's vocabulary with the newly seen terms.
The solution has two steps:
- Update the vocabulary and the tf matrices.
- Matrix multiplication and stacking.
Here is the whole script - we first have the original corpus and the fitted objects and computed matrices:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix, hstack, vstack
import numpy as np

corpus = [doc1, doc2, doc3]  # doc1..doc3 are raw text strings
# Build for the first time:
vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
tf_matrix = vect.fit_transform(corpus)
# Rows are L2-normalized by default (norm="l2"), so this is the cosine-similarity matrix:
similarities = tf_matrix * tf_matrix.T
similarities_matrix = similarities.A  # just for printing
Now, given new documents:
new_docs_corpus = [docx, docy, docz]  # New documents
# Build a new vectorizer to parse the vocabulary of the new documents:
new_vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
new_vect.fit(new_docs_corpus)
# Merge the old and new vocabularies, appending unseen terms at the end:
new_terms_count = 0
for k, v in new_vect.vocabulary_.items():
    if k in vect.vocabulary_:
        continue
    vect.vocabulary_[k] = np.int64(len(vect.vocabulary_))  # important not to assign a plain int
    new_terms_count = new_terms_count + 1
new_vect.vocabulary_ = vect.vocabulary_
# Build the new docs' representation using the merged vocabulary:
new_tf_matrix = new_vect.transform(new_docs_corpus)
new_similarities = new_tf_matrix * new_tf_matrix.T
# Bring the old tf-matrix to the same dimensions by appending zero columns for the new terms:
if new_terms_count:
    zero_matrix = csr_matrix((tf_matrix.shape[0], new_terms_count))
    tf_matrix = hstack([tf_matrix, zero_matrix])
    # tf_matrix = vect.transform(corpus) # Instead of re-transforming, we just append 0's for the new terms, to save time
cross_similarities = new_tf_matrix * tf_matrix.T  # Calculate cross-similarities
tf_matrix = vstack([tf_matrix, new_tf_matrix])
# Stack it all together:
similarities = vstack([hstack([similarities, cross_similarities.T]),
                       hstack([cross_similarities, new_similarities])])
similarities_matrix = similarities.A
# Updating the corpus with the new documents:
corpus = corpus + new_docs_corpus
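If batches keep arriving, the same update can be repeated as-is. As a convenience, here is a minimal sketch that wraps the steps above into a reusable helper, reusing the imports above; the name add_documents and its signature are my own, not part of the original script:

# Sketch of a reusable helper; assumes use_idf=False throughout.
def add_documents(vect, tf_matrix, similarities, corpus, new_docs):
    new_vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
    new_vect.fit(new_docs)
    # Merge vocabularies, appending unseen terms at the end:
    new_terms_count = 0
    for term in new_vect.vocabulary_:
        if term not in vect.vocabulary_:
            vect.vocabulary_[term] = np.int64(len(vect.vocabulary_))
            new_terms_count += 1
    new_vect.vocabulary_ = vect.vocabulary_
    new_tf = new_vect.transform(new_docs)
    if new_terms_count:  # pad old rows with zero columns for the new terms
        tf_matrix = hstack([tf_matrix, csr_matrix((tf_matrix.shape[0], new_terms_count))])
    cross = new_tf * tf_matrix.T
    similarities = vstack([hstack([similarities, cross.T]),
                           hstack([cross, new_tf * new_tf.T])])
    return vect, vstack([tf_matrix, new_tf]), similarities, corpus + list(new_docs)

# Usage for a further batch (docw, docv are hypothetical new documents):
# vect, tf_matrix, similarities, corpus = add_documents(
#     vect, tf_matrix, similarities, corpus, [docw, docv])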
We can check the result by comparing the similarities_matrix we computed incrementally with the one we get by training a new TfidfVectorizer on the joint corpus corpus + new_docs_corpus:
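A minimal sketch of that check, assuming the variables above are in scope (check_vect and check_tf are my own names). The similarity matrix does not depend on the column order of the vocabulary, so the two results should agree up to floating-point error:

check_vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
check_tf = check_vect.fit_transform(corpus)  # corpus already includes new_docs_corpus
check_similarities_matrix = (check_tf * check_tf.T).A
print(np.allclose(similarities_matrix, check_similarities_matrix))  # expect True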
As discussed in the comments, we can do all this only because we are not using the idf (inverse document frequency) element, which would change the representation of the existing documents whenever new documents are added.
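To see why, note that a term's idf depends on how many documents in the whole corpus contain it, so every new batch would rescale rows that were already computed. A small illustration (my own example, not from the original answer):

# With use_idf=True, idf_ changes as the corpus grows, invalidating old rows:
demo_vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=True)
demo_vect.fit(["the cat sat", "the dog ran"])
print(demo_vect.idf_)  # idf per term, in alphabetical term order
demo_vect.fit(["the cat sat", "the dog ran", "one more cat"])
print(demo_vect.idf_)  # "cat" now appears in 2 of 3 docs: its idf drops, the others' rise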