nlp - How to cluster keywords or get keyword similarity when I already have their vectors

Question

I have a Python dictionary saved with pickle as a vector file (the vectors come from Bert-as-Service with Google's pretrained model), for example:

(key)Phrase : (value)Phrase_Vector_from_Bert = 女布 : 1.3237 -2.6354 1.7458 ....

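However, I don't know how to get phrase similarity from the Bert-as-Service vectors in this file the way I would with Gensim Word2Vec, which comes with a .similarity method.

Could you suggest a way to get phrase/keyword similarity, or to cluster the phrases, using my pickled Python dictionary of vectors?

Or is there perhaps a better way to cluster keywords with Bert-as-Service?

The following code shows how I get the vectors for the phrases/keywords: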

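from bert_serving.client import BertClient

from Myutility import save_model
# Myutility is the author's own module; it provides save_model and load_model

import BertCommand
# BertCommand is the author's own module; it has the function that starts the Bert-as-Service client

WORD_PATH = 'E:/Works/testwords.txt'
WORD_FEATURE = 'E:/Works/word.google.vector'

# The original snippet never defines bc. Assumption: it is a bert-serving
# BertClient connected to an already running Bert-as-Service server.
bc = BertClient()

word_vectors = {}

# Read one phrase per line, skipping empty lines
with open(WORD_PATH) as f:
    for line in f:
        word = line.strip('\n')
        if word:
            print(word)
            word_vectors[word] = None

# Encode each phrase; for a single input, encode() returns an array of shape (1, dim)
for word in word_vectors:
    try:
        word_vectors[word] = bc.encode([word])
    except Exception as exc:
        # The value stays None for a phrase that fails, but the failure is reported
        print(f'Failed to encode {word!r}: {exc}')

save_model(word_vectors, WORD_FEATURE)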

score 0 · Accepted Answer

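If I understand correctly, you already have a vector for each phrase.

In that case, you can simply compute the cosine similarity between two phrase vectors.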
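A minimal sketch of that computation, assuming each stored value is a NumPy array of shape (1, dim), as bc.encode([word]) would normally return for a single input. The name cosine_similarity and the toy dictionary entries are illustrative, not part of the question's code; in practice word_vectors would be the dictionary loaded back from the pickle file.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors; inputs are flattened first."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing
        return 0.0
    return float(np.dot(a, b) / denom)

# Toy stand-ins for entries of the pickled dictionary (hypothetical values)
word_vectors = {
    "phrase_a": np.array([[1.0, 2.0, 3.0]]),
    "phrase_b": np.array([[2.0, 4.0, 6.0]]),
    "phrase_c": np.array([[-1.0, 0.0, 1.0]]),
}

sim_ab = cosine_similarity(word_vectors["phrase_a"], word_vectors["phrase_b"])
sim_ac = cosine_similarity(word_vectors["phrase_a"], word_vectors["phrase_c"])
print(sim_ab, sim_ac)
```

Values close to 1 mean the two vectors point in nearly the same direction. The function plays roughly the role of Gensim's .similarity, except that it works directly on the stored vectors.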

For more details and implementations (both a manual one and one using sklearn), I suggest this link: https://skipperkongen.dk/2018/09/19/cosine-similarity-in-python/
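To compare one phrase against every other phrase in the dictionary, the vectors can be stacked into a matrix and normalized once, so that a single matrix product gives all cosine similarities. The sketch below assumes the dictionary maps each phrase to a NumPy array, or to None when encoding failed. The helpers build_unit_matrix and most_similar are hypothetical names, and the toy vectors stand in for real BERT output.

```python
import numpy as np

def build_unit_matrix(word_vectors):
    """Stack all available vectors into one matrix with unit-length rows."""
    phrases = [p for p, v in word_vectors.items() if v is not None]
    matrix = np.vstack(
        [np.asarray(word_vectors[p], dtype=float).ravel() for p in phrases]
    )
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero for all-zero rows
    return phrases, matrix / norms

def most_similar(query, phrases, unit_matrix, topn=3):
    """Return the topn phrases ranked by cosine similarity to query."""
    idx = phrases.index(query)
    scores = unit_matrix @ unit_matrix[idx]  # dot product of unit rows = cosine
    order = np.argsort(-scores)
    return [(phrases[i], float(scores[i])) for i in order if i != idx][:topn]

# Toy stand-ins for the pickled dictionary (hypothetical values)
word_vectors = {
    "a": np.array([[1.0, 0.0]]),
    "b": np.array([[0.9, 0.1]]),
    "c": np.array([[0.0, 1.0]]),
    "d": None,  # a phrase whose encoding failed
}

phrases, unit_matrix = build_unit_matrix(word_vectors)
ranking = most_similar("a", phrases, unit_matrix)
print(ranking)
```

For clustering, the same normalized matrix could be passed to a standard algorithm such as sklearn.cluster.KMeans. Which algorithm and how many clusters suit the data is a separate choice that this sketch does not cover.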
