You need two different things.

First, tell spaCy to use external vectors for your documents, spans, or tokens. This is done by setting user hooks:

- user_hooks["vector"] for the document vector
- user_span_hooks["vector"] for span vectors
- user_token_hooks["vector"] for token vectors

Given that you have a function that retrieves the vector for a Doc/Span/Token from TF Hub (they all have the attribute text):
import numpy as np
import spacy
import tensorflow_hub as hub

model = hub.load(TFHUB_URL)

def embed(element):
    # get the text
    text = element.text
    # then get your vector back; the model signature expects batches/arrays
    results = model([text])
    # take the first element because we queried with just one text
    result = np.array(results)[0]
    return result
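The batch signature can be illustrated with a toy stand-in for the TF Hub model (toy_model and its 4-dimensional output are made up for illustration; the real Universal Sentence Encoder returns 512-dimensional vectors):

```python
import numpy as np

# hypothetical stand-in for a TF Hub model: maps a batch of texts
# to a 2-D array with one row (one vector) per text
def toy_model(texts):
    return np.stack([np.full(4, float(len(t))) for t in texts])

results = toy_model(["hello"])   # query with a batch of just one text
vector = np.array(results)[0]    # take the first (and only) row
print(vector.shape)  # (4,)
```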
You can then write the following pipeline component, which tells spaCy how to retrieve the custom embedding for documents, spans, and tokens:

def overwrite_vectors(doc):
    doc.user_hooks["vector"] = embed
    doc.user_span_hooks["vector"] = embed
    doc.user_token_hooks["vector"] = embed
    return doc

# add this to your nlp pipeline to run it on every document
nlp = spacy.blank('en')  # or any other Language
nlp.add_pipe(overwrite_vectors)  # on spaCy v3+, register the component with @Language.component and add it by name
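To check the hook mechanism end-to-end without downloading anything from TF Hub, here is a minimal sketch with a made-up embedding function (dummy_embed and its 8-dimensional vector are placeholders, not the real model); it uses the spaCy v3 registration style, since on v3 nlp.add_pipe expects a component name rather than a function:

```python
import numpy as np
import spacy
from spacy.language import Language

# placeholder embedding: a deterministic 8-dim vector based on text length
def dummy_embed(element):
    return np.full(8, float(len(element.text)))

@Language.component("overwrite_vectors_demo")
def overwrite_vectors_demo(doc):
    doc.user_hooks["vector"] = dummy_embed
    doc.user_span_hooks["vector"] = dummy_embed
    doc.user_token_hooks["vector"] = dummy_embed
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("overwrite_vectors_demo")
doc = nlp("hello world")
print(doc.vector.shape)  # the hooked 8-dim vector, not spaCy's default
```

Accessing doc.vector (or span.vector / token.vector) now calls the hook instead of spaCy's built-in vector lookup.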
For your question about a custom distance, there is a user hook for that as well:

def word_mover_similarity(a, b):
    vector_a = a.vector
    vector_b = b.vector
    # your distance score needs to be converted to a similarity score
    similarity = TODO_IMPLEMENT(vector_a, vector_b)
    return similarity

def overwrite_similarity(doc):
    doc.user_hooks["similarity"] = word_mover_similarity
    doc.user_span_hooks["similarity"] = word_mover_similarity
    doc.user_token_hooks["similarity"] = word_mover_similarity
    return doc

# as before, add this to the pipeline
nlp.add_pipe(overwrite_similarity)
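TODO_IMPLEMENT is deliberately left open above; as one hypothetical way to fill it, a distance d can be mapped to a similarity in (0, 1] with exp(-d). Word Mover's Distance itself needs an extra library such as gensim, so plain Euclidean distance stands in here:

```python
import numpy as np

# hypothetical conversion: distance -> similarity score in (0, 1]
def distance_to_similarity(vector_a, vector_b):
    distance = np.linalg.norm(np.asarray(vector_a) - np.asarray(vector_b))
    return float(np.exp(-distance))

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.0])
c = np.array([0.0, 1.0])
print(distance_to_similarity(a, b))  # 1.0 for identical vectors
print(distance_to_similarity(a, c) < 1.0)  # True: farther apart, lower score
```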
I have an implementation of the TF Hub Universal Sentence Encoder that works in this way: https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub