python - Paragraph embedding with ELMo

Question

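I'm trying to understand how to prepare paragraphs for ELMo vectorization.

The docs only show how to embed multiple sentences/words at a time.

For example: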

sentences = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog", ""]]
elmo(
    inputs={
        "tokens": sentences,
        "sequence_len": [6, 5]
    },
    signature="tokens",
    as_dict=True
)["elmo"]

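As I understand it, this will return 2 vectors, each representing a given sentence. How would I go about preparing the input data to vectorize a whole paragraph containing multiple sentences? Note that I would like to use my own preprocessing.

Can it be done like this?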

sentences = [["<s>", "the", "cat", "is", "on", "the", "mat", ".", "</s>",
              "<s>", "dogs", "are", "in", "the", "fog", ".", "</s>"]]

Or maybe like this?

sentences = [["the", "cat", "is", "on", "the", "mat", ".", 
              "dogs", "are", "in", "the", "fog", "."]]

score 1 · Accepted Answer

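ELMo produces contextual word vectors. So the vector for a word is a function of the word itself and of the context it appears in, e.g. the sentence.

Like the example from the docs, you want your paragraph to be a list of sentences, each of which is a list of tokens. So your second example. To get this format, you could use the spaCy tokenizer: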

import spacy

# you need to install the language model first. See spacy docs.
nlp = spacy.load('en_core_web_sm')

text = "The cat is on the mat. Dogs are in the fog."
toks = nlp(text)
sentences = [[w.text for w in s] for s in toks.sents]

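I don't think the extra "" padding in the second sentence affects the result, because sequence_len tells ELMo how many real tokens each sentence has. (The token lists in one batch still have to be the same length to form a rectangular tensor, so some padding value is needed whenever sentence lengths differ.)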
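As a minimal sketch of how the padding and the sequence_len values could be derived from sentences of unequal length: the helper name pad_batch and its pad_token parameter are hypothetical, not part of spaCy or the ELMo module, and the sketch assumes the padded positions are ignored because sequence_len gives the true lengths.

```python
def pad_batch(sentences, pad_token=""):
    """Pad token lists to a common length and record the true lengths.

    Hypothetical helper for illustration only.
    """
    # True number of tokens per sentence, before padding.
    sequence_len = [len(s) for s in sentences]
    max_len = max(sequence_len, default=0)
    # Append pad_token until every sentence has max_len tokens.
    padded = [list(s) + [pad_token] * (max_len - len(s)) for s in sentences]
    return padded, sequence_len


sentences = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog"]]
tokens, sequence_len = pad_batch(sentences)
```

The resulting tokens and sequence_len can then be passed as the "tokens" and "sequence_len" entries of the inputs dict shown in the question.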

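Update:

"As I understand it, this will return 2 vectors, each representing a given sentence"

No, this returns a vector for each word in each sentence. If you want the whole paragraph to be the context (for each word), just change it to: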

sentences = [["the", "cat", "is", "on", "the", "mat", "dogs", "are", "in", "the", "fog"]]

and

...
"sequence_len": [11]
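If the sentences come out of your own preprocessing as separate token lists, the single-sequence input above can be built programmatically. This is a rough sketch; paragraph_inputs is a hypothetical helper name, and the dict it returns is only assumed to match the inputs argument of the "tokens" signature used in the question.

```python
from itertools import chain


def paragraph_inputs(sentences):
    """Merge tokenized sentences into one sequence for the "tokens" signature.

    Hypothetical helper for illustration only.
    """
    # Concatenate all sentences into a single flat token list.
    paragraph = list(chain.from_iterable(sentences))
    # A batch of one sequence, so sequence_len has a single entry.
    return {"tokens": [paragraph], "sequence_len": [len(paragraph)]}


sentences = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog"]]
inputs = paragraph_inputs(sentences)
```

With this input every word is embedded with the whole paragraph as its context, and the output still contains one vector per token rather than one vector per paragraph.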
