When I try to get word embeddings for a sentence with Bio_ClinicalBERT, an 8-word sentence gives me 11 token ids (plus [CLS] and [SEP]), because "embeddings" is an out-of-vocabulary word/token and gets split into em, ##bed, ##ding, ##s.
I'd like to know whether there is any aggregation strategy for these subword vectors, other than averaging them, that would make sense.
import torch
from transformers import AutoTokenizer, AutoModel

# download and load model
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

sentences = ['This framework generates embeddings for each input sentence']

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

print(encoded_input['input_ids'].shape)
Output:
torch.Size([1, 13])
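The per-token vectors themselves are in model_output.last_hidden_state; since this model is BERT-base-sized, I would expect a hidden size of 768 here:

print(model_output.last_hidden_state.shape)  # expected: torch.Size([1, 13, 768])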
for token in encoded_input['input_ids'][0]:
    print(tokenizer.decode([token]))
Output:
[CLS]
this
framework
generates
em
##bed
##ding
##s
for
each
input
sentence
[SEP]
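To make the question concrete, this is the averaging baseline I have in mind: collapse the rows of last_hidden_state that belong to the same word into one mean vector. A minimal sketch, assuming the fast tokenizer is loaded (the default for this model), so that word_ids() provides the token-to-word mapping:

# token_embeddings: one vector per token for the first (only) sentence, shape (13, 768)
token_embeddings = model_output.last_hidden_state[0]

# word_ids() maps each token position to the index of the word it came from
# (None for [CLS]/[SEP]); requires a fast tokenizer
word_ids = encoded_input.word_ids(batch_index=0)

word_vectors = []
current_word, current_rows = None, []
for pos, wid in enumerate(word_ids):
    if wid is None:
        continue  # skip special tokens
    if wid != current_word and current_rows:
        # finished collecting one word's subword rows: average them
        word_vectors.append(torch.stack(current_rows).mean(dim=0))
        current_rows = []
    current_word = wid
    current_rows.append(token_embeddings[pos])
if current_rows:
    word_vectors.append(torch.stack(current_rows).mean(dim=0))

print(len(word_vectors))  # 8 word vectors; em / ##bed / ##ding / ##s collapsed into one

This gives one 768-dimensional vector per word, with the mean over subword pieces as the aggregation step I would like alternatives to.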