When I try to get word embeddings for a sentence with Bio_ClinicalBERT, an 8-word sentence gives me 11 token ids (plus [CLS] and [SEP]), because "embeddings" is an out-of-vocabulary word/token and gets split into em, ##bed, ##ding, ##s.
I'd like to know whether there is any aggregation strategy for these subword vectors, other than averaging them, that would make sense.
import torch
from transformers import AutoTokenizer, AutoModel

# download and load model
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

sentences = ['This framework generates embeddings for each input sentence']

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

print(encoded_input['input_ids'].shape)
Output:
torch.Size([1, 13])
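The per-token vectors themselves are in model_output.last_hidden_state; since this model is BERT-base-sized, I would expect a hidden size of 768 here:

print(model_output.last_hidden_state.shape)  # expected: torch.Size([1, 13, 768])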
for token in encoded_input['input_ids'][0]:
    print(tokenizer.decode([token]))
Output:
[CLS]
this
framework
generates
em
##bed
##ding
##s
for
each
input
sentence
[SEP]
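To make the question concrete, this is the averaging baseline I have in mind: collapse the rows of last_hidden_state that belong to the same word into one mean vector. A minimal sketch, assuming the fast tokenizer is loaded (the default for this model), so that word_ids() provides the token-to-word mapping:

# token_embeddings: one vector per token for the first (only) sentence, shape (13, 768)
token_embeddings = model_output.last_hidden_state[0]

# word_ids() maps each token position to the index of the word it came from
# (None for [CLS]/[SEP]); requires a fast tokenizer
word_ids = encoded_input.word_ids(batch_index=0)

word_vectors = []
current_word, current_rows = None, []
for pos, wid in enumerate(word_ids):
    if wid is None:
        continue  # skip special tokens
    if wid != current_word and current_rows:
        # finished collecting one word's subword rows: average them
        word_vectors.append(torch.stack(current_rows).mean(dim=0))
        current_rows = []
    current_word = wid
    current_rows.append(token_embeddings[pos])
if current_rows:
    word_vectors.append(torch.stack(current_rows).mean(dim=0))

print(len(word_vectors))  # 8 word vectors; em / ##bed / ##ding / ##s collapsed into one

This gives one 768-dimensional vector per word, with the mean over subword pieces as the aggregation step I would like alternatives to.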