I pretrained the transformer on my own unlabeled data as follows:
python train_mlm.py sentence-transformers/LaBSE train.txt
based on https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/MLM
Then I want to get embeddings for sentences. Code:
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained('output/sentence-transformers_LaBSE-2021-12-28_13-03-20')
tokenizer = AutoTokenizer.from_pretrained('output/sentence-transformers_LaBSE-2021-12-28_13-03-20')
model = model.eval()

english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]

encoded_input = tokenizer(english_sentences, padding=True, truncation=True, max_length=64, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
print(model_output[0].shape)
The problem is that my output has a shape like (3, 14, 500 000). Without training on my data it was (3, 14, 768). What am I doing wrong? How do I get the final embeddings after training?
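For context, the last dimension of (3, 14, 500 000) matches a vocabulary size, which is what `AutoModelForMaskedLM` emits: per-token logits from the MLM head, not embeddings. Sentence embeddings are normally built from the base encoder's `last_hidden_state` (shape `(batch, seq_len, 768)`, e.g. via `AutoModel`) followed by pooling over tokens. Below is a minimal sketch of mask-aware mean pooling; the `mean_pool` helper is my own illustrative name, and synthetic tensors stand in for the model output so the example runs without the checkpoint:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average over the real tokens only.
    mask = attention_mask.unsqueeze(-1).float()          # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)       # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)             # avoid division by zero
    return summed / counts

# Synthetic stand-in for model(**encoded_input).last_hidden_state,
# with the same (3, 14, 768) shape as in the question.
hidden = torch.randn(3, 14, 768)
attention_mask = torch.ones(3, 14, dtype=torch.long)
attention_mask[0, 4:] = 0  # pretend the first sentence is padded after 4 tokens

embeddings = mean_pool(hidden, attention_mask)
print(embeddings.shape)  # torch.Size([3, 768])
```

In practice one would load the trained directory with `AutoModel.from_pretrained(...)` and feed its `last_hidden_state` together with `encoded_input['attention_mask']` into such a pooling step to get one 768-dimensional vector per sentence.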