I am using the Hugging Face implementation of BERTweet (https://huggingface.co/docs/transformers/model_doc/bertweet), and I want to encode some tweets and pass them forward for further processing (prediction). The problem is that when I try to encode a relatively long sentence, the model raises an error.
Example:
import torch
from transformers import AutoModel, AutoTokenizer
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)  # enable automatic tweet normalization
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms "
input_ids = torch.tensor([tokenizer.encode(line)])  # encode the tweet and wrap it as a batch of one
print(input_ids)
with torch.no_grad():
    features = bertweet(input_ids)
Console output:
RuntimeError: The expanded size of the tensor (136) must match the existing size (130) at non-singleton dimension 1. Target sizes: [1, 136]. Tensor sizes: [1, 130]
However, if you change line to:
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
then the model encodes the sentence successfully. Is this expected behavior? I know that BERT takes at most 512 words per sentence, and that BERTweet is essentially a fine-tuned BERT. Is trimming longer sentences a good idea, and would that be an acceptable solution for my problem? Thanks in advance.
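
If trimming is indeed the way to go, here is a rough sketch of how I would truncate at tokenization time. I'm assuming the standard truncation/max_length arguments of the Hugging Face tokenizer here, and the value 128 is only my guess based on the 130-position limit the error above seems to point at:

import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)

# "line" is the long tweet from the example above.
# truncation=True cuts the token sequence down to max_length,
# counting the special tokens added at the start and end.
encoded = tokenizer(line, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    features = bertweet(**encoded)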