
I am using Hugging Face's BERTweet implementation (https://huggingface.co/docs/transformers/model_doc/bertweet), and I want to encode some tweets and forward them for further processing (prediction). The problem is that when I try to encode a relatively long sentence, the model raises an error.

Example:

import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)  # normalization=True enables BERTweet's tweet-specific normalization (user mentions, URLs, emoji)


line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms  "
input_ids = torch.tensor([tokenizer.encode(line)]) 

print(input_ids)
with torch.no_grad():
    features = bertweet(input_ids)

Console output:

RuntimeError: The expanded size of the tensor (136) must match the existing size (130) at non-singleton dimension 1.  Target sizes: [1, 136].  Tensor sizes: [1, 130]
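
The 130 in the error message presumably comes from the model's positional embedding table. As a quick sanity check (my assumption is that the limit is exposed as max_position_embeddings on the config, as with other RoBERTa-style models):

print(bertweet.config.max_position_embeddings)  # 130 for vinai/bertweet-base, if my assumption is right
print(tokenizer.model_max_length)               # the maximum length the tokenizer itself advertises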

However, if you change line to:

line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

then the model encodes the sentence successfully. Is this the expected behavior? I know that BERT accepts at most 512 tokens per sentence, and BERTweet is basically a fine-tuned BERT. Is cutting longer sentences short a good idea (see the sketch below for what I mean), and is that an acceptable solution for my problem? Thanks in advance.
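
A minimal sketch of the truncation I have in mind, assuming the tokenizer's built-in truncation argument and a 128-token budget (128 is my guess: the 130 positions from the error message minus the two special tokens):

encoding = tokenizer(line, truncation=True, max_length=128, return_tensors="pt")  # max_length=128 is an assumption, not confirmed by the docs

with torch.no_grad():
    features = bertweet(**encoding)  # no size mismatch once the input fits the positional table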

