
I am trying to use the pre-trained Hugging Face models to translate a pandas column of text from Dutch to English. My input is simple:

Dutch_text             
Hallo, het gaat goed
Hallo, ik ben niet in orde
Stackoverflow is nuttig

I am using the below code to translate the above column and I want to store my result into a new column ENG_Text. So the output will look like this:

ENG_Text             
Hello, I am good
Hi, I'm not okay
Stackoverflow is helpful

The code that I am using is as follows:

#https://huggingface.co/Helsinki-NLP for other pretrained models 
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-nl-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-nl-en")
input_1 = df['Dutch_text']
input_ids = tokenizer("translate English to Dutch: "+input_1, return_tensors="pt").input_ids # Batch size 1
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)

Any help would be appreciated!

1 Answer

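This is not how an MT model is supposed to be used. It is not a GPT-like experiment to test whether the model can understand instructions. It is a translation model that can only translate, so there is no need to add the instruction "translate English to Dutch". (Don't you want to translate the other way round?)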

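Also, translation models are trained to translate sentence by sentence. If you concatenate all the sentences from the column, the result will be treated as a single sentence. You need to either:

  1. Iterate over the column and translate each sentence independently.

  2. Split the column into batches so that you can parallelize the translation. Note that in this case you need to pad the sentences in each batch to the same length. The easiest way to do that is the tokenizer's batch_encode_plus method.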
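
The two options above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names `batched`, `translate_column`, `make_marian_translator` and `demo` are made up for illustration, and the batch sizes are arbitrary. It calls the tokenizer directly with `padding=True`, which, to my knowledge, pads a batch the same way `batch_encode_plus` does. Setting `batch_size=1` corresponds to option 1, and anything larger to option 2.

```python
from typing import Callable, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_column(
    texts: List[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Translate `texts` batch by batch, keeping the original order."""
    results: List[str] = []
    for batch in batched(list(texts), batch_size):
        results.extend(translate_batch(batch))
    return results


def make_marian_translator(
    model_name: str = "Helsinki-NLP/opus-mt-nl-en",
) -> Callable[[List[str]], List[str]]:
    """Build a function that translates a list of sentences (hypothetical helper)."""
    # Imported here so the batching helpers above work without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    def translate_batch(sentences: List[str]) -> List[str]:
        # No instruction prefix: the model only translates Dutch to English.
        # padding=True pads every sentence to the longest one in the batch.
        encoded = tokenizer(
            sentences, return_tensors="pt", padding=True, truncation=True
        )
        outputs = model.generate(**encoded)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return translate_batch


def demo() -> None:
    """Translate the example column from the question (downloads the model)."""
    import pandas as pd

    df = pd.DataFrame(
        {
            "Dutch_text": [
                "Hallo, het gaat goed",
                "Hallo, ik ben niet in orde",
                "Stackoverflow is nuttig",
            ]
        }
    )
    translate_batch = make_marian_translator()
    df["ENG_Text"] = translate_column(
        df["Dutch_text"].tolist(), translate_batch, batch_size=2
    )
    print(df)
```

Calling `demo()` downloads the model on first use and prints the DataFrame with the new ENG_Text column. The exact translations depend on the model, so they may differ from the wording expected in the question.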

Answered 2020-12-29T09:37:06.570