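I want to use SentencePiece (https://github.com/google/sentencepiece) in a Google Colab project where I am training an OpenNMT model. I'm a bit confused about how to set up the SentencePiece binaries in Google Colab. Do I need to build them with cmake?
When I install it with pip install sentencepiece
and include sentencepiece in the "transforms" of my config, I get the error below.
After running this command (which matches the OpenNMT translation tutorial):
!onmt_build_vocab -config en-sp.yaml -n_sample -1
I get: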
Traceback (most recent call last):
  File "/usr/local/bin/onmt_build_vocab", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 63, in main
    build_vocab_main(opts)
  File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 32, in build_vocab_main
    transforms = make_transforms(opts, transforms_cls, fields)
  File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/transform.py", line 176, in make_transforms
    transform_obj.warm_up(vocabs)
  File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/tokenize.py", line 110, in warm_up
    load_src_model.Load(self.src_subword_model)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
Here is how my config (en-sp.yaml) is written. I'm not sure where "not a string" comes from.
## Where the samples will be written
save_data: en-sp/run/example
## Where the vocab(s) will be written
src_vocab: en-sp/run/example.vocab.src
tgt_vocab: en-sp/run/example.vocab.tgt
## Where the model will be saved
save_model: drive/MyDrive/Europarl/model/model
# Prevent overwriting existing files in the folder
overwrite: False
# Corpus opts:
data:
    europarl:
        path_src: train_europarl-v7.es-en.es
        path_tgt: train_europarl-v7.es-en.en
        transforms: [sentencepiece, filtertoolong]
        weight: 1
    valid:
        path_src: dev_europarl-v7.es-en.es
        path_tgt: dev_europarl-v7.es-en.en
        transforms: [sentencepiece]
skip_empty_level: silent
world_size: 1
gpu_ranks: [0]
...
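One possible reading of the traceback: Load is called with self.src_subword_model, so "not a string" would be consistent with that option being unset rather than with a broken SentencePiece install. Below is a minimal sketch of a check for that, using only the Python standard library. It assumes the relevant options are named src_subword_model (seen in the traceback) and tgt_subword_model (assumed target-side counterpart), and that the elided part of the config does not already set them. The function name missing_subword_options and the sample text are hypothetical, for illustration only.

```python
import re

# Option names: src_subword_model appears in the traceback;
# tgt_subword_model is an assumed target-side counterpart.
REQUIRED_OPTIONS = ("src_subword_model", "tgt_subword_model")


def missing_subword_options(config_text):
    """Return the subword-model options that a sentencepiece config lacks.

    This is a plain-text scan, not a YAML parser: it only recognises
    inline lists such as `transforms: [sentencepiece, filtertoolong]`.
    """
    # Strip comments so a commented-out option does not count as set.
    lines = [line.split("#", 1)[0] for line in config_text.splitlines()]
    body = "\n".join(lines)

    # Nothing to check if no transforms line mentions sentencepiece.
    if not re.search(r"^[ \t]*transforms:.*\bsentencepiece\b", body, re.M):
        return []

    missing = []
    for option in REQUIRED_OPTIONS:
        # An option counts as set only if a value follows on the same line.
        if not re.search(rf"^[ \t]*{option}:[ \t]*\S", body, re.M):
            missing.append(option)
    return missing


if __name__ == "__main__":
    # Hypothetical excerpt mirroring the shape of the config above.
    sample = (
        "data:\n"
        "    europarl:\n"
        "        transforms: [sentencepiece, filtertoolong]\n"
    )
    print(missing_subword_options(sample))
```

To check the real file, the contents of en-sp.yaml can be read and passed to the function in place of the sample string. An empty list would suggest looking elsewhere for the cause.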
Edit: I kept searching for this problem and found a Google Colab notebook that builds SentencePiece with cmake: https://colab.research.google.com/github/mymusise/gpt2-quickly/blob/main/examples/gpt2_quickly.ipynb#scrollTo=dDAup5dxDXZW. However, even after building with cmake, I still get the same error.