tensorflow - create_pretraining_data.py writes 0 records to tf_examples.tfrecord while training a custom BERT model

Question

I am training a custom BERT model on my own corpus. I generated the vocab file with BertWordPieceTokenizer and then ran the following command:

!python create_pretraining_data.py \
  --input_file=/content/drive/My Drive/internet_archive_scifi_v3.txt \
  --output_file=/content/sample_data/tf_examples.tfrecord \
  --vocab_file=/content/sample_data/sifi_13sep-vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5

The output is:

INFO:tensorflow:*** Reading from input files ***

INFO:tensorflow:*** Writing to output files ***

INFO:tensorflow: /content/sample_data/tf_examples.tfrecord

INFO:tensorflow:Wrote 0 total instances

I don't know why I always get 0 instances in tf_examples.tfrecord. What am I doing wrong?

FYI, I am using TF version 1.12. The generated vocab file is 290 KB.

score 0 · Accepted Answer

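The script cannot read the input file. The space in My Drive is not escaped, so the shell splits the path there and the script never receives the full file name. Use My\ Drive instead of My Drive: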

--input_file=/content/drive/My\ Drive/internet_archive_scifi_v3.txt
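To see why the unescaped space matters, here is a minimal Python sketch. It assumes the notebook hands the command to a POSIX-style shell and uses the standard library's shlex.split to approximate how that shell tokenizes it. The helper input_file_argument is hypothetical and is not part of the BERT scripts. The final glob.glob call is only a stand-in for checking that a path matches a file before you run the script: an empty match list would be consistent with the "Wrote 0 total instances" log.

```python
import glob
import shlex


def input_file_argument(command_line):
    """Return the --input_file value as a POSIX shell would pass it on.

    Hypothetical helper for illustration; not part of create_pretraining_data.py.
    """
    for arg in shlex.split(command_line):
        if arg.startswith("--input_file="):
            return arg[len("--input_file="):]
    return None


unescaped = (
    "python create_pretraining_data.py "
    "--input_file=/content/drive/My Drive/internet_archive_scifi_v3.txt"
)
escaped = (
    "python create_pretraining_data.py "
    r"--input_file=/content/drive/My\ Drive/internet_archive_scifi_v3.txt"
)

unescaped_value = input_file_argument(unescaped)
escaped_value = input_file_argument(escaped)

print(unescaped_value)  # /content/drive/My
print(escaped_value)    # /content/drive/My Drive/internet_archive_scifi_v3.txt

# Pre-flight check: glob.glob returns an empty list when nothing matches.
matches = glob.glob(escaped_value)
print(len(matches), "file(s) match the input path")
```

Wrapping the whole path in quotes ("--input_file=/content/drive/My Drive/internet_archive_scifi_v3.txt") should have the same effect as the backslash, since both keep the path as a single argument.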
