python - PyTorch Huggingface BERT-NLP for Named Entity Recognition

Question

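I have been using HuggingFace's PyTorch implementation of Google's BERT on the MADE 1.0 dataset for quite some time. Up until my last run (11 February), fine-tuning the model with this library gave me an F-score of 0.81 on my named entity recognition task. This week, however, when I ran the exact same code that had previously compiled and run, it threw an error while executing this statement: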

input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

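ValueError: Token indices sequence length is longer than the specified maximum sequence length for this BERT model (632 > 512). Running this sequence through BERT will result in indexing errors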

The full code is available in this colab notebook.

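To get around this error, I changed the statement above to the one below, which takes only the first 512 tokens of each sequence. I also made the changes needed to add the index of [SEP] to the end of the truncated/padded sequence, as BERT requires.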

input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt[:512]) for txt in tokenized_texts], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
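The effect of this change on the padded IDs can be checked with a small, library-free sketch. Here `pad_post` is a hypothetical stand-in for `pad_sequences(..., truncating="post", padding="post")`, and the token IDs are made up; the sketch only illustrates that slicing to 512 first cannot change what survives a later cut to MAX_LEN = 75.

```python
MAX_LEN = 75      # same value as in the question
BERT_LIMIT = 512  # maximum sequence length reported in the error


def pad_post(seq, maxlen, value=0):
    """Hypothetical stand-in for post-truncation plus post-padding."""
    seq = list(seq)[:maxlen]                     # drop everything past maxlen
    return seq + [value] * (maxlen - len(seq))   # pad short sequences on the right


# Made-up token IDs: one sequence longer than 512, one shorter than 75.
long_ids = list(range(1, 633))   # 632 IDs, like the sequence in the error
short_ids = list(range(1, 41))   # 40 IDs

for ids in (long_ids, short_ids):
    direct = pad_post(ids, MAX_LEN)
    sliced_first = pad_post(ids[:BERT_LIMIT], MAX_LEN)
    assert direct == sliced_first  # identical because MAX_LEN <= BERT_LIMIT
```

If the real pipeline behaves like this stand-in, the `[:512]` slice by itself leaves the final `input_ids` unchanged.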

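The results should not have changed, since I only take the first 512 tokens of each sequence and then truncate to 75 anyway (MAX_LEN=75). Yet my F-score has dropped to 0.40 and my precision to 0.27, while recall stays the same (0.85). I cannot share the dataset because I have signed a confidentiality clause, but I can assure you that all the preprocessing BERT requires has been done, and that all extended tokens (such as Johanson --> Johan ##son) have been tagged with X and replaced after prediction, as described in the BERT paper.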
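The X-tagging of extended tokens described above can be sketched as follows. This is a minimal illustration, not the notebook's code: `toy_wordpiece` is a hypothetical stand-in for the BERT tokenizer with one hard-coded split, the tag names are invented, and dropping the X positions after prediction is only one reading of "replaced after prediction".

```python
def toy_wordpiece(word):
    """Hypothetical tokenizer: splits only the example word from the question."""
    splits = {"Johanson": ["Johan", "##son"]}
    return splits.get(word, [word])


def align_labels(words, labels):
    """Give the first piece the word's label and every later piece the label 'X'."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        sub = toy_wordpiece(word)
        pieces.extend(sub)
        piece_labels.extend([label] + ["X"] * (len(sub) - 1))
    return pieces, piece_labels


def drop_x(predictions, piece_labels):
    """After prediction, keep only positions whose gold label is not 'X'."""
    return [p for p, g in zip(predictions, piece_labels) if g != "X"]


words = ["Dr", "Johanson", "prescribed", "aspirin"]
labels = ["O", "B-PER", "O", "B-DRUG"]   # invented tag set
pieces, piece_labels = align_labels(words, labels)
```

With this alignment, scoring is done on one label per original word, so a change in how the tokenizer splits words could shift the labels if the alignment step is not updated with it.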

Has anyone else faced a similar issue, or can anyone elaborate on what the problem might be, or on what the PyTorch (Huggingface) people have changed recently?

score 4 · Accepted Answer

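I found a way around this. Running the same code with pytorch-pretrained-bert==0.4.0 solves the problem, and performance returns to normal. Something in the BERT Tokenizer or in BERTForTokenClassification in the new update is degrading model performance. Hopefully HuggingFace clears this up soon. :)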

pytorch-pretrained-bert==0.4.0, test F1 score: 0.82

pytorch-pretrained-bert==0.6.1, test F1 score: 0.41
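Since the two versions give such different scores, it may help to guard a training script against a silent upgrade. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `matches_pin` are hypothetical, and the pinned version is simply the one reported above.

```python
from importlib import metadata

PACKAGE = "pytorch-pretrained-bert"
KNOWN_GOOD = "0.4.0"  # version reported above with test F1 0.82


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(found, expected=KNOWN_GOOD):
    """True only when the installed version is exactly the pinned one."""
    return found == expected


found = installed_version(PACKAGE)
if not matches_pin(found):
    # Report the mismatch before any training starts.
    print(f"{PACKAGE}: found {found}, expected {KNOWN_GOOD}")
```

Running this at the top of a notebook makes a version change visible before it shows up as a drop in F1.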

Thanks.

score 1 · Accepted Answer

I think you should use batch_encode_plus, and use its mask output as well as the encoding.

See batch_encode_plus in https://huggingface.co/transformers/main_classes/tokenizer.html
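To show what the mask output carries, here is a library-free sketch, not the tokenizer's own implementation. `encode_batch` is a hypothetical helper whose result mirrors the `input_ids` and `attention_mask` entries that `batch_encode_plus` returns; the token IDs and the padding ID of 0 are assumptions made for the example.

```python
PAD_ID = 0  # assumed padding ID for this sketch


def encode_batch(id_lists, max_len, pad_id=PAD_ID):
    """Truncate/pad each ID list and build a matching 0/1 attention mask."""
    input_ids, attention_mask = [], []
    for ids in id_lists:
        kept = list(ids)[:max_len]          # post-truncation
        n_pad = max_len - len(kept)         # how many pad positions to add
        input_ids.append(kept + [pad_id] * n_pad)
        attention_mask.append([1] * len(kept) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up IDs for two short sequences of different lengths.
batch = encode_batch([[101, 7, 8, 102], [101, 9, 102]], max_len=5)
```

The mask marks real tokens with 1 and padding with 0, so the model can ignore padded positions instead of treating them as input.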
