python - Preparing training data for training ELMo embeddings from scratch

Question

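I am trying to build my own custom ELMo embeddings for the chemistry domain. I am following the instructions at https://github.com/allenai/bilm-tf

How should I prepare the training data if my domain, chemistry in this case, has many multi-word tokens? For example: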

1. Original Sentences: "This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide. \n This is another sentence."

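Here "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide" is a single token, but it contains several space-separated words. As a result, it gets split into 3 tokens: ['3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl', 'tetrazolium', 'bromide'].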

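How can I avoid this? Can I provide the input training data in one of the following formats to avoid it?

Training data (1): a list of tokens for each sentence, so the training text file would contain a list of token lists.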

[['This', 'is', 'a', 'multi', 'word', 'chemical', 'component', '3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide'], ['This', 'is', 'another', 'sentence']]

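Training data (2): here I have joined the words of the multi-word token with the "|" symbol. "This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl|tetrazolium|bromide. \n This is another sentence."

Please advise on the best way to prepare the training data.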

score 0 · Accepted Answer

You can build your own custom spaCy Tokenizer by adding your own special cases.

First, install the required packages.

pip install spacy
python -m spacy download en_core_web_sm

Then, run the following code.

import spacy
from spacy.symbols import ORTH

# Named `text` rather than `input` so the built-in input() is not shadowed
text = "This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide.\nThis is another sentence."
output = []

nlp_tokenisation = spacy.load("en_core_web_sm") # Initialise

# Add additional rules: keep the whole chemical name as one token
special_case = [{ORTH: "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide"}]
nlp_tokenisation.tokenizer.add_special_case("3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide", special_case)

lines = text.split("\n") # Split lines

for line in lines:
    doc = nlp_tokenisation(line)
    output.append([token.text for token in doc]) # One token list per sentence

print(output)

It should return the following output:

[['This', 'is', 'a', 'multi', 'word', 'chemical', 'component', '3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide', '.'], ['This', 'is', 'another', 'sentence', '.']]
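
One gap is worth noting for bilm-tf specifically. As far as I can tell from its README, training files are expected to hold pre-tokenized, whitespace-separated text, one sentence per line, so a token that still contains spaces would be split again when the file is read. Below is a minimal sketch of one workaround. It assumes the "|" joiner from option (2) in the question never occurs inside a real token. `JOINER` and `to_training_line` are illustrative names, not part of spaCy or bilm-tf.

```python
# Token lists in the shape produced by the tokenizer above.
tokenised = [
    ["This", "is", "a", "multi", "word", "chemical", "component",
     "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide", "."],
    ["This", "is", "another", "sentence", "."],
]

JOINER = "|"  # assumption: a character that never appears inside a real token


def to_training_line(tokens, joiner=JOINER):
    """Return one whitespace-separated line; spaces inside a token become `joiner`."""
    return " ".join(
        joiner.join(token.split())  # collapse internal whitespace into the joiner
        for token in tokens
        if token.strip()            # skip whitespace-only tokens
    )


lines = [to_training_line(tokens) for tokens in tokenised]
print("\n".join(lines))
```

The same joined form would presumably also need to appear in the vocabulary file, so that the chemical name is not treated as an unknown word during training.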

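You can then adjust the tokenizer to suit your needs (for example, fine-tuning how punctuation is handled). You can also write a script that automatically feeds all of your chemical terms into this tokenizer. For more information on spaCy, see their documentation on the Tokenizer and on Linguistic Features. Although this answer is late, I hope it helps future developers.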
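
A rough sketch of such a script is below. It assumes the chemical terms are already available as a Python list; `chemical_terms` is a hypothetical name, and the second entry is only a placeholder example. It uses `spacy.blank("en")` instead of `en_core_web_sm` on the assumption that only the tokenizer is needed here.

```python
import spacy
from spacy.symbols import ORTH

# Hypothetical term list; in practice it might be read from a file, one term per line.
chemical_terms = [
    "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide",
    "sodium dodecyl sulfate",
]

# Assumption: a blank English pipeline is sufficient because only tokenization is used.
nlp = spacy.blank("en")

for term in chemical_terms:
    # The ORTH values must join back to exactly the string being registered.
    nlp.tokenizer.add_special_case(term, [{ORTH: term}])

doc = nlp(
    "This is a multi word chemical component "
    "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide."
)
tokens = [token.text for token in doc]
print(tokens)
```

Special cases are matched against the exact string, so spelling or capitalisation variants of a term would likely need their own entries.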
