nlp - Parsing parentheses with the English model

Question

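This sentence is taken from Simple English Wikipedia:

There are three things in air, Nitrogen (79%), oxygen (20%), and other types of gases (1%).

The percentages in parentheses are not handled well by spaCy 2.0 and 2.1. What is the best way to deal with this kind of issue?

Here is the visualization: [image not included]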

score 1 · Accepted Answer

Use a regex together with spaCy's merge/retokenize methods to merge the content of each pair of parentheses into a single token.

>>> import spacy
>>> import re
>>> my_str = "There are three things in air, Nitrogen (79%), oxygen (20%), and other types of gases (1%)."
>>> nlp = spacy.load('en')
>>> parsed = nlp(my_str)
>>> [(x.text,x.pos_) for x in parsed]
[('There', 'ADV'), ('are', 'VERB'), ('three', 'NUM'), ('things', 'NOUN'), ('in', 'ADP'), ('air', 'NOUN'), (',', 'PUNCT'), ('Nitrogen', 'PROPN'), ('(', 'PUNCT'), ('79', 'NUM'), ('%', 'NOUN'), (')', 'PUNCT'), (',', 'PUNCT'), ('oxygen', 'NOUN'), ('(', 'PUNCT'), ('20', 'NUM'), ('%', 'NOUN'), (')', 'PUNCT'), (',', 'PUNCT'), ('and', 'CCONJ'), ('other', 'ADJ'), ('types', 'NOUN'), ('of', 'ADP'), ('gases', 'NOUN'), ('(', 'PUNCT'), ('1', 'NUM'), ('%', 'NOUN'), (')', 'PUNCT'), ('.', 'PUNCT')]

>>> indexes = [m.span() for m in re.finditer(r'\([\w%]{0,5}\)',my_str,flags=re.IGNORECASE)]
>>> indexes
[(40, 45), (54, 59), (86, 90)]
>>> for start,end in indexes:
...     parsed.merge(start_idx=start,end_idx=end)
...
(79%)
(20%)
(1%)
>>> [(x.text,x.pos_) for x in parsed]
[('There', 'ADV'), ('are', 'VERB'), ('three', 'NUM'), ('things', 'NOUN'), ('in', 'ADP'), ('air', 'NOUN'), (',', 'PUNCT'), ('Nitrogen', 'PROPN'), ('(79%)', 'PUNCT'), (',', 'PUNCT'), ('oxygen', 'NOUN'), ('(20%)', 'PUNCT'), (',', 'PUNCT'), ('and', 'CCONJ'), ('other', 'ADJ'), ('types', 'NOUN'), ('of', 'ADP'), ('gases', 'NOUN'), ('(1%)', 'PUNCT'), ('.', 'PUNCT')]
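
The session above uses `Doc.merge`, which is deprecated from spaCy 2.1 onward in favour of the retokenizer. The following is a minimal sketch of the same idea with `Doc.retokenize`, assuming spaCy 2.1 or later is installed. It uses a blank English pipeline (tokenizer only, so no POS tags) purely to avoid a model download; the variable names are illustrative and not part of the original answer.

```python
import re
import spacy

text = ("There are three things in air, Nitrogen (79%), oxygen (20%), "
        "and other types of gases (1%).")

# Assumption: a blank English pipeline, which only tokenizes.
# Swap in a trained model if you also need tags and dependencies.
nlp = spacy.blank("en")
doc = nlp(text)

# Find short parenthesized chunks such as "(79%)" by character offset.
pattern = re.compile(r"\([\w%]{0,5}\)")
spans = [doc.char_span(m.start(), m.end()) for m in pattern.finditer(text)]
# char_span returns None when offsets do not line up with token boundaries.
spans = [span for span in spans if span is not None]

# Merge every matched span into a single token in one retokenization pass.
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)

tokens = [token.text for token in doc]
print(tokens)
```

Collecting the spans first and merging them inside a single `with` block keeps the character offsets valid, since the document is only modified when the block exits.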

score 0 · Accepted Answer

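I originally wrote an answer on the issue tracker, but Stack Overflow is definitely a better place for this kind of question.

I just tested your example with the latest version, and the tokenization looks like this: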

['There', 'are', 'three', 'things', 'in', 'air', ',', 'Nitrogen', '(', '79', '%', ')', ',', 
'oxygen', '(', '20', '%', ')', ',', 'and', 'other', 'types', 'of', 'gases', '(', '1', '%', ')', '.']

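Here is the parse tree, which looks fine to me. (If you want to try it yourself, note that I set options={'collapse_punct': False, 'compact': True} to show every punctuation mark separately and to make the large tree easier to read.)

That said, you will probably still find plenty of edge cases and examples where the out-of-the-box tokenization rules do not generalize to every combination of punctuation and parentheses, or where the pretrained parser or tagger makes incorrect predictions. So if you are dealing with longer insertions in parentheses and the parser struggles with them, you may want to fine-tune it with more examples of that kind.

Looking at a single sentence in isolation is not very helpful, because it does not give you a good idea of the overall accuracy on your data or of what to focus on. Even if you train a state-of-the-art model that reaches 90% accuracy on your data, that still means one in every 10 predictions is wrong.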
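
To make the point about aggregate accuracy concrete, here is a small sketch that scores predictions against gold labels over a whole sample rather than one sentence. The `accuracy` helper and the ten tags below are hypothetical and made up for illustration; they are not output from spaCy.

```python
def accuracy(predicted, gold):
    """Return the fraction of positions where predicted equals gold."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sample")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical tags for ten tokens; exactly one prediction is wrong.
gold = ["ADV", "VERB", "NUM", "NOUN", "ADP",
        "NOUN", "PUNCT", "PROPN", "PUNCT", "NUM"]
predicted = ["ADV", "VERB", "NUM", "NOUN", "ADP",
             "NOUN", "PUNCT", "NOUN", "PUNCT", "NUM"]

score = accuracy(predicted, gold)
errors = sum(p != g for p, g in zip(predicted, gold))
print(f"accuracy={score:.0%}, errors={errors} of {len(gold)}")
```

Running the same kind of count over a representative sample of your own data shows where the errors cluster, which is more informative than inspecting one parse.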
