machine-learning - Unlabeled instances in doccano and spaCy: do they provide any value?

Question

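I am using doccano for sequence labeling and spaCy for further modeling. Some of the sentences I annotated contain none of the labels I am interested in, so they remain "unlabeled", i.e. they have no labels.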

{"id": 79, "data": "This powerful charm would protect him until he became of age, or no longer called his aunt's house home.", "label": []}
{"id": 82, "data": "He began attending Hogwarts School of Witchcraft and Wizardry in 1991.", "label": []}
{"id": 85, "data": "He later became the youngest Quidditch Seeker in over a century and eventually the captain of the Gryffindor House Quidditch Team in his sixth year, winning two Quidditch Cups.", "label": []}

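I want to train spaCy to recognize character names in all their variants.

Now the questions:

Is there any value in including the unlabeled instances when training the spaCy model?
If so, should I treat this data as an "imbalanced dataset" and act accordingly (boosting? SMOTE? oversampling? etc.)?
What is the best practice in this situation?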

score 0 · Accepted Answer

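Yes, you should include some examples with no labels so that the model can learn what not to label. For example, if every capitalized word in all of your example sentences is labeled, the model may learn to always label capitalized words.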
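As a rough sketch of what keeping those sentences looks like in practice, the snippet below converts doccano-style JSONL lines into `(text, {"entities": [...]})` tuples, a shape commonly used for spaCy NER training data, and keeps records whose `label` list is empty. It assumes each label is a `[start, end, label_name]` triple; the function name `doccano_to_spacy`, the second sample record, and the `CHARACTER` label are made up for illustration and are not part of the question.

```python
import json


def doccano_to_spacy(lines):
    """Convert doccano JSONL lines to (text, {"entities": [...]}) tuples.

    Assumption: each entry in "label" is a [start, end, label_name] triple.
    """
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        entities = [(start, end, name) for start, end, name in record["label"]]
        # Records with an empty label list are kept on purpose:
        # they act as negative examples showing what not to tag.
        examples.append((record["data"], {"entities": entities}))
    return examples


sample = [
    # Taken from the question: a sentence with no labels.
    '{"id": 82, "data": "He began attending Hogwarts School of Witchcraft and Wizardry in 1991.", "label": []}',
    # Hypothetical labeled record, added only for illustration.
    '{"id": 1, "data": "Harry met Hermione on the train.", "label": [[0, 5, "CHARACTER"], [10, 18, "CHARACTER"]]}',
]

train_data = doccano_to_spacy(sample)
for text, annotations in train_data:
    print(annotations["entities"], "|", text)
```

The only design choice here is that empty-label records are not filtered out, so they reach training alongside the labeled ones.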
