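I'm using doccano for sequence labeling and spaCy for further modeling. Some of the sentences I annotated contain none of the labels I'm interested in, so they remain "unlabeled", i.e. they have an empty label list: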
{"id": 79, "data": "This powerful charm would protect him until he became of age, or no longer called his aunt's house home.", "label": []}
{"id": 82, "data": "He began attending Hogwarts School of Witchcraft and Wizardry in 1991.", "label": []}
{"id": 85, "data": "He later became the youngest Quidditch Seeker in over a century and eventually the captain of the Gryffindor House Quidditch Team in his sixth year, winning two Quidditch Cups.", "label": []}
I want to train spaCy to recognize character names in all their variants.
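For context, here is a minimal sketch of how records like the ones above could be converted into spaCy-style `(text, {"entities": [...]})` tuples while keeping the sentences that have no labels. It assumes the doccano export stores the text under `"data"` and each label as a `[start, end, tag]` triple; the function name `doccano_to_spacy`, the `CHARACTER` tag, and the second sample record are hypothetical and only there for illustration.

```python
import json


def doccano_to_spacy(jsonl_lines):
    """Convert doccano JSONL records to (text, {"entities": [...]}) tuples.

    Records with an empty "label" list are kept, with an empty entity list.
    """
    examples = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        # Assumed layout: "label" is a list of [start, end, tag] triples.
        entities = [(start, end, tag) for start, end, tag in record["label"]]
        examples.append((record["data"], {"entities": entities}))
    return examples


sample = [
    # Taken from the export above: a sentence with no labels.
    '{"id": 82, "data": "He began attending Hogwarts School of Witchcraft and Wizardry in 1991.", "label": []}',
    # Hypothetical labeled record, made up for illustration.
    '{"id": 0, "data": "Harry Potter lived with his aunt.", "label": [[0, 12, "CHARACTER"]]}',
]

examples = doccano_to_spacy(sample)
for text, annotations in examples:
    print(annotations["entities"], "|", text)
```

In spaCy v3, each tuple could then be wrapped with `Example.from_dict(nlp.make_doc(text), annotations)` before training. Whether the empty-entity examples should be kept at all is exactly what the questions below are about.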
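Now the questions:
- Is there any value in including the unlabeled instances when training a spaCy model?
- If so, should I treat this data as an "imbalanced dataset" and handle it accordingly (boosting? SMOTE? oversampling? etc.)
- What is the best practice in this situation?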