machine-learning - How to use the CoNLL 2003 corpus in python-crfsuite

Question

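I have downloaded the CoNLL 2003 corpus ("eng.train"). I want to use it to train an entity extractor with python-crfsuite, but I don't know how to load this file for training.

I found this example, but it is not for English.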

import nltk

train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

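Also, in the future I would like to train new entities other than POS or location. How can I add those?

Please also suggest how to handle entities that span multiple words.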

score 2 · Accepted Answer

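You can use ConllCorpusReader.

Here is the general form, where the first argument is the directory that contains the corpus and the second is the file name: ConllCorpusReader('file path', 'file name', columntypes=['','',''])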

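Here is the list of column types you can use: 'words', 'pos', 'tree', 'chunk', 'ne', 'srl', 'ignore' (also available as the constants WORDS, POS, TREE, CHUNK, NE, SRL and IGNORE on ConllCorpusReader).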
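To make the role of the column types concrete, here is a minimal standard-library sketch of how a list of column types labels the whitespace-separated columns of each line. It is not NLTK's implementation; `read_columns` and `SAMPLE` are hypothetical names, and the sample lines are hand-written in the four-column word / POS / chunk / NE layout.

```python
# Hand-written sample in a four-column layout: word, POS tag, chunk tag, NE tag.
# Sentences are separated by a blank line.
SAMPLE = """\
EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""


def read_columns(text, columntypes):
    """Return sentences as lists of per-token dicts keyed by column type.

    Illustrative helper only (an assumption, not part of NLTK).
    Columns labelled 'ignore' are dropped.
    """
    sentences, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token = {
            name: value
            for name, value in zip(columntypes, fields)
            if name != "ignore"
        }
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


# Label every column by what it actually contains.
sents = read_columns(SAMPLE, ["words", "pos", "chunk", "ne"])

# Skip the third column and file the fourth (NE tags) under 'chunk'.
remapped = read_columns(SAMPLE, ["words", "pos", "ignore", "chunk"])
```

The second call shows that the column types are only labels: whichever column is labelled 'chunk' is what chunk-based accessors will see.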

Example:

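from nltk.corpus.reader import ConllCorpusReader

# eng.train has four columns: word, POS tag, chunk tag, NE tag.
# Labelling the third column 'ignore' and the fourth 'chunk' makes
# iob_sents() return (word, POS tag, NE tag) triples.
train = ConllCorpusReader('CoNLL-2003', 'eng.train', ['words', 'pos', 'ignore', 'chunk'])
test = ConllCorpusReader('CoNLL-2003', 'eng.testa', ['words', 'pos', 'ignore', 'chunk'])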
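Once the sentences are available as (word, POS tag, NE tag) triples, which is the shape iob_sents() returns, they still have to be turned into per-token features and labels before training. The following is a rough sketch of that step, not a tuned feature set: `word2features`, `sent2features`, `sent2labels` and the feature names are illustrative assumptions, and the sample sentence is hand-written so the block runs without the corpus.

```python
def word2features(sent, i):
    """Build a feature dict for token i of a list of (word, pos, tag) triples."""
    word, pos, _ = sent[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "pos": pos,
    }
    if i > 0:
        # Context from the previous token.
        prev_word, prev_pos, _ = sent[i - 1]
        features["-1:word.lower"] = prev_word.lower()
        features["-1:pos"] = prev_pos
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        # Context from the next token.
        next_word, next_pos, _ = sent[i + 1]
        features["+1:word.lower"] = next_word.lower()
        features["+1:pos"] = next_pos
    else:
        features["EOS"] = True  # end of sentence
    return features


def sent2features(sent):
    """One feature dict per token."""
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    """One label per token: the third element of each triple."""
    return [tag for _, _, tag in sent]


# Hand-written stand-in for one sentence from train.iob_sents().
sample_sent = [
    ("EU", "NNP", "I-ORG"),
    ("rejects", "VBZ", "O"),
    ("German", "JJ", "I-MISC"),
    ("call", "NN", "O"),
]

X = sent2features(sample_sent)
y = sent2labels(sample_sent)
```

With python-crfsuite installed, each (X, y) pair could then be handed to pycrfsuite.Trainer().append(X, y) before calling train() with a model file name. The labels are just the strings found in the tag column, so a new entity type amounts to a new tag in the training data, and a multi-word entity is a run of consecutive tokens carrying that entity's tag.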
