python - NLTK: saving a trained Brill model

Question

I am training a Brill POS tagger, using the py-crfsuite tagger provided in NLTK as the baseline. However, when I try to save the trained model, I get the following error:

import pickle
import nltk
import yaml
from nltk.tag import CRFTagger

# train_sents: a list of tagged sentences, e.g. [[('the', 'DT'), ...], ...]
crf_tagger = CRFTagger()
crf_tagger.train(train_sents, 'model_trained.crf.tagger')
templates = nltk.tag.brill.nltkdemo18()
trainer = nltk.tag.brill_trainer.BrillTaggerTrainer(crf_tagger, templates)
bt = trainer.train(train_sents, max_rules=10)

# saving with yaml fails
with open('trained_brill_tagger.yaml', 'w') as file_writing:
    yaml.dump(bt, file_writing)

# even pickle fails
with open('trained_brills.pickle', 'wb') as file_w:
    pickle.dump(bt, file_w)

File "stringsource", line 2, in pycrfsuite._pycrfsuite.Tagger.__reduce_cython__
TypeError: self.c_tagger cannot be converted to a Python object for pickling

I have tried pickle, dill, and yaml, but the error persists. Is there any way to solve this? Is it because I am using a CRF tagger as the baseline? Thank you.

score 2 · Accepted Answer

I realized that the problem lies in the CRFTagger module. If I use a different initial tagger for Brill, no error is raised and the model is saved.

trainer = nltk.tag.brill_trainer.BrillTaggerTrainer(baseline_tagger, templates)

When baseline_tagger is a CRFTagger() object, I cannot save the trained model. For some reason, using something like an NgramTagger solves the problem.
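If the CRF baseline has to stay, one possible workaround (a sketch, not something tested against the question's setup) is to pickle only the learned rules via bt.rules() and rebuild a BrillTagger around a freshly created baseline later. The tiny corpus and the make_baseline helper below are hypothetical, and a DefaultTagger stands in for the baseline so that the snippet runs without python-crfsuite. With a CRF baseline, the assumption is that make_baseline would instead create CRFTagger() and call set_model_file('model_trained.crf.tagger').

```python
import pickle

from nltk.tag import DefaultTagger
from nltk.tag.brill import BrillTagger, brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

# Tiny hand-made corpus, purely illustrative (not from the question).
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("bird", "NN"), ("sings", "VBZ")],
]


def make_baseline():
    """Hypothetical helper: recreate the baseline tagger from scratch.

    Assumption: with a CRF baseline this would create CRFTagger() and
    load the saved model file through set_model_file(...).
    """
    return DefaultTagger("NN")


trainer = BrillTaggerTrainer(make_baseline(), brill24(), trace=0)
bt = trainer.train(train_sents, max_rules=10)

# Persist only the learned transformation rules, not the baseline.
rules_blob = pickle.dumps(bt.rules())

# Later: rebuild the tagger from a fresh baseline plus the stored rules.
restored = BrillTagger(make_baseline(), pickle.loads(rules_blob))

sentence = ["the", "dog", "sleeps"]
print(restored.tag(sentence))
```

The baseline model and the Brill rules then live in two separate files, so both have to be kept together to restore the tagger.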

score 1 · Accepted Answer

Here is an example of how to set up a nltk.tag.brill_trainer.BrillTaggerTrainer in NLTK v3.2.5:

from nltk.corpus import treebank

from nltk.tag import BrillTaggerTrainer, UnigramTagger
from nltk.tbl.demo import REGEXP_TAGGER, _demo_prepare_data
from nltk.tag.brill import brill24

baseline_backoff_tagger = REGEXP_TAGGER
templates = brill24()
tagged_data = treebank.tagged_sents()
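train = 0.8  # fraction of the data used for training
trace = 3  # verbosity of the trainer's output
num_sents = 1000  # number of sentences to use
randomize = False  # whether to shuffle the data before splitting
separate_baseline_data = False  # whether to hold out part of the training data for the baseline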

(training_data, baseline_data, gold_data, testing_data) = \
   _demo_prepare_data(tagged_data, train, num_sents, randomize, separate_baseline_data)

baseline_tagger = UnigramTagger(baseline_data, backoff=baseline_backoff_tagger)

# create the Brill tagger trainer (calling trainer.train(...) yields the tagger)
trainer = BrillTaggerTrainer(baseline_tagger, templates, trace, ruleformat="str")

Then, to save the trainer, simply pickle it:

import pickle
with open('brill-demo.pkl', 'wb') as fout:
    pickle.dump(trainer, fout)
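For tagging later on, it is usually the trained tagger returned by trainer.train(...) that needs saving, not the trainer itself. A minimal sketch of that round trip follows. It assumes a small hand-made corpus instead of the treebank data so that it runs without any corpus download, and the file name brill-tagger.pkl is made up for illustration.

```python
import os
import pickle
import tempfile

from nltk.tag import DefaultTagger, UnigramTagger
from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

# Tiny hand-made corpus, purely illustrative (stands in for training_data).
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

# A picklable baseline: unigram tagger backed off to a default tag.
baseline_tagger = UnigramTagger(train_sents, backoff=DefaultTagger("NN"))

trainer = BrillTaggerTrainer(baseline_tagger, brill24(), trace=0)
brill_tagger = trainer.train(train_sents, max_rules=10)

# Save the trained tagger (baseline plus learned rules) to disk.
model_path = os.path.join(tempfile.mkdtemp(), "brill-tagger.pkl")
with open(model_path, "wb") as fout:
    pickle.dump(brill_tagger, fout)

# Load it back and tag a sentence.
with open(model_path, "rb") as fin:
    loaded_tagger = pickle.load(fin)

sentence = ["the", "dog", "sleeps"]
print(loaded_tagger.tag(sentence))
```

Only load pickle files from a trusted source, since unpickling can execute arbitrary code.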
