我需要在 Opennlp 中训练 Chunker,将训练数据分类为名词短语。我该如何进行?在线文档没有解释如何在没有命令行的情况下执行此操作,并包含在程序中。它说使用 en-chunker.train,但你如何制作该文件?
编辑:@Alaye 运行您在答案中提供的代码后,我收到以下无法修复的错误:
Indexing events using cutoff of 5
Computing event counts... done. 3 events
Dropped event B-NP:[w_2=bos, w_1=bos, w0=He, w1=reckons, w2=., w_1=bosw0=He, w0=Hew1=reckons, t_2=bos, t_1=bos, t0=PRP, t1=VBZ, t2=., t_2=bost_1=bos, t_1=bost0=PRP, t0=PRPt1=VBZ, t1=VBZt2=., t_2=bost_1=bost0=PRP, t_1=bost0=PRPt1=VBZ, t0=PRPt1=VBZt2=., p_2=bos, p_1=bos, p_2=bosp_1=bos, p_1=bost_2=bos, p_1=bost_1=bos, p_1=bost0=PRP, p_1=bost1=VBZ, p_1=bost2=., p_1=bost_2=bost_1=bos, p_1=bost_1=bost0=PRP, p_1=bost0=PRPt1=VBZ, p_1=bost1=VBZt2=., p_1=bost_2=bost_1=bost0=PRP, p_1=bost_1=bost0=PRPt1=VBZ, p_1=bost0=PRPt1=VBZt2=., p_1=bosw_2=bos, p_1=bosw_1=bos, p_1=bosw0=He, p_1=bosw1=reckons, p_1=bosw2=., p_1=bosw_1=bosw0=He, p_1=bosw0=Hew1=reckons]
Dropped event B-VP:[w_2=bos, w_1=He, w0=reckons, w1=., w2=eos, w_1=Hew0=reckons, w0=reckonsw1=., t_2=bos, t_1=PRP, t0=VBZ, t1=., t2=eos, t_2=bost_1=PRP, t_1=PRPt0=VBZ, t0=VBZt1=., t1=.t2=eos, t_2=bost_1=PRPt0=VBZ, t_1=PRPt0=VBZt1=., t0=VBZt1=.t2=eos, p_2=bos, p_1=B-NP, p_2=bosp_1=B-NP, p_1=B-NPt_2=bos, p_1=B-NPt_1=PRP, p_1=B-NPt0=VBZ, p_1=B-NPt1=., p_1=B-NPt2=eos, p_1=B-NPt_2=bost_1=PRP, p_1=B-NPt_1=PRPt0=VBZ, p_1=B-NPt0=VBZt1=., p_1=B-NPt1=.t2=eos, p_1=B-NPt_2=bost_1=PRPt0=VBZ, p_1=B-NPt_1=PRPt0=VBZt1=., p_1=B-NPt0=VBZt1=.t2=eos, p_1=B-NPw_2=bos, p_1=B-NPw_1=He, p_1=B-NPw0=reckons, p_1=B-NPw1=., p_1=B-NPw2=eos, p_1=B-NPw_1=Hew0=reckons, p_1=B-NPw0=reckonsw1=.]
Dropped event O:[w_2=He, w_1=reckons, w0=., w1=eos, w2=eos, w_1=reckonsw0=., w0=.w1=eos, t_2=PRP, t_1=VBZ, t0=., t1=eos, t2=eos, t_2=PRPt_1=VBZ, t_1=VBZt0=., t0=.t1=eos, t1=eost2=eos, t_2=PRPt_1=VBZt0=., t_1=VBZt0=.t1=eos, t0=.t1=eost2=eos, p_2B-NP, p_1=B-VP, p_2B-NPp_1=B-VP, p_1=B-VPt_2=PRP, p_1=B-VPt_1=VBZ, p_1=B-VPt0=., p_1=B-VPt1=eos, p_1=B-VPt2=eos, p_1=B-VPt_2=PRPt_1=VBZ, p_1=B-VPt_1=VBZt0=., p_1=B-VPt0=.t1=eos, p_1=B-VPt1=eost2=eos, p_1=B-VPt_2=PRPt_1=VBZt0=., p_1=B-VPt_1=VBZt0=.t1=eos, p_1=B-VPt0=.t1=eost2=eos, p_1=B-VPw_2=He, p_1=B-VPw_1=reckons, p_1=B-VPw0=., p_1=B-VPw1=eos, p_1=B-VPw2=eos, p_1=B-VPw_1=reckonsw0=., p_1=B-VPw0=.w1=eos]
Indexing... done.
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at opennlp.tools.ml.model.AbstractDataIndexer.sortAndMerge(AbstractDataIndexer.java:89)
at opennlp.tools.ml.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:105)
at opennlp.tools.ml.AbstractEventTrainer.getDataIndexer(AbstractEventTrainer.java:74)
at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:91)
at opennlp.tools.ml.model.TrainUtil.train(TrainUtil.java:53)
at opennlp.tools.chunker.ChunkerME.train(ChunkerME.java:253)
at com.oracle.crm.nlp.CustomChunker2.main(CustomChunker2.java:91)
Sorting and merging events... Process exited with exit code 1.
(我的 en-chunker.train 只有你的样本数据集的前 2 行和最后一行。)你能告诉我为什么会发生这种情况以及如何解决它吗?
EDIT2:我让 Chunker 工作,但是当我将训练集中的句子更改为您在答案中给出的句子之外的任何句子时,它会出错。你能告诉我为什么会这样吗?