java - OpenNLP: changing the EOS characters

Question

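I want to change the end-of-sentence delimiters in my OpenNLP SentenceDetectorME. I am using opennlp 1.5.3. Since the stock version only detects sentences separated by '.', my goal is to add other sentence delimiters such as ';', '!' and '?' by passing a char array eos[] to SentenceDetectorFactory. I read that you have to use the .train method of SentenceDetectorME, but I don't understand how, since it is static and needs a training model. Any suggestions?

My code: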

import java.io.*;
import opennlp.tools.sentdetect.*;

public class SenTest {

public static void main(String[] args) throws IOException {

    String paragraph = "12oz bottle poured into a tulip. Pleasing aromas of citrus rind, lemongrass, peaches, and toasted caramel are picked up from the start. After it settles a bit, more of a fresh baked bread crust and tangerine comes through, and even later, the bread crust turns more towards a blackened pizza crust. It pours a slightly hazy copper-orange color with a creamy white head that retains well; it leaves a thick puffy ring with a creamy island and a decent, messy lace along the glass. Great balance between medium high levels of sweet and bitter. The texture is creamy on the palate with a body towards the higher end of medium. The carbonation is a touch effervescent or fizzy, but overall, soft. There’s a very pronounced grapefruit tartness up front, but it mellows quickly after the first few sips. It finishes with a zesty combination of lemongrass, caramel, and stonefruit. The aftertaste is primarily sweet, overripe tangerines and it’s peel with a tart grapefruit bitter lingering in the mouth. Overall very refreshing, straddles the line between IPA and APA.";
    char eos[] = {';', '.', '!', '?' };
    int counter = 0;
    // always start with a model, a model is learned from training data

    InputStream is = new FileInputStream( System.getProperty( "user.dir" ) + "/lib/en-sent.bin" );
    SentenceModel model = new SentenceModel( is );
    SentenceDetectorME sdetector = new SentenceDetectorME( model );


    String sentences[] = sdetector.sentDetect( paragraph );

    for ( String s : sentences ) {

        counter++;
        System.out.println( "Frase numero " + counter + ": " + s );
    }
    is.close();
}

}

score 0 · Accepted Answer

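I think you have misunderstood how training works.

You will need to supply a large number of sentences/paragraphs that contain the characters you want detected (!, ; and so on). This is because OpenNLP looks at features within the sentence to decide whether a character marks a real sentence split or is just punctuation inserted for some other reason.

For example: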

Helen is thirty ;;years;; old; she is actually still young!

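In this line, ;;years;; is just some markup and should not be detected as a sentence split. (How often ;; occurs in the training data, and in what context, determines whether it is treated as a sentence split.)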

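In your example you could also just use string.split() and split on your input eos characters, but that means you would also split the sentence above on ;;. The same goes for a regex pattern.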
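To make the trade-off concrete, here is a minimal sketch of the plain split approach using only the Java standard library. The class name EosSplitter and the method splitOnEos are hypothetical names chosen for illustration, and the sketch assumes the eos characters are punctuation (each one is backslash-escaped inside a regex character class). It is not part of OpenNLP and involves no model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not an OpenNLP class.
class EosSplitter {

    // Splits text on every run of the given end-of-sentence characters.
    // Assumes each eos character is punctuation, so a backslash escape is a literal.
    static List<String> splitOnEos(String text, char[] eos) {
        // Build a character class such as [\;\.\!\?]+
        StringBuilder regex = new StringBuilder("[");
        for (char c : eos) {
            regex.append('\\').append(c);
        }
        regex.append("]+");

        List<String> sentences = new ArrayList<String>();
        for (String piece : text.split(regex.toString())) {
            String trimmed = piece.trim();
            // Skip empty fragments, e.g. when the text starts with a delimiter.
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        char[] eos = {';', '.', '!', '?'};
        String line = "Helen is thirty ;;years;; old; she is actually still young!";
        for (String s : splitOnEos(line, eos)) {
            System.out.println(s);
        }
    }
}
```

Run on the sample line, this breaks the text into four fragments, with "years" on its own, because the ;; markup is treated exactly like a real sentence boundary. A trained model can learn from context that such markup is not a boundary; a plain split cannot.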
