java - Stanford POS Tagger: How to preserve newlines in the output?

Question

My input.txt file contains the following sample text:

you have to let's
come and see me.

Now if I invoke the Stanford POS tagger with the default command:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -textFile input.txt > output.txt

I get the following in my output.txt file:

you_PRP have_VBP to_TO let_VB 's_POS come_VB and_CC see_VB me_PRP ._.

The problem with the above output is that I have lost my original newline delimiter used in the input file.

To preserve the newline sentence delimiter in the output file, I can use the following command, but it forces me to set the -tokenize option to false:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -sentenceDelimiter newline -tokenize false -textFile input.txt > output.txt

The problem with this command is that it completely messes up the output:

you_PRP have_VBP to_TO let's_NNS  
come_VB and_CC see_VB me._NN

Here let's and me. are not split into separate tokens, so they are tagged incorrectly.

My question is how can I preserve the newline delimiters in the output file without messing up the tokenization?

score 1 · Accepted Answer

The answer should be to use the command:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -sentenceDelimiter newline -textFile input.txt > output.txt

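However, there is a bug in version 3.1.3 (and possibly in all earlier versions), so this does not work there: the newlines are ignored. It will work in version 3.1.4 and later.

In the meantime, if the amount of text is small, you could try the Stanford Parser instead. The corresponding flag has a different name there: -sentences newline.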
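
If you are stuck on an affected version, another option is to tag the file one line at a time from Java, so that each output line corresponds to exactly one input line. A minimal sketch follows. The class name LineByLineTagger and the tagLine parameter are made up for illustration; tagLine stands in for whatever call actually tags one line of text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Tags text one input line at a time so the original line breaks survive. */
class LineByLineTagger {

    /**
     * Applies tagLine to every non-blank line and keeps blank lines as empty
     * strings. tagLine is a placeholder for the real tagger call (assumption).
     */
    static List<String> tagLines(List<String> lines, UnaryOperator<String> tagLine) {
        List<String> tagged = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                tagged.add("");                            // preserve blank lines
            } else {
                tagged.add(tagLine.apply(line).trim());    // one output line per input line
            }
        }
        return tagged;
    }
}
```

With the tagger jar on the classpath, you could pass something like new MaxentTagger("models/wsj-0-18-bidirectional-distsim.tagger")::tagString as tagLine, assuming your release exposes a tagString(String) method; check the Javadoc for your version. The tagger still tokenizes each line itself, so let's and me. should be split as in the default output.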

score 0 · Accepted Answer

One thing you can do is use XML input instead of plain text. In that case, your input would be:

<?xml version="1.0" encoding="UTF-8"?>
<text>
    <line>you have to let's</line>
    <line>come and see me.</line>
</text>

Here, each line is enclosed in a line tag. You can now issue the following command:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -xmlInput line -textFile sample.xml > output.xml

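Note that the -xmlInput argument specifies which tag's content should be POS-tagged. In our case, that tag is line. When you run the command above, the output will be: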

<?xml version="1.0" encoding="UTF-8"?>
<text>
    <line>
        you_PRP have_VBP to_TO let_VB &apos;s_POS 
    </line>
    <line>
        come_VB and_CC see_VB me_PRP ._. 
    </line>
</text>

You can therefore separate your lines by reading the content enclosed in the line tags.
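
Both directions of this round trip can be scripted. The sketch below uses only the JDK: toXml wraps plain-text lines in line tags, and fromXml reads the tagged lines back out. The class and method names (LineXml, toXml, fromXml, escape) are invented for this example, and it assumes the tagger's XML output is well-formed, as in the sample above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Converts between plain-text lines and the <text><line>...</line></text> layout. */
class LineXml {

    /** Wraps each input line in a <line> element, escaping XML special characters. */
    static String toXml(List<String> lines) {
        StringBuilder sb = new StringBuilder(
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<text>\n");
        for (String line : lines) {
            sb.append("    <line>").append(escape(line)).append("</line>\n");
        }
        return sb.append("</text>\n").toString();
    }

    /** Escapes the characters that are not allowed verbatim in element text. */
    static String escape(String s) {
        // '&' must be replaced first so the other entities are not double-escaped.
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Returns the trimmed text of every <line> element, in document order. */
    static List<String> fromXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("line");
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent resolves entities such as &apos; back to characters.
                lines.add(nodes.item(i).getTextContent().trim());
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Writing the result of fromXml one entry per line gives a tagged file with the same line breaks as the original input.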
