My input.txt file contains the following sample text:

you have to let's
come and see me.

Now if I invoke the Stanford POS tagger with the default command:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -textFile input.txt > output.txt

I get the following in my output.txt file:

you_PRP have_VBP to_TO let_VB 's_POS come_VB and_CC see_VB me_PRP ._.

The problem with the above output is that I have lost my original newline delimiter used in the input file.

To keep the newline as the sentence delimiter in the output file, I have to set the -tokenize option to false, as in the following command:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -sentenceDelimiter newline -tokenize false -textFile input.txt > output.txt 

The problem with this command is that it completely breaks the tokenization in the output:

you_PRP have_VBP to_TO let's_NNS  
come_VB and_CC see_VB me._NN

Here, let's and me. are each kept as a single token and tagged incorrectly.
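
To illustrate the symptom: the output above is consistent with the input being split on whitespace only when -tokenize is false, so clitics and sentence-final punctuation stay attached to the preceding word. The Python sketch below mimics that with plain str.split(). It is an illustration under that assumption, not the tagger's actual code, and the function name whitespace_tokens is made up for this example.

```python
# Illustration only: whitespace splitting, which is what the output above
# suggests happens when -tokenize is false. Not the tagger's real code.
lines = ["you have to let's", "come and see me."]


def whitespace_tokens(line):
    """Split on whitespace only; punctuation and clitics stay attached."""
    return line.split()


for line in lines:
    print(whitespace_tokens(line))
```

With this kind of splitting, "let's" and "me." each reach the tagger as one unknown word, which would explain the NNS and NN tags.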

My question is: how can I preserve the newline delimiters in the output file without breaking the tokenization?

2 Answers

The answer should be to use this command:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -sentenceDelimiter newline -textFile input.txt > output.txt 

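However, version 3.1.3 (and possibly all earlier versions) has a bug, so this does not work there: the newlines are ignored. It will work in version 3.1.4 and later.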

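In the meantime, if the amount of text is small, you could try the Stanford Parser (where the corresponding flag has a different name, so there it is -sentences newline).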

answered 2012-09-15T00:16:41.473

One thing you can do is use XML input instead of plain text. In that case, your input would be:

<?xml version="1.0" encoding="UTF-8"?>
<text>
    <line>you have to let's</line>
    <line>come and see me.</line>
</text>
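
If the plain-text file is large, writing this XML by hand is impractical. The following Python sketch shows one way to generate it, assuming each non-empty line of the text file should become one line element. The function names lines_to_xml and convert_file are made up for this example. Special characters such as & and < are escaped so that the result stays well-formed.

```python
from xml.sax.saxutils import escape


def lines_to_xml(lines, tag="line"):
    """Wrap each non-empty line in <tag>...</tag> inside a <text> root."""
    parts = ['<?xml version="1.0" encoding="UTF-8"?>', "<text>"]
    for line in lines:
        line = line.strip()
        if line:  # blank lines are skipped in this sketch
            parts.append(f"    <{tag}>{escape(line)}</{tag}>")
    parts.append("</text>")
    return "\n".join(parts) + "\n"


def convert_file(src_path, dst_path, tag="line"):
    """Read a plain-text file and write the XML version next to it."""
    with open(src_path, encoding="utf-8") as src:
        xml_text = lines_to_xml(src, tag)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(xml_text)
```

For example, convert_file("input.txt", "sample.xml") would produce a file like the one shown above. Because blank lines are skipped, line numbers in the XML will not match the original file if it contains empty lines.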

Here, each line is enclosed in a tag. You can now issue the following command:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -xmlInput line -textFile sample.xml > output.xml

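Note that the -xmlInput argument specifies the tag whose content should be POS-tagged. In our case, that tag is line. When you run the above command, the output will be: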

<?xml version="1.0" encoding="UTF-8"?>
<text>
    <line>
        you_PRP have_VBP to_TO let_VB &apos;s_POS 
    </line>
    <line>
        come_VB and_CC see_VB me_PRP ._. 
    </line>
</text>

So you can recover your separate lines by reading the content enclosed in the line tags.
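
As a rough sketch of that last step, the Python snippet below reads the tagged XML and returns one list of (word, tag) pairs per line element. It assumes the output has the shape shown above and that every token has the form word_TAG. The function name tagged_lines is made up for this example.

```python
import xml.etree.ElementTree as ET

SAMPLE_OUTPUT = """<?xml version="1.0" encoding="UTF-8"?>
<text>
    <line>
        you_PRP have_VBP to_TO let_VB &apos;s_POS
    </line>
    <line>
        come_VB and_CC see_VB me_PRP ._.
    </line>
</text>
"""


def tagged_lines(xml_text, tag="line"):
    """Return one list of (word, pos_tag) pairs per <tag> element."""
    root = ET.fromstring(xml_text)
    result = []
    for element in root.iter(tag):
        tokens = (element.text or "").split()
        # Split on the last underscore so words containing "_" survive.
        result.append([tuple(tok.rsplit("_", 1)) for tok in tokens])
    return result


for pairs in tagged_lines(SAMPLE_OUTPUT):
    print(" ".join(f"{word}_{pos}" for word, pos in pairs))
```

The XML parser decodes entities such as &apos; automatically, so the clitic comes back as 's. Printing one line element per output line restores the original newline-delimited layout.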

answered 2012-08-27T17:42:56.953