My input.txt file contains the following sample text:
you have to let's
come and see me.
Now if I invoke the Stanford POS tagger with the default command:
java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -textFile input.txt > output.txt
I get the following in my output.txt file:
you_PRP have_VBP to_TO let_VB 's_POS come_VB and_CC see_VB me_PRP ._.
The problem with this output is that the newline delimiters from my input file are lost: both lines end up in a single line of output.
To preserve the newline as the sentence delimiter in the output file, I have to set the -tokenize option to false:
java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -sentenceDelimiter newline -tokenize false -textFile input.txt > output.txt
The problem with this command is that it breaks the tokenization in the output:
you_PRP have_VBP to_TO let's_NNS
come_VB and_CC see_VB me._NN
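Here let's and me. are each kept as a single token and tagged incorrectly (NNS and NN), whereas the default command split them into let / 's and me / . before tagging.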
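My guess (an assumption on my part, not something I have checked in the tagger source) is that with -tokenize false the tagger treats the input as already tokenized and splits each line on whitespace only. A minimal Python sketch of that behaviour, using a hypothetical helper name of my own, reproduces exactly the tokens seen in the second output:

```python
def whitespace_tokens_per_line(text):
    """Split each non-empty line on whitespace only, one token list per line.

    Hypothetical helper: it only mimics what I assume -tokenize false does,
    it is not part of the Stanford tagger.
    """
    return [line.split() for line in text.splitlines() if line.strip()]


sample = "you have to let's\ncome and see me.\n"

for tokens in whitespace_tokens_per_line(sample):
    # Punctuation and clitics stay glued to the preceding word.
    print(tokens)
```

This keeps one sentence per line, but let's and me. are never split, which would explain the wrong tags.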
My question is: how can I preserve the newline delimiters in the output file without breaking the tokenization?