我正在使用斯坦福分词器。但我有一个问题。
我输入命令:
$ C:\Users\toshiba\workspace\SegDemo\stanford-segmenter-2013-06-20>java -cp seg.jar;stanford-segmenter-3.2.0-javadoc.jar;stanford-segmenter-3.2.0-sources.jar -mx1g edu.stanford.nlp.international.arabic.process.ArabicSegmenter -loadClassifier data/arabic-segmenter-atbtrain.ser.gz -textFile phrase.txt > phrase.txt.segmented
我有以下过程:
Loaded ArabicTokenizer with options: null
loadClassifier=data/arabic-segmenter-atbtrain.ser.gz
textFile=phrase.txt
featureFactory=edu.stanford.nlp.international.arabic.process.ArabicSegmenterFeat
ureFactory
loadClassifier=data/arabic-segmenter-atbtrain.ser.gz
textFile=phrase.txt
featureFactory=edu.stanford.nlp.international.arabic.process.ArabicSegmenterFeat
ureFactory
Loading classifier from C:\Users\toshiba\workspace\SegDemo\stanford-segmenter-20
13-06-20\data\arabic-segmenter-atbtrain.ser.gz ... done [1,2 sec].
Untokenizable: ?
Done! Processed input text at 475,13 input characters/second
我不明白“ Untokenizale:? ”
在分词处理之前是否应该音译句子?