nlp - Can't feed a tab-delimited file into the Stanford Classifier

Question

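I'm having trouble feeding a tab-delimited file into the Stanford Classifier.

Although I was able to work through all of the included Stanford tutorials successfully, including the newsgroup tutorial, my own training and test data fail to load properly when I try to use them.

At first I thought the problem was that I had used an Excel spreadsheet to save the data to a tab-delimited file, and that this was some kind of encoding issue.

But I got exactly the same results when I did the following. First, I typed the demo data below verbatim into gedit, making sure to use a single tab between the politics/sports class and the text that follows: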


politics    Obama today announced a new immigration policy.
sports  The NBA all-star game was last weekend. 
politics    Both parties are eyeing the next midterm elections.
politics    Congress votes tomorrow on electoral reforms.
sports  The Lakers lost again last night, 102-100.
politics    The Supreme Court will rule on gay marriage this spring.
sports  The Red Sox report to spring training in two weeks.
sports  Messi set a world record for goals in a calendar year in 2012.
politics    The Senate will vote on a new budget proposal next week.
politics    The President declared on Friday that he will veto any budget that doesn't include revenue increases.

I saved it as myproject/demo-train.txt, and saved a similar file as myproject/demo-test.txt.
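
One way to rule out editor or Excel quirks is to generate the file from a script, so that the separator is guaranteed to be a single tab character. A minimal Python sketch, assuming the two-column layout shown above (class first, then text); the function name `write_tsv` is only illustrative and not part of any Stanford tool:

```python
def write_tsv(path, rows):
    """Write (label, text) pairs as one tab-separated example per line."""
    lines = []
    for label, text in rows:
        # A tab or newline inside the text would create extra columns or rows.
        if "\t" in text or "\n" in text:
            raise ValueError(f"text contains a tab or newline: {text!r}")
        lines.append(f"{label}\t{text.strip()}\n")
    # newline="\n" keeps Unix line endings on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.writelines(lines)
```

Calling `write_tsv("myproject/demo-train.txt", rows)` with a list of `(label, text)` tuples would produce a file in which each line contains exactly one tab.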

I then ran the following command:

java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier 
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt

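The good news: this actually ran without throwing any errors.

The bad news: since it doesn't extract any features, it can't actually estimate a real model, and the probability for every item defaults to 1/n, where n is the number of classes.

So I ran the same command again, but with two basic options specified: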

java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier 
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -2.useSplitWords =2.splitWordsRegexp "\s+"

The result was:

Exception in thread "main" java.lang.RuntimeException: Training dataset could not be processed
    at edu.stanford.nlp.classify.ColumnDataClassifier.readDataset(ColumnDataClassifier.java:402)
    at edu.stanford.nlp.classify.ColumnDataClassifier.readTrainingExamples(ColumnDataClassifier.java:317)
    at edu.stanford.nlp.classify.ColumnDataClassifier.trainClassifier(ColumnDataClassifier.java:1652)
    at edu.stanford.nlp.classify.ColumnDataClassifier.main(ColumnDataClassifier.java:1628)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
    at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatum(ColumnDataClassifier.java:670)
    at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatumFromLine(ColumnDataClassifier.java:267)
    at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatum(ColumnDataClassifier.java:396)
    ... 3 more

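These are exactly the same results I get when I use my real data saved from Excel.

What's more, I don't know how to make sense of the ArrayIndexOutOfBoundsException. When I use readline in Python to print the raw strings from both the demo file I created and the tutorial files that work, the formatting looks no different. So I don't know why this exception is raised with one set of files but not the other.

Finally, one other quirk. At one point I thought line breaks might be the problem. So I removed all line breaks from the demo file while keeping the tabs, and ran the same command: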

java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier 
-trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -2.useSplitWords =2.splitWordsRegexp "\s+"

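Surprisingly, this time no Java exception was thrown. But again, the result is worthless: it treats the whole file as a single observation, so it can't fit a model properly.

I've now spent 8 hours on this and have exhausted everything I can think of. I'm new to Java, but I don't think that should be an issue: according to Stanford's API documentation for ColumnDataClassifier, all it needs is a tab-delimited file.

Any help would be greatly appreciated.

One last note: I ran these same commands with the same files on both Windows and Ubuntu, and the results were the same on each.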

score 2 · Accepted Answer

Use a properties file. In the example that ships with the Stanford Classifier:

trainFile=20news-bydate-devtrain-stanford-classifier.txt
testFile=20news-bydate-devtest-stanford-classifier.txt
2.useSplitWords=true
2.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
2.splitWordsIgnoreRegexp=\\s+

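The number 2 at the start of lines 3, 4, and 5 refers to a column in the TSV file. Columns are numbered from 0, and your file has only two columns (the class in column 0 and the text in column 1), so in your case you would use: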

trainFile=myproject/demo-train.txt
testFile=myproject/demo-test.txt
1.useSplitWords=true
1.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
1.splitWordsIgnoreRegexp=\\s+
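
To use such a file, save it and pass it to ColumnDataClassifier with `-prop`. The sketch below is only an illustration: it assumes the two-column demo files from the question, uses a hypothetical file name chosen by the caller (for example `myproject/demo.prop`), writes a reduced set of properties, and builds the command without running it. The helper names are made up for this example.

```python
def write_prop_file(path, train_file, test_file, text_column=1):
    """Write a minimal ColumnDataClassifier properties file (illustrative)."""
    lines = [
        f"trainFile={train_file}",
        f"testFile={test_file}",
        # Columns are numbered from 0; column 0 holds the class label here.
        f"{text_column}.useSplitWords=true",
        # In a .properties file a backslash must be doubled, so this
        # writes \\s+ to disk, which Java reads back as the regex \s+.
        f"{text_column}.splitWordsRegexp=\\\\s+",
    ]
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(lines) + "\n")


def build_command(prop_path, jar="stanford-classifier.jar"):
    """Return the argument list for running the classifier with -prop."""
    return [
        "java", "-mx1800m", "-cp", jar,
        "edu.stanford.nlp.classify.ColumnDataClassifier",
        "-prop", prop_path,
    ]
```

The list returned by `build_command` can be passed to `subprocess.run` once the jar path is known to be correct for your setup.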

Or, if you want to run it with command-line arguments:

java -mx1800m -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -trainFile myproject/demo-train.txt -testFile myproject/demo-test.txt -1.useSplitWords -1.splitWordsRegexp "\s+"

score 0 · Accepted Answer

I ran into the same error as you.

Watch out for tabs in the text to be classified.

Caused by: java.lang.ArrayIndexOutOfBoundsException: 2

This means that, after splitting the line on tabs, the classifier at some point expects an array with 3 elements.
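
The same failure can be reproduced outside Java. A small Python illustration, assuming that each line is split on tabs and that columns are indexed from 0 (Python raises `IndexError` where Java raises `ArrayIndexOutOfBoundsException`); `get_column` is a hypothetical helper, not part of the classifier:

```python
def get_column(line, index):
    """Split a line on tabs and return the column at the given index."""
    columns = line.rstrip("\n").split("\t")
    return columns[index]


# A two-column row: class label in column 0, text in column 1.
row = "sports\tThe NBA all-star game was last weekend.\n"

print(get_column(row, 1))  # the text column exists

try:
    get_column(row, 2)  # a two-column row has no column 2
except IndexError:
    print("column 2 does not exist in a two-column row")
```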

I ran a method that counts the number of tabs in each line; any line that doesn't have exactly two tabs is an error.
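
The method itself isn't shown, so here is one possible Python version of such a check. The function name and the `expected_tabs` parameter are assumptions made for illustration. For a three-column file the expected count is 2; for the two-column demo file in the question it would be 1.

```python
def find_bad_lines(path, expected_tabs):
    """Return (line_number, tab_count) for lines with the wrong tab count."""
    bad = []
    # newline="" disables newline translation so lines are read as stored.
    with open(path, encoding="utf-8", newline="") as f:
        for number, line in enumerate(f, start=1):
            count = line.count("\t")
            if count != expected_tabs:
                bad.append((number, count))
    return bad
```

An empty list means every line has the expected number of columns; otherwise the returned line numbers point at the rows to fix.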
