我正在修补 predictioIO 以构建自定义分类引擎。我以前做过这个没有问题。但是对于当前数据集pio train
给了我一个错误tokens must not be empty.
,我编辑了 Datasource.scala 以将数据集中的字段提到引擎。我的数据集中的一行如下
{"event": "ticket", "eventTime": "2015-02-16T05:22:13.477+0000", "entityType": "content","entityId": 365,"properties":{"text": "Request to reset svn credentials","label": "Linux/Admin Task" }}
我可以毫无问题地导入数据和构建引擎。我也得到了一组观察结果。错误粘贴在下面
[INFO] [Remoting] Starting remoting
[INFO] [Remoting] Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.61.44:50713]
[INFO] [Engine$] EngineWorkflow.train
[INFO] [Engine$] DataSource: org.template.textclassification.DataSource@4fb64e14
[INFO] [Engine$] Preparator: org.template.textclassification.Preparator@5c4cc644
[INFO] [Engine$] AlgorithmList: List(org.template.textclassification.NBAlgorithm@62b6c045)
[INFO] [Engine$] Data sanity check is off.
[ERROR] [Executor] Exception in task 0.0 in stage 2.0 (TID 2)
[WARN] [TaskSetManager] Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.IllegalArgumentException: tokens must not be empty
at opennlp.tools.util.StringList.<init>(StringList.java:61)
at org.template.textclassification.PreparedData.org$template$textclassification$PreparedData$$hash(Preparator.scala:71)
at org.template.textclassification.PreparedData$$anonfun$2.apply(Preparator.scala:113)
at org.template.textclassification.PreparedData$$anonfun$2.apply(Preparator.scala:113)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[ERROR] [TaskSetManager] Task 0 in stage 2.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.IllegalArgumentException: tokens must not be empty
at opennlp.tools.util.StringList.<init>(StringList.java:61)
at org.template.textclassification.PreparedData.org$template$textclassification$PreparedData$$hash(Preparator.scala:71)
at org.template.textclassification.PreparedData$$anonfun$2.apply(Preparator.scala:113)
at org.template.textclassification.PreparedData$$anonfun$2.apply(Preparator.scala:113)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
问题出在数据集上。我确实将数据集分成几部分并进行了训练。该数据集的训练已完成,没有报告错误。我如何知道数据集中的哪一行产生错误?如果此功能在 PredictionIO 中应该非常有用。