I'm running into an error that doesn't make much sense to me, and I haven't been able to find enough information online to answer it myself.
I wrote code that produces a list of (String, ArrayBuffer[String]) pairs, and I use HashingTF to turn the feature column into vectors (it's for NLP parsing research and I end up with a huge number of unique features; long story). I then transform the string labels with StringIndexer. When I run ChiSqSelector.fit on the training data, I get a "key not found" error, and the stack trace points to a hash-map lookup on the label inside ChiSqTest. That struck me as odd: I could see some argument that I'm misusing it and somehow not accounting for unseen labels, except that this is the fit method being run on the training data itself.
Anyway, here are the interesting bits of my code, followed by the important parts of the stack trace. Any help would be much appreciated!
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD
import org.apache.spark.ml.feature.{ChiSqSelector, HashingTF, StringIndexer}

val parSdp = sc.parallelize(sdp.take(10)) // it dies on a small amount of data
val insts: RDD[(String, ArrayBuffer[String])] =
  parSdp.flatMap(x => TrainTest.transformGraphSpark(x))

val indexer = new StringIndexer()
  .setInputCol("labels")
  .setOutputCol("labelIndex")

val instDF = sqlContext.createDataFrame(insts)
  .toDF("labels", "feats")

val hash = new HashingTF()
  .setInputCol("feats")
  .setOutputCol("hashedFeats")
  .setNumFeatures(1000000)

val readyDF = hash.transform(indexer
  .fit(instDF)
  .transform(instDF))

val selector = new ChiSqSelector()
  .setNumTopFeatures(100)
  .setFeaturesCol("hashedFeats")
  .setLabelCol("labelIndex")
  .setOutputCol("selectedFeatures")

val Array(training, dev, test) = readyDF.randomSplit(Array(0.8, 0.1, 0.1), seed = 12345)

val chisq = selector.fit(training)
And the stack trace:
java.util.NoSuchElementException: key not found: 23.0
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4$$anonfun$apply$4.apply(ChiSqTest.scala:131)
at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4$$anonfun$apply$4.apply(ChiSqTest.scala:129)
at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4.apply(ChiSqTest.scala:129)
at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4.apply(ChiSqTest.scala:125)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.mllib.stat.test.ChiSqTest$.chiSquaredFeatures(ChiSqTest.scala:125)
at org.apache.spark.mllib.stat.Statistics$.chiSqTest(Statistics.scala:176)
at org.apache.spark.mllib.feature.ChiSqSelector.fit(ChiSqSelector.scala:193)
at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:86)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:89)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:122)
... etc etc
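In case it's useful, here is the kind of sanity check I can run in the same spark-shell session to poke at the "unseen label" theory above. It's just a rough sketch that compares the distinct labelIndex values in the full DataFrame against the ones that ended up in the training split:

// rough sanity check: which label indices (if any) never made it into training?
val allLabels = readyDF.select("labelIndex").distinct()
  .collect().map(_.getDouble(0)).toSet
val trainLabels = training.select("labelIndex").distinct()
  .collect().map(_.getDouble(0)).toSet
println(s"label indices missing from training: ${allLabels -- trainLabels}")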
I also realized that by bumping the size of sdp.take up (to 100), I get a different error:
java.lang.IllegalArgumentException: Chi-squared statistic undefined for input matrix due to0 sum in column [4].
at org.apache.spark.mllib.stat.test.ChiSqTest$.chiSquaredMatrix(ChiSqTest.scala:229)
at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4.apply(ChiSqTest.scala:134)
at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4.apply(ChiSqTest.scala:125)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.mllib.stat.test.ChiSqTest$.chiSquaredFeatures(ChiSqTest.scala:125)
at org.apache.spark.mllib.stat.Statistics$.chiSqTest(Statistics.scala:176)
at org.apache.spark.mllib.feature.ChiSqSelector.fit(ChiSqSelector.scala:193)
at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:86)
at $iwC$$iwC.<init>(<console>:96)
at $iwC.<init>(<console>:130)
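For completeness, here is a sketch of the check I can run on the larger sample (same training val as above): it just counts rows per label index in the training split, to see whether some classes barely show up or are missing entirely:

// count how many rows each label index has in the training split
training.groupBy("labelIndex").count()
  .orderBy("count")
  .show(20)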