TL;DR;

How do I train MLlib on my wiki data (text and category) to predict against tweets?

I can't figure out how to convert my tokenized wiki data so that it can be consumed by either NaiveBayes or LogisticRegression. My goal is to use the trained model for comparison against tweets*. I've tried a pipeline with LR, and HashingTF with IDF for NaiveBayes, but I keep getting wrong predictions. Here is what I've tried:

*Note that I would like to use the many categories in my wiki data as my labels... I have only seen binary classification (it is one category or the other)... is it possible to do what I want?
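To be concrete about the labels: here is a minimal sketch (category names made up) of the multiclass mapping I have in mind, where each category gets its own whole-number label instead of a 0/1 split:

// hypothetical sketch: one whole-number label per category
val categories = Seq("Spark", "Hadoop", "MapReduce")
val labelFor = categories.zipWithIndex.map { case (c, i) => (c, i.toDouble) }.toMap
// labelFor("Spark") == 0.0, labelFor("Hadoop") == 1.0, labelFor("MapReduce") == 2.0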
Pipeline with LR
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, RegexTokenizer}
import org.apache.spark.mllib.linalg.Vector

case class WikiData(category: String, text: String)
case class LabeledData(category: String, text: String, label: Double)

// toy stand-in for the real wiki data
val wikiData = sc.parallelize(List(WikiData("Spark", "this is about spark"), WikiData("Hadoop", "then there is hadoop")))

// map each distinct category to a numeric label (this yields 0.0, 0.001, ...; not sure this is right)
val categoryMap = wikiData.map(_.category).distinct.zipWithIndex.mapValues(_.toDouble / 1000).collectAsMap
val labeledData = wikiData.map(x => LabeledData(x.category, x.text, categoryMap.getOrElse(x.category, 0.0))).toDF

// split on runs of non-word characters
val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setPattern("\\W+")

val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

val model = pipeline.fit(labeledData)
model.transform(labeledData).show
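For context, this is roughly how I expected to score tweets with the fitted pipeline; the Tweet case class and the tweet text here are placeholders for my real data:

// hypothetical: put incoming tweets into a DataFrame with a "text" column,
// then run them through the same fitted pipeline stages
case class Tweet(text: String)
val tweets = sc.parallelize(List(Tweet("spark is great"), Tweet("hadoop map reduce"))).toDF
model.transform(tweets).select("text", "prediction").show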
NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.classification.NaiveBayes

// documentsAsWordSequenceAlready is my tokenized wiki text (an RDD[Seq[String]])
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documentsAsWordSequenceAlready)

tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

// alternative run: ignore terms that appear in fewer than two documents
tf.cache()
val idf = new IDF(minDocFreq = 2).fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

// to create tfidfLabeled (used below) I ran a map to set the labels (see the sketch after this block)...
// but again it seems the labels have to be 1.0 or 0.0?
NaiveBayes.train(tfidfLabeled)
  .predict(hashingTF.transform(tweet)) // tweet: tokenized tweet text, RDD[Seq[String]]
  .collect
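For reference, here is roughly the map I ran to create tfidfLabeled; this is a sketch, and labels stands in for an RDD[Double] holding one label per document in the same order as tf:

import org.apache.spark.mllib.regression.LabeledPoint

// zip each document's label with its TF-IDF vector (the two RDDs must line up one-to-one)
val tfidfLabeled: RDD[LabeledPoint] = labels.zip(tfidf).map { case (label, features) =>
  LabeledPoint(label, features)
}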