I am calculating precision and recall by threshold for LogisticRegressionWithLBFGS using BinaryClassificationMetrics, and that part works. Now I am trying to figure out whether I can get a graphical output of the PR and ROC curves.

My code is below:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}



object log_reg_eval_metric {

  def main(args: Array[String]): Unit = {


    System.setProperty("hadoop.home.dir", "c:\\winutil\\")


    val sc = new SparkContext(new SparkConf().setAppName("SparkTest").setMaster("local[*]"))

    val sqlContext = new org.apache.spark.sql.SQLContext(sc);

    val data: RDD[String] = sc.textFile("C:/Users/user/Documents/spark-1.5.1-bin-hadoop2.4/data/mllib/credit_approval_2_attr.csv")


    val parsedData = data.map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }

    //Splitting the data
    val splits: Array[RDD[LabeledPoint]] = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
    val training: RDD[LabeledPoint] = splits(0).cache()
    val test: RDD[LabeledPoint] = splits(1)



    // Run training algorithm to build the model
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(training)
    // Clear the prediction threshold so the model will return probabilities
    model.clearThreshold

    // Compute raw scores on the test set
    val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
      val prediction = model.predict(features)
      (prediction, label)
    }

    // Instantiate metrics object
    val metrics = new BinaryClassificationMetrics(predictionAndLabels)

    // Precision by threshold
    val precision = metrics.precisionByThreshold
    precision.foreach { case (t, p) =>
      println(s"Threshold: $t, Precision: $p")
    }


    // Precision-Recall Curve
    val PRC = metrics.pr

    print(PRC)



  }
}

Output from print(PRC):

UnionRDD[39] at union at BinaryClassificationMetrics.scala:108

I am not sure what a UnionRDD is or how to use it. Is there any other way to get graphical output? I am still researching this, and any suggestions would be appreciated.

1 Answer

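You can use BinaryLogisticRegressionTrainingSummary from the spark.ml package. It provides the PR and ROC values out of the box as DataFrames.

You can feed these values into any rendering utility to view a particular curve. (Any multi-line chart that takes x and y values will display the curves.)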
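
As a rough illustration of the suggestion above, here is a minimal sketch that fits a spark.ml LogisticRegression and reads the curve points from its summary. It assumes Spark 2.0 or later (for SparkSession and org.apache.spark.ml.linalg), which is newer than the 1.5.1 build in the question. The object name PrRocSketch, the helper prCurve, and the tiny training set are made up for illustration and are not from the original post.

```scala
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object PrRocSketch {

  // Fits a small model and returns the PR curve as (recall, precision) pairs.
  def prCurve(spark: SparkSession): Array[(Double, Double)] = {
    // Made-up, non-separable toy data: (label, features). Replace with real data.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(2.0, 1.5)),
      (1.0, Vectors.dense(1.8, 0.4)),
      (1.0, Vectors.dense(0.3, 1.1)),
      (0.0, Vectors.dense(0.2, 0.1)),
      (0.0, Vectors.dense(1.9, 1.2)),
      (0.0, Vectors.dense(0.5, 0.3))
    )).toDF("label", "features")

    val model = new LogisticRegression()
      .setMaxIter(20)
      .setRegParam(0.01)
      .fit(training)

    // For a binary problem the training summary can be viewed as a binary summary.
    val summary = model.summary.asInstanceOf[BinaryLogisticRegressionSummary]

    // roc is a DataFrame with columns FPR and TPR.
    summary.roc.show()
    println(s"Area under ROC: ${summary.areaUnderROC}")

    // pr is a DataFrame with columns recall and precision.
    // The curve is small, so it is safe to collect it to the driver.
    summary.pr.select("recall", "precision")
      .collect()
      .map(row => (row.getDouble(0), row.getDouble(1)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PrRocSketch")
      .master("local[*]")
      .getOrCreate()

    // Print the points as CSV so they can be pasted into any charting tool.
    println("recall,precision")
    prCurve(spark).foreach { case (r, p) => println(s"$r,$p") }

    spark.stop()
  }
}
```

The printed recall,precision lines can be plotted as x and y in a spreadsheet or any charting library. If you stay with the mllib code from the question, metrics.pr is itself an RDD of (recall, precision) pairs, so calling collect() on it gives the same kind of points.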

Answered 2016-11-22T11:10:39.873