我定义了一个UDF,它将输入值增加一,命名为“inc”,这是我的udf的代码
spark.udf.register("inc", (x: Long) => x + 1)
这是我的测试 sql
val df = spark.sql("select sum(inc(vals)) from data")
df.explain(true)
df.show()
这是那个sql的优化方案
== Optimized Logical Plan ==
Aggregate [sum(inc(vals#4L)) AS sum(inc(vals))#7L]
+- LocalRelation [vals#4L]
我想重写计划,并从“sum”中提取“inc”,就像 python udf 一样。所以,这是我想要的优化计划。
Aggregate [sum(inc_val#6L) AS sum(inc(vals))#7L]
+- Project [inc(vals#4L) AS inc_val#6L]
+- LocalRelation [vals#4L]
我发现源代码文件“ExtractPythonUDFs.scala”提供了与PythonUDF类似的功能,但它插入了一个名为“ArrowEvalPython”的新节点,这是pythonudf的逻辑计划。
== Optimized Logical Plan ==
Aggregate [sum(pythonUDF0#7L) AS sum(inc(vals))#4L]
+- Project [pythonUDF0#7L]
+- ArrowEvalPython [inc(vals#0L)], [pythonUDF0#7L], 200
+- Repartition 10, true
+- RelationV2[vals#0L] parquet file:/tmp/vals.parquet
我要插入的只是一个“项目节点”,我不想定义一个新节点。
这是我项目的测试代码
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.{Expression, NamedExpression, ScalaUDF}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
object RewritePlanTest {
case class UdfRule(spark: SparkSession) extends Rule[LogicalPlan] {
def collectUDFs(e: Expression): Seq[Expression] = e match {
case udf: ScalaUDF => Seq(udf)
case _ => e.children.flatMap(collectUDFs)
}
override def apply(plan: LogicalPlan): LogicalPlan = plan match {
case agg@Aggregate(g, a, _) if (g.isEmpty && a.length == 1) =>
val udfs = agg.expressions.flatMap(collectUDFs)
println("================")
udfs.foreach(println)
val test = udfs(0).isInstanceOf[NamedExpression]
println(s"cast ScalaUDF to NamedExpression = ${test}")
println("================")
agg
case _ => plan
}
}
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.WARN)
val spark = SparkSession
.builder()
.master("local[*]")
.appName("Rewrite plan test")
.withExtensions(e => e.injectOptimizerRule(UdfRule))
.getOrCreate()
val input = Seq(100L, 200L, 300L)
import spark.implicits._
input.toDF("vals").createOrReplaceTempView("data")
spark.udf.register("inc", (x: Long) => x + 1)
val df = spark.sql("select sum(inc(vals)) from data")
df.explain(true)
df.show()
spark.stop()
}
}
我ScalaUDF
从Aggregate
节点中提取,
因为Project
Node 需要的参数是Seq[NamedExpression]
case class Project(projectList: Seq[NamedExpression], child: LogicalPlan)
但它未能投射ScalaUDF
到NamedExpression
,
所以我不知道如何构建Project
节点。
有人可以给我一些建议吗?
谢谢。