The idea comes from this video: https://www.youtube.com/watch?v=BfaBeT0pRe0&t=526s, where they talk about achieving type safety by implementing custom types.
A possible simple implementation is
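import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

// Each column gets a marker trait (type-level tag) and a singleton object (value-level tag).
trait Col[Self] { self: Self => }
trait Id extends Col[Id]
object IdCol extends Id
trait Val extends Col[Val]
object ValCol extends Val
trait Comment extends Col[Comment]
object CommentCol extends Comment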
// Schema is a phantom type that records which columns have been validated.
case class DataSet[Schema >: Nothing](df: DataFrame) {
  // Returns Some only if both named columns are present (case-insensitive match).
  def validate[T1 <: Col[T1], T2 <: Col[T2]](
      col1: (Col[T1], String),
      col2: (Col[T2], String)
  ): Option[DataSet[Schema with T1 with T2]] = {
    val present = df.columns.map(_.toLowerCase).toSet
    if (Seq(col1._2, col2._2).forall(name => present.contains(name.toLowerCase)))
      Some(DataSet[Schema with T1 with T2](df))
    else None
  }
}
object SchemaTypes extends App {
  lazy val spark: SparkSession = SparkSession
    .builder()
    .config(new SparkConf().setAppName(getClass.getName))
    .getOrCreate()

  import spark.implicits._

  val df = Seq(
    (1, "a", "first value"),
    (2, "b", "second value"),
    (3, "c", "third value")
  ).toDF("Id", "Val", "Comment")

  val myData =
    DataSet/*[Id with Val with Comment]*/(df)
      .validate(IdCol -> "Id", ValCol -> "Val")

  myData match {
    case None => throw new java.lang.Exception("Required columns missing")
    case _    =>
  }
}
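The type of myData is Option[DataSet[Nothing with T1 with T2]]. That makes sense, because the constructor is called without any type argument, so Schema is inferred as Nothing; in the video, however, the type they show has the form DataSet[T1 with T2].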
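Of course, changing the call to pass the type explicitly gets rid of Nothing, but spelling out the type argument is redundant, because the types are already present in the argument list.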
val myData =
  DataSet[Id with Val with Comment](df).validate(IdCol -> "Id", ValCol -> "Val")
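
One possible way to avoid both the inferred Nothing and the explicit type argument is to start the schema at Any instead, since Any adds no constraint to an intersection type. The sketch below is Spark-free so it can be run on its own. Frame, TypedFrame, unvalidated, SchemaDemo and requireIdAndVal are hypothetical names introduced only for illustration; they are not from the video or from Spark, and Frame merely stands in for a DataFrame's column list.

```scala
// Marker traits as in the post.
trait Col[Self] { self: Self => }
trait Id extends Col[Id]
object IdCol extends Id
trait Val extends Col[Val]
object ValCol extends Val
trait Comment extends Col[Comment]
object CommentCol extends Comment

// Hypothetical stand-in for a Spark DataFrame: only the column names matter here.
final case class Frame(columns: Seq[String])

// Hypothetical variant of the post's DataSet. The constructor is private, so
// callers must go through `unvalidated`, which pins Schema to Any.
final class TypedFrame[Schema] private (val frame: Frame) {
  // Returns Some only if both named columns are present (case-insensitive).
  def validate[T1 <: Col[T1], T2 <: Col[T2]](
      col1: (Col[T1], String),
      col2: (Col[T2], String)
  ): Option[TypedFrame[Schema with T1 with T2]] = {
    val present = frame.columns.map(_.toLowerCase).toSet
    if (Seq(col1._2, col2._2).forall(n => present.contains(n.toLowerCase)))
      Some(new TypedFrame[Schema with T1 with T2](frame))
    else None
  }
}

object TypedFrame {
  // Nothing has been validated yet, so the schema starts at Any.
  def unvalidated(frame: Frame): TypedFrame[Any] = new TypedFrame[Any](frame)
}

object SchemaDemo {
  // Accepts only frames whose schema type includes both Id and Val.
  def requireIdAndVal[S <: Id with Val](tf: TypedFrame[S]): Seq[String] =
    tf.frame.columns
}
```

With this shape, `TypedFrame.unvalidated(frame).validate(IdCol -> "Id", ValCol -> "Val")` produces a schema of `Any with Id with Val`, which conforms to `Id with Val`, so the result can be passed to `SchemaDemo.requireIdAndVal` without writing any type argument. A similar helper bounded by `Comment` should be rejected by the compiler for that same value, because Comment was never validated.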