
The idea comes from this video: https://www.youtube.com/watch?v=BfaBeT0pRe0&t=526s, where they talk about achieving type safety by implementing custom types.

A possible naive implementation is

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

trait Col[Self] { self: Self =>
}

trait Id extends Col[Id]
object IdCol extends Id

trait Val extends Col[Val]
object ValCol extends Val

trait Comment extends Col[Comment]
object CommentCol extends Comment

case class DataSet[Schema >: Nothing](df: DataFrame) {

  def validate[T1 <: Col[T1], T2 <: Col[T2]](
      col1: (Col[T1], String),
      col2: (Col[T2], String)
  ): Option[DataSet[Schema with T1 with T2]] =
    // Succeed only if both requested columns are present (case-insensitive).
    if (Seq(col1._2, col2._2).forall(
          name => df.columns.exists(_.equalsIgnoreCase(name))
        ))
      Some(DataSet[Schema with T1 with T2](df))
    else None
}

object SchemaTypes extends App {

  lazy val spark: SparkSession = SparkSession
    .builder()
    .config(
      new SparkConf()
        .setAppName(
          getClass()
            .getName()
        )
    )
    .getOrCreate()

  import spark.implicits._

  val df = Seq(
    (1, "a", "first value"),
    (2, "b", "second value"),
    (3, "c", "third value")
  ).toDF("Id", "Val", "Comment")

  val myData =
    DataSet/*[Id with Val with Comment]*/(df)
      .validate(IdCol -> "Id", ValCol -> "Val")

  myData match {
    case None => throw new java.lang.Exception("Required columns missing")
    case _    =>
  }
}

The type of myData is Option[DataSet[Nothing with T1 with T2]]. That makes sense, since the constructor is called without any type parameter, but in the video the type they show is in line with DataSet[T1 with T2].

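Of course, changing the call to pass the type explicitly gets rid of Nothing, but specifying the type parameter is redundant, since the types are already contained in the argument list.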

val myData =
  DataSet[Id with Val with Comment](df).validate(IdCol -> "Id", ValCol -> "Val")

2 Answers


The types Id and Val can be inferred, because IdCol and ValCol are passed to .validate. But the type Comment cannot be inferred. So try

val myData =
  DataSet[Comment](df)
    .validate(IdCol -> "Id", ValCol -> "Val")

println(shapeless.test.showType(SchemaTypes.myData)) 
//Option[App.DataSet[App.Comment with App.Id with App.Val]]

https://scastie.scala-lang.org/yj0HnpkyQfCreKq8ZV4D7A

In fact, if you specify DataSet[Id with Val with Comment](df), the resulting type is Option[DataSet[Id with Val with Comment with Id with Val]], which is equal (=:=) to Option[DataSet[Id with Val with Comment]].
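
The repeated Id and Val components add nothing to the intersection, and this can be checked without Spark. Below is a minimal sketch, assuming the Col, Id, Val and Comment traits from the question; the object name and the Extend alias are hypothetical and only mirror the shape of validate's return type. It asks the compiler for conformance in both directions (<:<), which is a weaker but related check to the =:= statement above:

```scala
object IntersectionSketch {
  trait Col[Self] { self: Self => }
  trait Id extends Col[Id]
  trait Val extends Col[Val]
  trait Comment extends Col[Comment]

  // Hypothetical alias with the same shape as validate's result type:
  // Schema with T1 with T2
  type Extend[Schema, T1, T2] = Schema with T1 with T2

  // The schema written explicitly by the caller
  type Declared = Id with Val with Comment
  // What validate(IdCol -> ..., ValCol -> ...) appends to it:
  // Id with Val with Comment with Id with Val
  type Returned = Extend[Declared, Id, Val]

  // Each line compiles only if the stated conformance holds
  val returnedIsDeclared: Returned <:< Declared = implicitly[Returned <:< Declared]
  val declaredIsReturned: Declared <:< Returned = implicitly[Declared <:< Returned]
}
```

If either direction did not hold, the corresponding implicitly call would be rejected at compile time, so the object compiling at all is the result of interest.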


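OK, I watched the video up to that timecode. I guess the speakers were trying to explain their idea (combining F-bounded polymorphism, T <: Col[T], with intersection types, T with U), and you should not take their slides literally; there may be inaccuracies in them.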

First they show the slide

case class DataSet[Schema](df: DataFrame) {   
  def validate[T <: Col[T]](
    col: (Col[T], String)
  ): Option[DataSet[Schema with T]] = ??? 
}

This code can be illustrated with

val myDF: DataFrame = ???
val myData = DataSet[VideoId](myDF).validate(Country -> "country_code")
myData : Option[DataSet[VideoId with Country]]

Then they show the slide

val myData = DataSet(myDF).validate(
  VideoId -> "video_id",
  Country -> "country_code",
  ProfileId -> "profile_id",
  Score -> "score"
)

myData : DataSet[VideoId with Country with ProfileId with Score]

But this illustrative code does not correspond to the previous slide. For it to compile, you would have to define

// actually we don't use Schema here
case class DataSet[Schema](df: DataFrame) {
  def validate[T1 <: Col[T1], T2 <: Col[T2], T3 <: Col[T3], T4 <: Col[T4]](
    col1: (Col[T1], String),
    col2: (Col[T2], String),
    col3: (Col[T3], String),
    col4: (Col[T4], String),
  ): DataSet[T1 with T2 with T3 with T4] = ???
}

So take it as an idea, not literally.

You can get something similar with

case class DataSet[Schema](df: DataFrame) {
  def validate[T <: Col[T]](
    col: (Col[T], String)
  ): Option[DataSet[Schema with T]] = ???
}

val myDF: DataFrame = ???

val myData = DataSet[Any](myDF).validate(VideoId -> "video_id").flatMap(
  _.validate(Country -> "country_code")
).flatMap(
  _.validate(ProfileId -> "profile_id")
).flatMap(
  _.validate(Score -> "score")
)

myData: Option[DataSet[VideoId with Country with ProfileId with Score]]
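
The chained version above can be tried without a Spark session. The following is a rough sketch only: Frame is a hypothetical stand-in for DataFrame that holds nothing but column names, the column traits and their companion objects are assumed to follow the slide's naming, and the existence check inside validate is an assumption about what the elided ??? body might do.

```scala
object ChainSketch {
  trait Col[Self] { self: Self => }
  trait VideoId extends Col[VideoId]
  object VideoId extends VideoId
  trait Country extends Col[Country]
  object Country extends Country
  trait Score extends Col[Score]
  object Score extends Score

  // Hypothetical stand-in for a Spark DataFrame: just the column names
  final case class Frame(columns: Seq[String])

  final case class DataSet[Schema](df: Frame) {
    // Adds T to the schema only if the named column exists (case-insensitive)
    def validate[T <: Col[T]](col: (Col[T], String)): Option[DataSet[Schema with T]] =
      if (df.columns.exists(_.equalsIgnoreCase(col._2))) Some(DataSet[Schema with T](df))
      else None
  }

  val frame = Frame(Seq("video_id", "country_code"))

  // Both columns exist, so every step of the chain succeeds
  val ok = DataSet[Any](frame)
    .validate(VideoId -> "video_id")
    .flatMap(_.validate(Country -> "country_code"))

  // "score" is absent, so the chain short-circuits to None
  val missing = DataSet[Any](frame)
    .validate(VideoId -> "video_id")
    .flatMap(_.validate(Score -> "score"))
}
```

Starting from DataSet[Any] rather than a bare DataSet(...) keeps Nothing out of the accumulated schema, and each flatMap step adds exactly one column type.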
answered 2020-07-05T21:25:51.900

Dmytro Mitin's answer is good, but I wanted to provide some more information.

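If you write something like DataSet(df).validate(...), the type parameter of DataSet(df) is inferred first. Here it is Nothing, because there is no information that would make it anything else. So Schema is Nothing, and Schema with T1 with T2 (which appears in the return type of validate) is Nothing with Id with Val.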
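
The inference step can be observed in isolation. This is a minimal sketch, assuming a hypothetical Spark-free DataSet that stores only column names; it relies on the class being invariant in Schema, so a type ascription acts as a compile-time check of what was inferred:

```scala
object InferenceSketch {
  // Hypothetical Spark-free stand-in for DataSet: it only records column names
  final case class DataSet[Schema](columns: Seq[String])

  // No type argument and no expected type: nothing constrains Schema
  val inferred = DataSet(Seq("Id", "Val"))

  // DataSet is invariant in Schema, so this ascription compiles only if
  // the value above was inferred as DataSet[Nothing]
  val proof: DataSet[Nothing] = inferred

  // An expected type is enough for the compiler to pick a different Schema
  val fromExpected: DataSet[Any] = DataSet(Seq("Id", "Val"))
}
```

Because validate is called on the already-typed result of DataSet(df), the column types in its argument list come too late to influence Schema.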

answered 2020-07-06T07:46:41.493