Given the dataframes df_a and df_b, how can I get the same result as a left excluding join:

SELECT df_a.*
FROM df_a
  LEFT JOIN df_b
    ON df_a.id = df_b.id
WHERE df_b.id is NULL

I tried:

df_a.join(df_b, df_a("id")===df_b("id"), "left")
  .select($"df_a.*")
  .where(df_b.col("id").isNull)

I get an exception from the above:

Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.runtime.BoxedUnit ()

2 Answers

If you want to do this with the DataFrame API, try the following example:

  import sqlContext.implicits._
  val df1 = sc.parallelize(List("a", "b", "c")).toDF("key1")
  val df2 = sc.parallelize(List("a", "b")).toDF("key2")

  import org.apache.spark.sql.functions._

  df1.join(df2,
    df1.col("key1") <=> df2.col("key2"),
    "left")
    .filter(col("key2").isNull)
    .show

You will get the output:

+----+----+
|key1|key2|
+----+----+
|   c|null|
+----+----+
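
Applied to the `df_a` and `df_b` from the question, the same idea might look like the sketch below. It assumes Spark 2.0 or later, a local `SparkSession`, and made-up sample data; the object and method names are hypothetical. Aliasing both sides lets `a.*` resolve after the join, so only the left-hand columns come back. The second method uses the `"left_anti"` join type, which returns the rows of the left side that have no match on the right.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LeftExcludingJoin {
  // Left join, keep the rows with no match in dfB, return only dfA's columns.
  def viaLeftJoin(dfA: DataFrame, dfB: DataFrame): DataFrame =
    dfA.as("a")
      .join(dfB.as("b"), col("a.id") === col("b.id"), "left")
      .where(col("b.id").isNull)
      .select(col("a.*"))

  // Anti join: rows of dfA whose id does not appear in dfB.
  def viaAntiJoin(dfA: DataFrame, dfB: DataFrame): DataFrame =
    dfA.join(dfB, Seq("id"), "left_anti")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("left-excluding-join")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for df_a and df_b.
    val df_a = Seq((1, "x"), (2, "y"), (3, "z")).toDF("id", "value")
    val df_b = Seq((1, "p"), (2, "q")).toDF("id", "other")

    // Only id 3 has no counterpart in df_b, so each result holds that single row.
    viaLeftJoin(df_a, df_b).show()
    viaAntiJoin(df_a, df_b).show()

    spark.stop()
  }
}
```

The anti join avoids materialising the right-hand columns only to filter and drop them, which is why it is often the shorter way to write this query.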
answered 2017-04-11T05:24:14.953

You can also try running the SQL query itself, to keep it simple:

df_a.registerTempTable("TableA")
df_b.registerTempTable("TableB")
result = sqlContext.sql("SELECT A.* FROM TableA A \
                          LEFT JOIN TableB B \
                          ON A.id = B.id \
                          WHERE B.id is NULL ")
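
Since the question is written in Scala, a rough Scala counterpart of this SQL approach could look like the following. It is a sketch under assumptions: Spark 2.0 or later with a `SparkSession` (where `createOrReplaceTempView` replaces the deprecated `registerTempTable`), and the object and method names are hypothetical. The query selects `A.*` to match the column list asked for in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LeftExcludingSql {
  // Registers both frames as temporary views, then runs the left excluding join in SQL.
  def leftExcluding(spark: SparkSession, dfA: DataFrame, dfB: DataFrame): DataFrame = {
    dfA.createOrReplaceTempView("TableA")
    dfB.createOrReplaceTempView("TableB")
    // A triple-quoted string avoids the need for line continuations.
    spark.sql(
      """SELECT A.*
        |FROM TableA A
        |  LEFT JOIN TableB B
        |    ON A.id = B.id
        |WHERE B.id IS NULL""".stripMargin)
  }
}
```

Calling `LeftExcludingSql.leftExcluding(spark, df_a, df_b)` would then return a DataFrame with the columns of `df_a` only.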
answered 2017-04-11T01:53:57.337