apache-spark - Zero joins with null in a null-safe join

Question

I noticed that 0 joins with null when I use a null-safe join (eqNullSafe).

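from pyspark.sql import SparkSession

# Reuse the active session (as in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, ), (None, )], ['df1_id'])
df2 = spark.createDataFrame([(None, ), (0, )], ['df2_id'])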

df1.join(df2, df1.df1_id.eqNullSafe(df2.df2_id), 'right').show()
#+------+------+
#|df1_id|df2_id|
#+------+------+
#|  null|     0|
#|  null|  null|
#+------+------+

df2.join(df1, df1.df1_id.eqNullSafe(df2.df2_id), 'left').show()
#+------+------+
#|df2_id|df1_id|
#+------+------+
#|     0|  null|
#|  null|  null|
#+------+------+

How can I make null join only with null?

score 1 · Accepted Answer

You need to use an inner join here:

df1.join(df2, df1.df1_id.eqNullSafe(df2.df2_id), 'inner').show()

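In your example, the 0 in the right DataFrame has no match in the left DataFrame. Because you are doing a right join, PySpark still keeps the 0 from the right DataFrame and fills df1_id with null for that row. So 0 is not actually matching null; the null next to it is just the placeholder for "no match". An inner join drops unmatched rows, so only the null/null pair remains.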
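To make the difference concrete, here is a rough plain-Python model of the two join types. It is only a sketch of the semantics, not Spark's implementation: None stands in for SQL NULL, and the helper names eq_null_safe, inner_join and right_join are made up for this illustration.

```python
# Plain-Python model of the join semantics (this is NOT Spark code).
# None stands in for SQL NULL; all helper names are hypothetical.

def eq_null_safe(a, b):
    """Mimic eqNullSafe: NULL equals NULL, and NULL never equals a value."""
    # In Python, None == None is True and None == 0 is False,
    # which matches null-safe equality for this small example.
    return a == b


def inner_join(left, right):
    """Keep only the pairs for which the null-safe condition holds."""
    return [(l, r) for l in left for r in right if eq_null_safe(l, r)]


def right_join(left, right):
    """Keep every right row; unmatched right rows get None on the left side."""
    rows = []
    for r in right:
        matches = [(l, r) for l in left if eq_null_safe(l, r)]
        rows.extend(matches if matches else [(None, r)])
    return rows


df1_ids = [1, None]
df2_ids = [None, 0]

inner_rows = inner_join(df1_ids, df2_ids)
right_rows = right_join(df1_ids, df2_ids)

print(inner_rows)  # [(None, None)]
print(right_rows)  # [(None, None), (None, 0)]
```

The (None, 0) row only shows up in the right join, and its None is padding for the unmatched 0 rather than a matched null from df1.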
