Here is my join:
df = df_small.join(df_big, 'id', 'leftanti')
It looks like I can only broadcast the dataframe on the right side of the join. But for my logic to work (a leftanti join), df_small has to be on the left.
How can I broadcast the dataframe that is on the left side of the join?
Example:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df_small = spark.range(2)
df_big = spark.range(1, 5000000)
# df_small df_big
# +---+ +-------+
# | id| | id|
# +---+ +-------+
# | 0| | 1|
# | 1| | 2|
# +---+ | ...|
# |4999999|
# +-------+
df_small = F.broadcast(df_small)  # hint Spark to broadcast the small (left) dataframe
df = df_small.join(df_big, 'id', 'leftanti')
df.show()
df.explain()
# +---+
# | id|
# +---+
# | 0|
# +---+
#
# == Physical Plan ==
# AdaptiveSparkPlan isFinalPlan=false
# +- SortMergeJoin [id#197L], [id#199L], LeftAnti
# :- Sort [id#197L ASC NULLS FIRST], false, 0
# : +- Exchange hashpartitioning(id#197L, 200), ENSURE_REQUIREMENTS, [id=#1406]
# : +- Range (0, 2, step=1, splits=2)
# +- Sort [id#199L ASC NULLS FIRST], false, 0
# +- Exchange hashpartitioning(id#199L, 200), ENSURE_REQUIREMENTS, [id=#1407]
# +- Range (1, 5000000, step=1, splits=2)
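For comparison, placing the hint on the right-hand (big) dataframe does make Spark pick a broadcast join, but then it is df_big that gets broadcast, which defeats the purpose. A rough sketch of what I see (plan abbreviated from memory; ids and exact layout will differ):
df = df_small.join(F.broadcast(df_big), 'id', 'leftanti')
df.explain()
# == Physical Plan == (roughly)
# AdaptiveSparkPlan isFinalPlan=false
# +- BroadcastHashJoin [id#...], [id#...], LeftAnti, BuildRight
#    :- Range (0, 2, step=1, splits=2)
#    +- BroadcastExchange ...
#       +- Range (1, 5000000, step=1, splits=2)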