You can join the two dataframes on id_no and start_date, then coalesce the amount and days columns, taking the values from df2 first:
import pyspark.sql.functions as f

# Outer join on the key columns, then take df2's (b) value
# wherever it exists, falling back to df1's (a) value
df1.alias('a').join(
    df2.alias('b'), ['id_no', 'start_date'], how='outer'
).select(
    'id_no', 'start_date',
    f.coalesce('b.amount', 'a.amount').alias('amount'),
    f.coalesce('b.days', 'a.days').alias('days')
).show()
+-----+----------+------+----+
|id_no|start_date|amount|days|
+-----+----------+------+----+
| 1|2016-01-06| 3456| 20|
| 2|2016-01-20| 2345| 19|
| 1|2016-01-03| 4456| 22|
| 3|2016-02-02| 1345| 19|
| 2|2016-01-15| 1234| 45|
| 1|2016-01-01| 8650| 52|
| 2|2016-01-02| 7130| 65|
+-----+----------+------+----+
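Note that the argument order of f.coalesce determines which dataframe wins: listing the b (df2) column first means df2's value is used whenever both dataframes contain the key. To prefer df1 instead, simply swap the arguments:

# Prefer df1's (a) value, falling back to df2's (b)
f.coalesce('a.amount', 'b.amount').alias('amount')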
If you have more columns to merge:
cols = ['amount', 'days']

df1.alias('a').join(
    df2.alias('b'), ['id_no', 'start_date'], how='outer'
).select(
    'id_no', 'start_date',
    # Apply the same coalesce pattern to each column in the list
    *(f.coalesce('b.' + col, 'a.' + col).alias(col) for col in cols)
).show()
+-----+----------+------+----+
|id_no|start_date|amount|days|
+-----+----------+------+----+
| 1|2016-01-06| 3456| 20|
| 2|2016-01-20| 2345| 19|
| 1|2016-01-03| 4456| 22|
| 3|2016-02-02| 1345| 19|
| 2|2016-01-15| 1234| 45|
| 1|2016-01-01| 8650| 52|
| 2|2016-01-02| 7130| 65|
+-----+----------+------+----+
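If the list of non-key columns is long, it can also be derived from the schema instead of written out by hand. A minimal sketch, assuming both dataframes share the same join keys:

keys = ['id_no', 'start_date']
# Every column of df2 except the join keys
cols = [c for c in df2.columns if c not in keys]

df1.alias('a').join(
    df2.alias('b'), keys, how='outer'
).select(
    *keys,
    *(f.coalesce('b.' + c, 'a.' + c).alias(c) for c in cols)
).show()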