python - Joining on multiple columns in PySpark

Question

I want to join two DataFrames that share the same column names.

My DataFrames look like this:

>>> sample3
DataFrame[uid1: string, count1: bigint]
>>> sample4
DataFrame[uid1: string, count1: bigint]


sample3
     uid1  count1
0    John       3
1    Paul       4
2  George       5

sample4
     uid1  count1
0    John       3
1    Paul       4
2  George       5

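(I deliberately used the same DataFrame under two different names.)

I looked at Spark JIRA issue 7197, where they address how to perform this join (which is inconsistent with the PySpark documentation). However, the approach they propose produces duplicate columns: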

>>> cond = (sample3.uid1 == sample4.uid1) & (sample3.count1 == sample4.count1)
>>> sample3.join(sample4, cond)
DataFrame[uid1: string, count1: bigint, uid1: string, count1: bigint]

I want a result in which the keys do not appear twice.

I can do this with a single column:

>>> sample3.join(sample4, 'uid1')
DataFrame[uid1: string, count1: bigint, count1: bigint]

However, the same syntax does not work for this kind of join and raises an error.

The result I want is:

DataFrame[uid1: string, count1: bigint]

How can I achieve this?

score 0 · Accepted Answer

In your case, you can define the join condition with a list of keys:

sample3.join(sample4, ['uid1','count1'])
