pyspark - How to compare each value with every other value in pyspark?

Question

I have a dataframe in Spark that looks like this:

  a    b
( 21 , 23 )
( 23 , 21 )
( 22 , 21 )
( 21 , 22 )

I want a dataframe that looks like this:

( 21 , 22 )
( 21 , 23 )
( 22 , 21 )
( 22 , 23 )
( 23 , 21 )
( 23 , 22 )

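So it should consider all possible combinations of the values in the two columns. How can I do this?

I tried a Cartesian join, but it takes too much time even for a very small dataset. Are there any other options?

Thanks.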

score 0 · Accepted Answer

Try:

zip(*pairs_rdd).flatten.deduplicate.foreach(n => (n,n-1)).cache()
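
The line above is not runnable as written. One reading of it is: flatten both columns into a single list of values, drop duplicates, then pair the values up. A minimal plain-Python sketch of that reading follows. It assumes the distinct values are few enough to fit on the driver (for example after collecting the rows), and `ordered_pairs` is a hypothetical helper name, not something from the answer.

```python
from itertools import permutations


def ordered_pairs(rows):
    """Return every ordered pair (x, y) with x != y from the distinct values in rows."""
    # Flatten the (a, b) tuples into one set so each value appears once.
    values = sorted({value for row in rows for value in row})
    # permutations(..., 2) yields each ordered pair of two different elements;
    # because the values are distinct, no pair repeats a value.
    return list(permutations(values, 2))


# Rows from the question, written as plain tuples.
rows = [(21, 23), (23, 21), (22, 21), (21, 22)]
pairs = ordered_pairs(rows)
```

Here `pairs` holds the six rows listed in the question. The pairing happens over the distinct values only, so its cost depends on how many distinct values there are rather than on the number of input rows.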

answered 2016-06-08T12:28:52.617

score 0 · Accepted Answer

It is hard to say why your join "took too much time" without seeing your code. I found that the following approach runs reasonably fast for me:

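from pyspark.sql import Row

# sqlContext is the SQLContext predefined in the pyspark shell
df = sqlContext.createDataFrame(
  [
    Row(a=21, b=22),
    Row(a=22, b=23),
  ]
)

# rename to avoid identical column names in the result
df_copy = df.alias('df_copy')
df_copy = df_copy.withColumnRenamed('a', 'a_copy')
df_copy = df_copy.withColumnRenamed('b', 'b_copy')

# no join condition is given, so each row of df is paired with each row of df_copy
df.join(df_copy, how='outer').select(df.a, df_copy.b_copy).collect()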
