pyspark - Compare two columns of different lengths

Question

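I am working with two PySpark DataFrames, each with a single column. One has 3 rows (ColumnA) and the other has 100 rows (ColumnB). I want to compare every row of ColumnA against each row of ColumnB. (I need to know whether any date in ColumnA is later than the date in ColumnB and, if so, put a 1 in ColumnX.)

Any suggestions would be greatly appreciated. Thanks!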


score 1 · Accepted Answer

A cross join is one solution.

For example:

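from pyspark.sql.functions import col, when
from pyspark.sql.types import IntegerType

A = [11, 2, 13, 4]
B = [5, 6]

# `spark` is the active SparkSession; a list of ints yields one column named "value"
df = spark.createDataFrame(A, IntegerType())
df1 = spark.createDataFrame(B, IntegerType())

# Pair every A with every B, then flag the pairs where A > B
result = (
    df.select(col("value").alias("A"))
    .crossJoin(df1.select(col("value").alias("B")))
    .withColumn("C", when(col("A") > col("B"), 1).otherwise(0))
)
result.select("A", "B", "C").show()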

+---+---+---+
|  A|  B|  C|
+---+---+---+
| 11|  5|  1|
| 11|  6|  1|
|  2|  5|  0|
|  2|  6|  0|
| 13|  5|  1|
| 13|  6|  1|
|  4|  5|  0|
|  4|  6|  0|
+---+---+---+
