pyspark - PySpark: adding an ID column and filtering on it gives wrong results

Question

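I have a dataset that currently has 233,465 rows and grows by about 10,000 rows per day. I need to randomly select rows from the full dataset for ML training. I added an "id" column to serve as an index.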

from pyspark.sql.functions import col, monotonically_increasing_id
spark_df = n_data.withColumn("id", monotonically_increasing_id())

I run the following code, expecting it to return 5 rows whose id matches the "indices" list, with a count of 5.

indices = [1000, 999, 45, 1001, 1823, 123476]
result = spark_df.filter(col("id").isin(indices))
result.show()
print(result.count())

Instead, I get 3 rows, with the IDs 45, 1000, and 1001.

Any ideas about what might be wrong here? This seems like it should be simple.

Thanks!

score 0 · Accepted Answer

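There is no built-in function that assigns a unique, consecutive ID to each row. monotonically_increasing_id() only guarantees IDs that are unique and increasing, not consecutive: it encodes the partition ID in the upper bits of the value, so the IDs jump between partitions and a filter on small consecutive numbers can miss rows. There is, however, a workaround using window functions.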
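To see where the gaps come from: per the Spark documentation, monotonically_increasing_id() puts the partition ID in the upper 31 bits of a 64-bit integer and the record number within each partition in the lower 33 bits. The following is a plain-Python sketch of that layout, not Spark's actual implementation; the helper simulated_id and the partition sizes are hypothetical and do not reproduce the asker's exact data.

```python
# Plain-Python sketch of the documented ID layout (not Spark's implementation).

def simulated_id(partition_index: int, row_in_partition: int) -> int:
    """Partition index in the upper bits, per-partition row number in the lower 33 bits."""
    return (partition_index << 33) + row_in_partition

# Hypothetical layout: three partitions holding 1002, 500 and 500 rows.
partition_sizes = [1002, 500, 500]
ids = [
    simulated_id(p, r)
    for p, size in enumerate(partition_sizes)
    for r in range(size)
]

# The first row of partition 1 does not get ID 1002; it jumps to 2**33.
print(ids[1002])  # 8589934592

# Filtering on small consecutive numbers only hits rows in partition 0.
indices = [1000, 999, 45, 1001, 1823, 123476]
id_set = set(ids)
found = [i for i in indices if i in id_set]
print(found)  # [1000, 999, 45, 1001]
```

In this simulated layout, 1823 and 123476 are never generated even though the dataset has 2,002 rows. The window-based workaround that follows avoids these gaps.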

df = spark.createDataFrame([(3,),(7,),(9,),(1,),(-3,),(5,)], ["values"])
df.show()

+------+
|values|
+------+
|     3|
|     7|
|     9|
|     1|
|    -3|
|     5|
+------+



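from pyspark.sql import Window, functions as F

# row_number() over a window ordered by the temporary ID yields consecutive IDs 1..N.
# Note: a window without partitionBy moves all rows to a single partition.
df = (df.withColumn('dummy', F.monotonically_increasing_id())
       .withColumn('ID', F.row_number().over(Window.orderBy('dummy')))
       .drop('dummy'))
df.show()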

+------+---+
|values| ID|
+------+---+
|     3|  1|
|     7|  2|
|     9|  3|
|     1|  4|
|    -3|  5|
|     5|  6|
+------+---+
