python - How to create a sequential number column in a PySpark DataFrame?

Question

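I want to create a column of sequential numbers in a PySpark DataFrame, starting from a specified number. For example, I want to add a column A to my DataFrame df that starts at 5 and increments by 1 up to the length of my DataFrame: 5, 6, 7, ..., length(df).

Is there a simple solution using PySpark methods?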

score 1 · Accepted Answer

Three simple steps:

from pyspark.sql.window import Window

from pyspark.sql.functions import monotonically_increasing_id, row_number

df = df.withColumn("row_idx", row_number().over(Window.orderBy(monotonically_increasing_id())))

score 1 · Accepted Answer

You can use range to do this:

df_len = 100
freq = 1
# spark.range(start, end, step): the end value is exclusive
ref = spark.range(
    5, df_len, freq
).toDF("id")
ref.show(10)

+---+
| id|
+---+
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
+---+

only showing top 10 rows
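Because spark.range excludes its end value, just as Python's built-in range does, the bounds 5 and df_len produce fewer than df_len values. The arithmetic can be checked in plain Python without a Spark session. The helper range_bounds below is hypothetical and only illustrates how the end bound could be sized to the row count:

```python
def range_bounds(start, n_rows, step=1):
    """Return (start, end, step) yielding exactly n_rows values.

    Hypothetical helper: assumes the end bound is exclusive,
    as in Python's range and spark.range.
    """
    return start, start + n_rows * step, step


# Bounds from the answer: end = df_len = 100, so the values stop at 99
answer_values = range(5, 100, 1)
print(len(answer_values), answer_values[-1])  # 95 99

# Bounds sized to the row count: 100 values, 5 through 104
start, end, step = range_bounds(5, 100)
sized_values = range(start, end, step)
print(len(sized_values), sized_values[0], sized_values[-1])  # 100 5 104
```

The same tuple could be passed on as spark.range(*range_bounds(5, df.count())). Note that the result is a separate DataFrame, so attaching it to df still requires a join key.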

score 0 · Accepted Answer

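This worked for me. It puts sequential values into the column, starting at seed + 1.

from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

seed = 23
df = df.withColumn('label', seed + dense_rank().over(Window.orderBy('column')))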
