pyspark - 如何在更短的时间内向初始 DataFrame 添加大量列（5000~1000 列）？

Question

我想对我创建的 pyspark 管道进行某种压力测试，并想测试输入数据帧的列（从 Hive 检索）是否增加到 2x ,5x 倍，那么管道将如何工作？

我尝试使用 for 循环创建数据框中已经存在的数字列的重复列：

for i in range(5000):
    df = df.withcolumn('abc_'+i,df.col1)

但这需要很多时间。有什么有效的方法吗？

score 0 · Accepted Answer

withColumn 方法可能会有一些开销，请尝试使用functionsand select，例如：

>>> dup_cols = [F.col('col_1').alias("abc_{}".format(i)) for i in range(1,10)]
>>> df_duplicated = df.select(df.columns + dup_cols)
>>> df.printSchema()
root
 |-- col_1: string (nullable = true)
 |-- date: string (nullable = true)
 |-- value: long (nullable = true)
 |-- id_1: string (nullable = true)
 |-- id_2: string (nullable = true)
 |-- id_3: string (nullable = true)
 |-- id_4: string (nullable = true)
 |-- id_5: string (nullable = true)
 |-- id_6: string (nullable = true)
 |-- id_7: string (nullable = true)
 |-- id_8: string (nullable = true)
 |-- id_9: string (nullable = true)

无论如何，由于这种操作在 Spark 中是惰性评估的，我不知道大量重复的列是否可以有效地针对实际大量不同的列进行测试。如果原始数据也以列优化格式（如镶木地板）保存，则这种差异可能会更大。

pyspark - 如何在更短的时间内向初始 DataFrame 添加大量列（5000~1000 列）？

1 回答 1

Related

Reference