
I have a DataFrame with a user_tag column, and I want to replace its values with new random UUIDs. How can I do that?

--------------------------------------
| user_tag  |  pref_code  |  name    |
--------------------------------------
| abc123    |  Reg        |  Richard |
| abc123    |  Reg        |  Mort    |
| abc123    |  Disc       |  Jack    |

I want to generate a random UUID for user_tag in Spark, so that the result looks like this:

-------------------------------------------------------------------
| user_tag                                |  pref_code  |  name    |
-------------------------------------------------------------------
| af3fb8b8-7ceb-4cec-ac27-2a034bb44bb9    |  Reg        |  Richard |
| snc22fls-2cgb-sas2-hc26-43d35ggg4522    |  Reg        |  Mort    |
| afgdw8b8-4fss-ycec-ycd7-haj3jbbj4bj9    |  Disc       |  Jack    |

I tried the following, but it produces the same UUID for every row:

val withUUID = dataFrame.withColumn("user_tag", 
  when(col("user_tag") === "abc123", randomUUID.toString).otherwise(col("user_tag")))

1 Answer

You can create a UDF and call it inside the when/otherwise expression.
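
The reason a UDF helps can be sketched in plain Scala, without Spark. In the question's attempt, randomUUID.toString is an ordinary Scala expression, so it is evaluated once and the resulting string is handed to when(...) as a constant. The helper names below (tagOnce, tagPerRow) are hypothetical and exist only for this illustration.

```scala
import java.util.UUID

object UuidEvaluationDemo {
  // Hypothetical helper: the UUID is computed a single time and reused,
  // which mirrors passing randomUUID.toString directly to when(...).
  def tagOnce(names: Seq[String]): Seq[String] = {
    val once = UUID.randomUUID().toString
    names.map(_ => once)
  }

  // Hypothetical helper: a fresh UUID is computed for each element,
  // which mirrors a UDF that runs once per row.
  def tagPerRow(names: Seq[String]): Seq[String] =
    names.map(_ => UUID.randomUUID().toString)

  def main(args: Array[String]): Unit = {
    val names = Seq("Richard", "Mort", "Jack")
    println(tagOnce(names).distinct.size)   // one value shared by every element
    println(tagPerRow(names).distinct.size) // one value per element
  }
}
```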

Example:

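import org.apache.spark.sql.functions.{udf, when}
import spark.implicits._ // for toDF and the $"..." syntax (already in scope in spark-shell)

// UDF that generates a fresh random UUID each time it is called
val rand_UUID = udf(() => java.util.UUID.randomUUID().toString)

val df = Seq(("abc123", "Reg", "Richard"), ("abc123", "Reg", "Mort"))
  .toDF("user_tag", "pref_code", "name")

// Replace matching tags with a per-row UUID; leave all other values untouched
df.withColumn("user_tag", when($"user_tag" === "abc123", rand_UUID())
  .otherwise($"user_tag"))
  .show(false)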

Result:

+------------------------------------+---------+-------+
|user_tag                            |pref_code|name   |
+------------------------------------+---------+-------+
|e0b3c917-dcc5-4c42-bfe3-32af18b1cfec|Reg      |Richard|
|90098d7d-8dc7-42df-a89b-5bd7f2c5cd99|Reg      |Mort   |
+------------------------------------+---------+-------+

Basically, the UDF is invoked for every matching row, so each of those rows gets its own random UUID.
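
Two variations may be worth considering on Spark 2.3 or later; neither comes from the answer above, so treat this as a sketch under stated assumptions. The first swaps the UDF for Spark SQL's built-in uuid() function via expr. The second keeps the UDF but flags it with asNondeterministic(), so the optimizer does not assume repeated calls return the same value. The object and method names, and the local SparkSession settings, are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, expr, udf, when}

object UuidVariants {
  // Variant 1: built-in SQL function uuid() (assumes Spark 2.3+), no UDF needed.
  def withBuiltinUuid(df: DataFrame, target: String): DataFrame =
    df.withColumn("user_tag",
      when(col("user_tag") === target, expr("uuid()")).otherwise(col("user_tag")))

  // Variant 2: the same kind of UDF, explicitly marked nondeterministic (assumes Spark 2.3+).
  private val randUuid =
    udf(() => java.util.UUID.randomUUID().toString).asNondeterministic()

  def withUdfUuid(df: DataFrame, target: String): DataFrame =
    df.withColumn("user_tag",
      when(col("user_tag") === target, randUuid()).otherwise(col("user_tag")))

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("uuid-sketch").getOrCreate()
    import spark.implicits._

    // Rows taken from the question
    val df = Seq(("abc123", "Reg", "Richard"), ("abc123", "Reg", "Mort"), ("abc123", "Disc", "Jack"))
      .toDF("user_tag", "pref_code", "name")

    withBuiltinUuid(df, "abc123").show(false)
    withUdfUuid(df, "abc123").show(false)
    spark.stop()
  }
}
```

With either variant the values are generated at evaluation time, so a DataFrame that is recomputed can end up with different UUIDs; if the tags need to stay stable, it is probably safest to write the result out, or cache it, before reusing it.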

answered 2019-07-18T02:33:24.083