I'm trying to write a SQL query to use in PySpark to scrub information from a PySpark df. The df I want to modify looks like this:
hashed_customer  firstname    lastname    email    order_id  status    timestamp
eater 1_uuid     1_firstname  1_lastname  1_email  12345     OPTED_IN  2020-05-14 20:45:15
eater 2_uuid     2_firstname  2_lastname  2_email  23456     OPTED_IN  2020-05-14 20:29:22
eater 3_uuid     3_firstname  3_lastname  3_email  34567     OPTED_IN  2020-05-14 19:31:55
eater 4_uuid     4_firstname  4_lastname  4_email  45678     OPTED_IN  2020-05-14 17:49:27
I have another PySpark df with the customers that need to be removed from the customer_temp_tb table, which looks like this:
hashed_customer  eaterstatus
eater 1_uuid     OPTED_OUT
eater 3_uuid     OPTED_OUT
I'm trying to write a SQL query to use in PySpark that will remove the firstname, lastname, and email from the first table if the customer is in the second table. Something like:
UPDATE customer_temp_tb
SET firstname="", lastname="", email=""
WHERE hashed_customer IN
    (SELECT hashed_customer FROM opt_out_temp_tb)
So that the end result would look like this:
hashed_customer  firstname    lastname    email    order_id  status    timestamp
eater 1_uuid     NaN          NaN         NaN      12345     OPTED_IN  2020-05-14 20:45:15
eater 2_uuid     2_firstname  2_lastname  2_email  23456     OPTED_IN  2020-05-14 20:29:22
eater 3_uuid     NaN          NaN         NaN      34567     OPTED_IN  2020-05-14 19:31:55
eater 4_uuid     4_firstname  4_lastname  4_email  45678     OPTED_IN  2020-05-14 17:49:27
The problem I seem to be running into is that PySpark doesn't support UPDATE. Are there any alternatives?