
My dataset:

+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+-------------------+-------------------+-----------+----+---------------+-----------------+-----------+------------+----------------------+------------------+---------+
|          event_time|event_type|product_id|        category_id|       category_code|   brand|  price|  user_id|        user_session|   Event_time_NoUTC|    Event_timestamp|day_of_week|hour|primaryCategory|secondaryCategory|eventVisits|productCount|secondaryCategoryCount|     AvgCatExpense|SessCount|
+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+-------------------+-------------------+-----------+----+---------------+-----------------+-----------+------------+----------------------+------------------+---------+
|2019-10-06 07:04:...|      view|   1004565|2053013555631882655|electronics.smart...|  huawei| 169.84|231943435|428ebb99-3568-4e1...|2019-10-06 07:04:50|2019-10-06 07:04:50|          1|   7|    electronics|       smartphone|          1|           1|                     1| 380.2349402627628|        1|
|2019-10-25 03:50:...|      view|   5100337|2053013553341792533|  electronics.clocks|   apple| 319.34|266287781|f55edf02-3fd4-48f...|2019-10-25 03:50:28|2019-10-25 03:50:28|          6|   3|    electronics|           clocks|          7|           7|                     7| 369.7054359810376|        4|
|2019-10-25 03:52:...|      view|   1005105|2053013555631882655|electronics.smart...|   apple|1397.09|266287781|118dbcd6-fe31-4cc...|2019-10-25 03:52:09|2019-10-25 03:52:09|          6|   3|    electronics|       smartphone|          7|           7|                     7| 369.7054359810376|        4|
|2019-10-26 12:15:...|      view|   6000157|2053013560807654091|auto.accessories....|starline|  91.12|266287781|992d03b4-c561-4fb...|2019-10-26 12:15:56|2019-10-26 12:15:56|          7|  12|           auto|      accessories|          7|           7|                     7| 369.7054359810376|        4|

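There are three event types: view, cart, and purchase. For each user_id and product_id pair, I want to add a new column is_purchased that is 1 if the pair has a purchase event and 0 otherwise. After that, I will drop the redundant rows as illustrated below, which should help me classify whether a customer will churn.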

(Image: illustration of removing the redundant rows)

I am thinking of partitioning the data by user_id and product_id and then labeling the pairs that were purchased. How would you approach this problem?


2 Answers


You can also apply a window function to collect all events for each user and product, and then filter (I am using the same sample data as @werner):

from pyspark.sql import functions as F
from pyspark.sql import Window as W

(df
    # Attach the distinct events of each (user, product) pair to every row
    .withColumn('events', F.collect_set('event').over(W.partitionBy('user', 'product')))
    # Flag the pairs whose event set includes a purchase
    .withColumn('is_purchased', F.array_contains(F.col('events'), 'purchase'))
    # Keep only the cart rows
    .where(F.col('event') == 'cart')
    .show(10, False)
)

+----+-------+-----+------------------+----------------------+------------+
|user|product|event|other             |events                |is_purchased|
+----+-------+-----+------------------+----------------------+------------+
|A   |123    |cart |other attributes 2|[cart, view, purchase]|true        |
|B   |abc    |cart |other attributes 4|[cart]                |false       |
+----+-------+-----+------------------+----------------------+------------+
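
To make the window semantics concrete without a Spark session, here is a minimal plain-Python sketch of what the chain above computes. It is not PySpark code: the helper name `annotate_with_group_events` is hypothetical, and the rows are the four sample rows from the other answer. The point it illustrates is that a window annotates every row with its group's result instead of collapsing the group, so the filter on `cart` can still run afterwards.

```python
from collections import defaultdict

# Sample rows from the other answer: (user, product, event, other)
rows = [
    ("A", 123, "view", "other attributes 1"),
    ("A", 123, "cart", "other attributes 2"),
    ("A", 123, "purchase", "other attributes 3"),
    ("B", 123, "cart", "other attributes 4"),
]


def annotate_with_group_events(rows):
    """Hypothetical helper mimicking collect_set(...).over(partitionBy(user, product))."""
    events_by_key = defaultdict(set)
    for user, product, event, _ in rows:
        events_by_key[(user, product)].add(event)
    # Every input row is kept and receives its group's event set and flag.
    return [
        (user, product, event, other,
         events_by_key[(user, product)],
         "purchase" in events_by_key[(user, product)])
        for user, product, event, other in rows
    ]


annotated = annotate_with_group_events(rows)

# Equivalent of .where(F.col('event') == 'cart')
cart_rows = [r for r in annotated if r[2] == "cart"]
for user, product, event, _other, _events, is_purchased in cart_rows:
    print(user, product, event, is_purchased)
```

All four rows survive the annotation step; only the final filter reduces them to the two cart rows.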
answered 2021-05-22T03:58:45.547

Step 1: Group the data by user and product, and flag whether each group contains a purchase event:

from pyspark.sql import functions as F

data = [("A",123, "view", "other attributes 1"),
        ("A",123, "cart", "other attributes 2"),
        ("A",123, "purchase", "other attributes 3"),
        ("B",123, "cart", "other attributes 4")]
df = spark.createDataFrame(data, schema = ["user", "product", "event", "other"])

is_purchased = df.groupBy("user", "product").agg(
    F.array_contains(F.collect_set("event"), "purchase").alias("is_purchased"))

# +----+-------+------------+
# |user|product|is_purchased|
# +----+-------+------------+
# |   A|    123|        true|
# |   B|    123|       false|
# +----+-------+------------+

Step 2: Join the result of step 1 with the original data and filter out the redundant rows:

result = df.join(is_purchased, on=["user", "product"], how="left") \
    .filter("event= 'cart'")

# +----+-------+-----+------------------+------------+
# |user|product|event|             other|is_purchased|
# +----+-------+-----+------------------+------------+
# |   A|    123| cart|other attributes 2|        true|
# |   B|    123| cart|other attributes 4|       false|
# +----+-------+-----+------------------+------------+
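
The question asks for is_purchased as 1/0 and uses the column names user_id, product_id, and event_type, whereas the code above yields a boolean on simplified names. The following is a rough plain-Python sketch of the same two steps on the question's column names, assuming the rows are available as dictionaries. The row values are illustrative and not taken from the real dataset.

```python
# Rows keyed by the question's column names; the values are made up for illustration.
events = [
    {"user_id": 1, "product_id": 10, "event_type": "view"},
    {"user_id": 1, "product_id": 10, "event_type": "cart"},
    {"user_id": 1, "product_id": 10, "event_type": "purchase"},
    {"user_id": 2, "product_id": 10, "event_type": "cart"},
]

# Step 1: one flag per (user_id, product_id) pair, True once a purchase is seen.
purchased = {}
for e in events:
    key = (e["user_id"], e["product_id"])
    purchased[key] = purchased.get(key, False) or e["event_type"] == "purchase"

# Step 2: join the flag back onto the rows, keep only cart rows, emit 1/0.
labeled = [
    {**e, "is_purchased": int(purchased[(e["user_id"], e["product_id"])])}
    for e in events
    if e["event_type"] == "cart"
]

for row in labeled:
    print(row)
```

In PySpark, the equivalent of the `int(...)` conversion is a `.cast("int")` on the boolean is_purchased column.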
answered 2021-05-04T18:10:28.870