
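I have some clickstream data that I want to run an attribution analysis on in a particular way, but first I need to get it into a specific format for both converting and non-converting users.

Representative data: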

df <- structure(list(User_ID = c(2001, 2001, 2001, 2002, 2001, 2002, 
                             2001, 2002, 2002, 2003, 2003, 2001, 2002, 2002, 2001), Session_ID = c("1001", 
                                                                                                   "1002", "1003", "1004", "1005", "1006", "1007", "Not Set", "Not Set", 
                                                                                                   "Not Set", "Not Set", "Not Set", "1008", "1009", "Not Set"), 
                 Date_time = structure(c(1540103940, 1540104060, 1540104240, 
                                         1540318080, 1540318680, 1540318859, 1540314360, 1540413060, 
                                         1540413240, 1540538460, 1540538640, 1540629660, 1540755060, 
                                         1540755240, 1540803000), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
                 Source = c("Facebook", "Facebook", "Facebook", "Google", 
                            "Email", "Google", "Email", "Referral", "Referral", "Facebook", 
                            "Facebook", "Google", "Referral", "Direct", "Direct"), Conversion = c(0, 
                                                                                                  0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1)), class = c("spec_tbl_df", 
                                                                                                                                                        "tbl_df", "tbl", "data.frame"), row.names = c(NA, -15L), spec = structure(list(
                                                                                                                                                          cols = list(User_ID = structure(list(), class = c("collector_double", 
                                                                                                                                                                                                            "collector")), Session_ID = structure(list(), class = c("collector_character", 
                                                                                                                                                                                                                                                                    "collector")), Date_time = structure(list(format = ""), class = c("collector_datetime", 
                                                                                                                                                                                                                                                                                                                                      "collector")), Source = structure(list(), class = c("collector_character", 
                                                                                                                                                                                                                                                                                                                                                                                          "collector")), Conversion = structure(list(), class = c("collector_double", 
                                                                                                                                                                                                                                                                                                                                                                                                                                                  "collector"))), default = structure(list(), class = c("collector_guess", 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "collector")), skip = 1), class = "col_spec"))

Then set the classes:

library(dplyr)

df <- df %>% 
  mutate(User_ID    = as.factor(User_ID),
         Session_ID = as.factor(Session_ID),
         Date_time  = as.POSIXct(Date_time)
         )

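I want to get the path of all visits that led a user to a purchase, or the total path of visits that did not lead to a purchase.

The new column, path, would have a format such as Facebook > Facebook > Facebook > Email > Email for user 2001, which I know how to build with mutate(path = paste0(Source, collapse = " > "))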
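As a minimal sketch of that per-user collapse: the toy data frame below is hypothetical and is not the data from the question, and it uses summarise() rather than mutate() so that each user ends up with one row. Grouping by User_ID first keeps each user's sources together in row order.

```r
library(dplyr)

# Hypothetical toy data, only the two columns needed for the path string
toy <- data.frame(
  User_ID = c(1, 1, 2, 1, 2),
  Source  = c("Facebook", "Email", "Google", "Email", "Direct"),
  stringsAsFactors = FALSE
)

paths <- toy %>%
  group_by(User_ID) %>%
  # paste0() with collapse joins all of a user's sources, in row order
  summarise(path = paste0(Source, collapse = " > "))

print(paths)
```

This only produces one path per user; it does not yet split a user's journey at each conversion.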

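The complications are:

  • Most session IDs are "Not Set", meaning they are missing
  • Some users may convert multiple times
  • Some users may convert and then return without converting again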
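One way to check how much each of these complications affects a data set is sketched below. The toy data frame and the output column names (n_conversions, ends_without_converting) are assumptions made for illustration, and the sketch assumes rows are already in chronological order within each user.

```r
library(dplyr)

# Hypothetical toy clickstream using the question's column names
toy <- data.frame(
  User_ID    = c(1, 1, 1, 2, 2, 3),
  Session_ID = c("1001", "Not Set", "Not Set", "1002", "Not Set", "Not Set"),
  Conversion = c(0, 1, 1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Share of rows whose session ID is missing ("Not Set")
share_missing <- mean(toy$Session_ID == "Not Set")

per_user <- toy %>%
  group_by(User_ID) %>%
  summarise(
    # More than 1 here means the user converted multiple times
    n_conversions = sum(Conversion),
    # TRUE when the user's last row is not a conversion, i.e. the user
    # has a trailing non-converting journey (or never converted at all)
    ends_without_converting = last(Conversion) == 0
  )

print(share_missing)
print(per_user)
```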

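Each row would be either:

  • A conversion by a user ID. Most converting users convert only once, but some may convert multiple times, in which case there would be one row per conversion. The path column would reflect the journey to that conversion; for a user's second or subsequent conversion, it would show only the path since the previous conversion.
  • Or a non-converting user journey, with its total path in the format above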

For the reprex above, the result would look like this:

# A tibble: 5 x 5
  User_ID Session_ID Date_time           Conversion Path                                          
    <dbl> <chr>      <dttm>                   <dbl> <chr>                                         
1    2001 1007       2018-10-23 17:06:00          1 Facebook > Facebook > Facebook > Email > Email
2    2002 Not Set    2018-10-24 20:34:00          1 Google > Google > Referral > Referral         
3    2003 Not Set    2018-10-26 07:24:00          0 Facebook > Facebook                           
4    2002 1009       2018-10-28 19:34:00          0 Referral > Direct                             
5    2001 Not Set    2018-10-29 08:50:00          1 Google > Direct     

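... where:

  • User 2001 converted twice, and the two paths are shown separately;
  • User 2002 converted and then returned later without converting, so the converting and non-converting paths are shown as separate rows;
  • User 2003 never converted, so that single path is shown.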

1 Answer


Here's an approach using dplyr:

df2 <- df %>%
  # Add a column to distinguish between known and unknown sessions
  mutate(known_session = Session_ID != "Not Set") %>%

  # For each user, split between known and unknown sessions...
  group_by(User_ID, known_session) %>%
  # Sort first by Session ID, then time
  arrange(Session_ID, Date_time) %>%
  # Track which # path they're on. Start with path #1; 
  #   new path if prior event was a conversion
  mutate(path_num = cumsum(lag(Conversion, default = 0)) + 1) %>%

  # Regroup by path so that each path is labelled on its own
  group_by(User_ID, known_session, path_num) %>%
  # Label each path journey by combining all of its sources
  mutate(Path = paste0(Source, collapse = " > ")) %>%
  # Just keep last step in each path
  filter(row_number() == n()) %>%
  ungroup() %>%

  # Tidying up with just the desired columns, chronological
  select(User_ID, Session_ID, Date_time, Conversion, Path) %>%
  arrange(Date_time)

I get slightly different results from yours, but I think they correspond to the sample data provided:

> df2
# A tibble: 5 x 5
  User_ID Session_ID Date_time           Conversion Path                                          
  <fct>   <fct>      <dttm>                   <dbl> <chr>                                         
1 2001    1007       2018-10-23 17:06:00          1 Facebook > Facebook > Facebook > Email > Email
2 2002    Not Set    2018-10-24 20:34:00          1 Referral > Referral                           
3 2003    Not Set    2018-10-26 07:24:00          0 Facebook > Facebook                           
4 2002    1009       2018-10-28 19:34:00          0 Google > Google > Referral > Direct           
5 2001    Not Set    2018-10-29 08:50:00          1 Google > Direct  
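
The sample data never has two conversions inside the same user and known/unknown-session group, so here is a small sketch of how the cumsum(lag(Conversion, default = 0)) + 1 step numbers paths when that does happen. The toy data frame is hypothetical and represents a single user's events, already in order.

```r
library(dplyr)

# Hypothetical events for one user, with two conversions
toy <- data.frame(
  Source     = c("Facebook", "Email", "Google", "Direct", "Referral"),
  Conversion = c(0, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)

toy_paths <- toy %>%
  # A new path starts on the row after each conversion
  mutate(path_num = cumsum(lag(Conversion, default = 0)) + 1) %>%
  group_by(path_num) %>%
  summarise(
    Conversion = last(Conversion),                 # did this path end in a conversion?
    Path       = paste0(Source, collapse = " > ")  # sources within this path only
  )

print(toy_paths)
```

Because lag() shifts the conversion flag down by one row, the converting event itself stays in the path it closes, and the counter only increases on the following row.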
Answered 2019-01-13T17:03:26.297