I have clickstream data for an ecommerce website. Some customers can opt to buy the product using a loan / finance option. Unfortunately, this creates a new referral source (labeled 'finance' in the reprex below) as well as one or more new sessions.

I would like to replace the source 'finance' with the source of the same user's preceding session.

In the example, all observations for sessions 4-6871.2 and 4-6871.3 would have the source '(direct)', as per session 4-6871.1, and all observations for session 3-6871.1 would have the source 'google', as per session 3-6871.0.

I need to do this on a much larger data set, so I need logic that finds sessions with the 'finance' source and replaces each instance of 'finance' with the source of that user's immediately preceding session.

reprex data via dput:

structure(list(userId = c("6.154032", "6.154032", "6.154032", 
"6.154032", "6.154032", "6.154032", "6.154032", "6.154032", "6.154032", 
"8.154036", "8.154036", "8.154036", "8.154036", "8.154036", "8.154036", 
"8.154036", "8.154036", "8.154036", "8.154036", "8.154036", "8.154036", 
"8.154036", "8.154036"), session_Id = c("4-6871.0", "4-6871.0", 
"4-6871.0", "4-6871.1", "4-6871.1", "4-6871.1", "4-6871.2", "4-6871.2", 
"4-6871.3", "3-6871.0", "3-6871.0", "3-6871.0", "3-6871.0", "3-6871.0", 
"3-6871.1", "3-6871.1", "3-6871.1", "3-6871.1", "3-6871.1", "3-6871.1", 
"3-6871.1", "3-6871.1", "3-6871.1"), timeStamp = structure(c(1540294773, 
1540294828, 1540294841, 1540307321, 1540307341, 1540307718, 1540308709, 
1540308749, 1540311289, 1540330293, 1540330309, 1540330475, 1540330541, 
1540330663, 1540331041, 1540331164, 1540331168, 1540331312, 1540331459, 
1540331465, 1540331579, 1540331603, 1540331630), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), source = c("(direct)", "(direct)", 
"(direct)", "(direct)", "(direct)", "(direct)", "finance", "finance", 
"finance", "google", "google", "google", "google", "google", 
"finance", "finance", "finance", "finance", "finance", "finance", 
"finance", "finance", "finance")), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -23L))

1 Answer



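library(dplyr)  # provides arrange() and lag()

# Sort by user and time so each 'finance' run directly follows the source it should inherit
df <- arrange(df, userId, timeStamp)
# Collapse consecutive identical sources into runs
tmp <- rle(df$source)
# Replace each 'finance' run with the value of the run immediately before it
tmp$values[tmp$values == "finance"] <- lag(tmp$values)[tmp$values == "finance"]
df$source <- inverse.rle(tmp)
table(df$source)
# (direct)   google 
#        9       14 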
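
Note that the run-length approach above looks only at the previous run of rows, not at the user. If a user's very first session is 'finance', it would inherit the last source of the previous user in the sorted data. A possible variant that stays within each user is sketched below. It assumes the tidyr package is available, and `toy` and `fixed` are hypothetical names with made-up data used only for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: user "b" starts with a 'finance' session,
# so there is no earlier source of their own to inherit.
toy <- tibble(
  userId    = c("a", "a", "a", "b", "b"),
  timeStamp = as.POSIXct("2018-10-23", tz = "UTC") + c(0, 60, 120, 0, 60),
  source    = c("google", "finance", "finance", "finance", "(direct)")
)

fixed <- toy %>%
  arrange(userId, timeStamp) %>%
  group_by(userId) %>%
  # Blank out 'finance', then carry the last real source forward within the user
  mutate(source = na_if(source, "finance")) %>%
  fill(source, .direction = "down") %>%
  ungroup()
```

In this sketch a leading 'finance' session with no earlier source for the same user is left as NA rather than borrowing another user's source; whether NA, 'finance', or something else is the right value there depends on the analysis.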


Answered 2018-10-28T00:19:43.887