r - Replacing the source in click-stream data

Question

I have clickstream data for an ecommerce website. Some customers can opt to buy the product using a loan / finance option. Unfortunately this creates a new referral source - in the reprex below labeled 'finance'. It also creates a new session or sessions.

I would like to replace the source 'finance' with the source of the same user's preceding session.

In the example, all observations for sessions 4-6871.2 and 4-6871.3 would have the source '(direct)', as per session 4-6871.1, and those for session 3-6871.1 would have 'google', as per session 3-6871.0.

I need to do this on a much larger data set, so I need logic that finds sessions with the 'finance' source and replaces each 'finance' value with the source of that user's immediately preceding session.

reprex data via dput:

structure(list(userId = c("6.154032", "6.154032", "6.154032", 
"6.154032", "6.154032", "6.154032", "6.154032", "6.154032", "6.154032", 
"8.154036", "8.154036", "8.154036", "8.154036", "8.154036", "8.154036", 
"8.154036", "8.154036", "8.154036", "8.154036", "8.154036", "8.154036", 
"8.154036", "8.154036"), session_Id = c("4-6871.0", "4-6871.0", 
"4-6871.0", "4-6871.1", "4-6871.1", "4-6871.1", "4-6871.2", "4-6871.2", 
"4-6871.3", "3-6871.0", "3-6871.0", "3-6871.0", "3-6871.0", "3-6871.0", 
"3-6871.1", "3-6871.1", "3-6871.1", "3-6871.1", "3-6871.1", "3-6871.1", 
"3-6871.1", "3-6871.1", "3-6871.1"), timeStamp = structure(c(1540294773, 
1540294828, 1540294841, 1540307321, 1540307341, 1540307718, 1540308709, 
1540308749, 1540311289, 1540330293, 1540330309, 1540330475, 1540330541, 
1540330663, 1540331041, 1540331164, 1540331168, 1540331312, 1540331459, 
1540331465, 1540331579, 1540331603, 1540331630), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), source = c("(direct)", "(direct)", 
"(direct)", "(direct)", "(direct)", "(direct)", "finance", "finance", 
"finance", "google", "google", "google", "google", "google", 
"finance", "finance", "finance", "finance", "finance", "finance", 
"finance", "finance", "finance")), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -23L))

score 1 · Accepted Answer

There may be something in your full data structure that invalidates this solution, but here is one candidate:

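library(dplyr)  # arrange() and lag() below are the dplyr versions

# Sort so each user's rows are contiguous and in time order
df <- arrange(df, userId, timeStamp)
# Collapse consecutive identical sources into runs
tmp <- rle(df$source)
# Replace each "finance" run with the source of the run just before it
tmp$values[tmp$values == "finance"] <- lag(tmp$values)[tmp$values == "finance"]
# Expand the runs back to one value per row
df$source <- inverse.rle(tmp)
table(df$source)
# (direct)   google 
#        9       14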

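The arrange() call makes sure the rows are in the right order. Then, assuming no user's very first source is 'finance', the rle() and replacement steps swap every 'finance' entry for the one preceding it. The assumption matters because the lag is taken over the whole column rather than per user: if a user's first source were 'finance', it would be replaced with the previous user's last source.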
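If that assumption might not hold in the full data, one possible variant is to carry the last non-'finance' source forward within each user instead of across the whole column. The sketch below is not the method from the answer above; it uses base R only, the helper name fill_finance and the small toy data frame are made up for illustration, and it leaves 'finance' untouched when a user has no earlier source to borrow from.

```r
# Hypothetical helper: for one user's time-ordered sources, replace each
# "finance" with the most recent non-"finance" value seen before it.
fill_finance <- function(src) {
  keep <- src != "finance"
  idx <- cumsum(keep)           # position of the latest non-finance entry so far
  out <- src
  has_prev <- idx > 0           # FALSE while the user has only had "finance"
  out[has_prev] <- src[keep][idx[has_prev]]
  out
}

# Toy data (illustrative only): user "b" starts with "finance"
df <- data.frame(
  userId    = c("a", "a", "a", "b", "b", "b"),
  timeStamp = as.POSIXct(c(1, 2, 3, 1, 2, 3), origin = "1970-01-01", tz = "UTC"),
  source    = c("(direct)", "finance", "finance", "finance", "google", "finance"),
  stringsAsFactors = FALSE
)

# Order by user and time, then apply the helper separately to each user
df <- df[order(df$userId, df$timeStamp), ]
df$source <- ave(df$source, df$userId, FUN = fill_finance)
```

Here user "a" ends up with '(direct)' on every row, while user "b" keeps the leading 'finance' (nothing precedes it) and has the trailing one replaced by 'google'. Whether a leading 'finance' should be kept, dropped, or flagged is a decision for the real data set.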
