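I have been trying to do a somewhat tedious merge on one exact match and one partial match (on very large data). I have tried several approaches (using pmatch, str_detect, grep, and sapply) and got some close results, but I am looking for an elegant solution. Any help or insight would be appreciated.
Another, longer route I found is to do a regular merge on the common field (SessionId) and then write a for loop like the following: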
for (i in seq_len(nrow(my.test.daa))) {
  # pmatch() gives 1 when Link_URL is a prefix of Referer, NA otherwise
  my.test.daa$Part_match[i] <- pmatch(my.test.daa$Link_URL[i], my.test.daa$Referer[i])
  # ... use index i to also get the other columns from the dataset frame
}
NEW data - with duplicates
pattern <- data.frame(SessionId = I(c("5b8cc8794a02ba868db21faef1",
"5b8cc8794a02ba868db21faef2",
"5b8cc8794a02ba868db21faef3",
"5b8cc8794a02ba868db21faef4",
"5b8cc8794a02ba868db21faef5",
"5b8cc8794a02ba868db21faef1")),
URL = I(c("somewebsite.com/abc/detail/110302288511/",
"somewebsite.com/abc/detail/110302288512/",
"somewebsite.com/abc/detail/110302288513/",
"somewebsite.com/abc/detail/110302288514/",
"somewebsite.com/abc/detail/110302288511/",
"somewebsite.com/abc/detail/110302288512/"
)))
dataset <- data.frame(SessionId = I(c("5b8cc8794a02ba868db21faef1",
"5b8cc8794a02ba868db21faef3",
"5b8cc8794a02ba868db21faef5",
"5b8cc8794a02ba868db21faef7",
"5b8cc8794a02ba868db21faef1"
)),
Referer = I(c("somewebsite.com/abc/detail/110302288511/110302288512/",
"somewebsite.com/abc/detail/110302288513/1103022815/",
"somewebsite.com/abc/detail/110302288513/11030228/",
"somewebsite.com/abc/detail/110302288465464/",
"somewebsite.com/abc/detail/110302288512/46545465/"
)))
OLD - here is sample code for the data.frames:
pattern <- data.frame(SessionId = I(c("5b8cc8794a02ba868db21faef1",
"5b8cc8794a02ba868db21faef2",
"5b8cc8794a02ba868db21faef3",
"5b8cc8794a02ba868db21faef4",
"5b8cc8794a02ba868db21faef5",
"5b8cc8794a02ba868db21faef6")),
URL = I(c("somewebsite.com/abc/detail/110302288511/",
"somewebsite.com/abc/detail/110302288512/",
"somewebsite.com/abc/detail/110302288513/",
"somewebsite.com/abc/detail/110302288514/",
"somewebsite.com/abc/detail/110302288511/",
"somewebsite.com/abc/detail/110302288512/"
)))
dataset <- data.frame(SessionId = I(c("5b8cc8794a02ba868db21faef1",
"5b8cc8794a02ba868db21faef3",
"5b8cc8794a02ba868db21faef5",
"5b8cc8794a02ba868db21faef7",
"5b8cc8794a02ba868db21faef2"
)),
Referer = I(c("somewebsite.com/abc/detail/110302288511/110302288512/",
"somewebsite.com/abc/detail/110302288513/1103022815/",
"somewebsite.com/abc/detail/110302288513/11030228/",
"somewebsite.com/abc/detail/110302288465464/",
"somewebsite.com/abc/detail/1103022846546/"
)))
NEW output - with duplicates
SessionId URL Referer
5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288511/ somewebsite.com/abc/detail/110302288511/110302288512/
5b8cc8794a02ba868db21faef3 somewebsite.com/abc/detail/110302288513/ somewebsite.com/abc/detail/110302288513/1103022815/
5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288512/ somewebsite.com/abc/detail/110302288512/46545465/
So the OLD output needs to look like this:
SessionId URL Referer
5b8cc8794a02ba868db21faef1 somewebsite.com/abc/detail/110302288511/ somewebsite.com/abc/detail/110302288511/110302288512/
5b8cc8794a02ba868db21faef3 somewebsite.com/abc/detail/110302288513/ somewebsite.com/abc/detail/110302288513/1103022815/
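For reference, here is a minimal vectorized sketch of the merge-then-filter route described above, run on the NEW data. It assumes that "partial match" means URL is a literal prefix of Referer (which is what pmatch checks), and it uses plain character columns instead of I() for simplicity. The names `candidates` and `result` are only illustrative.

```r
# NEW data from the question, as plain character columns (assumption: I() is not needed here)
pattern <- data.frame(
  SessionId = c("5b8cc8794a02ba868db21faef1", "5b8cc8794a02ba868db21faef2",
                "5b8cc8794a02ba868db21faef3", "5b8cc8794a02ba868db21faef4",
                "5b8cc8794a02ba868db21faef5", "5b8cc8794a02ba868db21faef1"),
  URL = c("somewebsite.com/abc/detail/110302288511/",
          "somewebsite.com/abc/detail/110302288512/",
          "somewebsite.com/abc/detail/110302288513/",
          "somewebsite.com/abc/detail/110302288514/",
          "somewebsite.com/abc/detail/110302288511/",
          "somewebsite.com/abc/detail/110302288512/"),
  stringsAsFactors = FALSE
)
dataset <- data.frame(
  SessionId = c("5b8cc8794a02ba868db21faef1", "5b8cc8794a02ba868db21faef3",
                "5b8cc8794a02ba868db21faef5", "5b8cc8794a02ba868db21faef7",
                "5b8cc8794a02ba868db21faef1"),
  Referer = c("somewebsite.com/abc/detail/110302288511/110302288512/",
              "somewebsite.com/abc/detail/110302288513/1103022815/",
              "somewebsite.com/abc/detail/110302288513/11030228/",
              "somewebsite.com/abc/detail/110302288465464/",
              "somewebsite.com/abc/detail/110302288512/46545465/"),
  stringsAsFactors = FALSE
)

# Exact match: pair every pattern row with every dataset row of the same SessionId
candidates <- merge(pattern, dataset, by = "SessionId")

# Partial match: keep only pairs where URL is a literal prefix of Referer
# (startsWith() is base R >= 3.3.0 and is vectorized over both arguments)
result <- candidates[startsWith(candidates$Referer, candidates$URL), ]
rownames(result) <- NULL
print(result)
```

On the NEW data this keeps three rows, the same SessionId/URL/Referer combinations as the NEW output, though the row order may differ. If the URL may appear anywhere inside Referer rather than only at the start, the filter could be swapped for `mapply(grepl, candidates$URL, candidates$Referer, MoreArgs = list(fixed = TRUE))`. Note that the intermediate merge grows with the number of rows sharing a SessionId, which may matter on very large data.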