This is a follow-up to my earlier question, Merging vectors of strings in a list in R.
This time I tried a different approach, using data.table.
I have a data.table G, built as follows:
library(data.table)

d <- list(c("SD1:LUSH", "SD44:CANCEL", "SD384:FR563", "SD32:TRUMPET"),
          c("SD23:SWITCH", "SD1:LUSH", "SD567:TREK"),
          c("SD42:CRAYON", "SD345:FOX", "SD183:WIRE"),
          c("SD345:HOLE", "SD340:DUST", "SD387:ROLL"),
          c("SD455:TOMATO", "SD39:MATURE"),
          c("SD12:PAINTING", "SD315:MONEY31", "SD387:SPRING"),
          c("SD32:TRUMPET", "SD1:FIELD"))
# Keep only the SD prefix (the part before ":") of every element
d2 <- lapply(d, function(x) sapply(strsplit(x, ":"), "[", 1))
# Collapse each vector into a single comma-separated string,
# then build the keyed data.table (setting the key sorts the rows by sd)
G <- data.table(sdw = sapply(d, paste0, collapse = ", "),
                sd  = sapply(d2, paste0, collapse = ", "),
                key = "sd")
G
   sdw                                               sd
1: SD1:LUSH, SD44:CANCEL, SD384:FR563, SD32:TRUMPET  SD1, SD44, SD384, SD32
2: SD12:PAINTING, SD315:MONEY31, SD387:SPRING        SD12, SD315, SD387
3: SD23:SWITCH, SD1:LUSH, SD567:TREK                 SD23, SD1, SD567
4: SD32:TRUMPET, SD1:FIELD                           SD32, SD1
5: SD345:HOLE, SD340:DUST, SD387:ROLL                SD345, SD340, SD387
6: SD42:CRAYON, SD345:FOX, SD183:WIRE                SD42, SD345, SD183
7: SD455:TOMATO, SD39:MATURE                         SD455, SD39
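I am trying to aggregate the elements of the sdw column according to the elements of the sd column.
Rows [1], [3] and [4] of G have SD1 in common, so their sdw elements should be merged together. Rows [1] and [4] also share SD32.
Row [5] has SD345 in common with row [6] and SD387 in common with row [2], so the sdw elements of rows [2], [5] and [6] should be merged together.
Row [7] has no SD__ term in common with any other row, so it should stay as it is.
In short, I want to aggregate the G$sdw elements based on overlapping SD__ terms in G$sd.
The output I am looking for is as follows, with only three rows.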
[1] "SD1:LUSH, SD1:FIELD, SD23:SWITCH, SD32:TRUMPET, SD44:CANCEL, SD384:FR563, SD567:TREK"
[2] "SD12:PAINTING, SD42:CRAYON, SD183:WIRE, SD315:MONEY31, SD340:DUST, SD345:FOX, SD345:HOLE, SD387:SPRING, SD387:ROLL"
[3] "SD455:TOMATO, SD39:MATURE"
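Put differently, the groups I am after look like the connected components of a graph whose nodes are the rows of G and whose edges join any two rows that share an SD__ term. The following base R sketch is one possible way to express that rule with a small union-find. It is not data.table code, and the names sd_col, sdw_col, first_seen and find_root are made up for illustration; the two vectors are copied from G as printed above.

```r
# Columns of G as plain character vectors (copied from the printed table)
sd_col <- c("SD1, SD44, SD384, SD32", "SD12, SD315, SD387", "SD23, SD1, SD567",
            "SD32, SD1", "SD345, SD340, SD387", "SD42, SD345, SD183",
            "SD455, SD39")
sdw_col <- c("SD1:LUSH, SD44:CANCEL, SD384:FR563, SD32:TRUMPET",
             "SD12:PAINTING, SD315:MONEY31, SD387:SPRING",
             "SD23:SWITCH, SD1:LUSH, SD567:TREK",
             "SD32:TRUMPET, SD1:FIELD",
             "SD345:HOLE, SD340:DUST, SD387:ROLL",
             "SD42:CRAYON, SD345:FOX, SD183:WIRE",
             "SD455:TOMATO, SD39:MATURE")

ids <- strsplit(sd_col, ", ", fixed = TRUE)

# parent[i] points towards the representative row of row i's group
parent <- seq_along(ids)
find_root <- function(i) {
  while (parent[i] != i) i <- parent[i]
  i
}

# Link each row to the first row in which every one of its SD terms was seen
first_seen <- list()
for (i in seq_along(ids)) {
  for (id in ids[[i]]) {
    if (is.null(first_seen[[id]])) {
      first_seen[[id]] <- i
    } else {
      ra <- find_root(i)
      rb <- find_root(first_seen[[id]])
      if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
    }
  }
}

# Group label of each row = its representative row
group <- vapply(seq_along(ids), find_root, numeric(1))

# Merge the sdw strings of all rows that ended up in the same group
merged <- unname(vapply(split(sdw_col, group), function(x) {
  paste(sort(unique(unlist(strsplit(x, ", ", fixed = TRUE)))), collapse = ", ")
}, character(1)))

length(merged)
```

On the seven example rows this yields three merged strings. The order of terms inside each string depends on how sort() collates in the current locale, so it may differ from the order I wrote by hand above.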
I tried the data.table package as follows:
# Extract the individual SD IDs from G$sd (one row per ID)
G <- G[ , list( ID = unlist( strsplit( sd , "," ) ) ) , by = list(sdw, sd) ]
G$ID <- gsub(" ", "", G$ID)
G <- data.table( G , key = "ID" )
# Merge the rows that share an ID
G2 <- G[, list(Gp1 = paste0(sort(unique(unlist(strsplit(sdw, split=", ")))), collapse=", "),
               Gp2 = paste0(sort(unique(unlist(strsplit(sd, split=", ")))), collapse=", ")) , by = "ID"]
G2 <- data.table( G2, key="Gp2")
# Keep one row per distinct Gp2 (recent data.table versions need an explicit `by`)
G2 <- unique(G2, by = "Gp2")
G2
   ID     Gp1                                                                                   Gp2
1: SD1    SD1:FIELD, SD1:LUSH, SD23:SWITCH, SD32:TRUMPET, SD384:FR563, SD44:CANCEL, SD567:TREK  SD1, SD23, SD32, SD384, SD44, SD567
2: SD23   SD1:LUSH, SD23:SWITCH, SD567:TREK                                                     SD1, SD23, SD567
3: SD32   SD1:FIELD, SD1:LUSH, SD32:TRUMPET, SD384:FR563, SD44:CANCEL                           SD1, SD32, SD384, SD44
4: SD387  SD12:PAINTING, SD315:MONEY31, SD340:DUST, SD345:HOLE, SD387:ROLL, SD387:SPRING        SD12, SD315, SD340, SD345, SD387
5: SD12   SD12:PAINTING, SD315:MONEY31, SD387:SPRING                                            SD12, SD315, SD387
6: SD345  SD183:WIRE, SD340:DUST, SD345:FOX, SD345:HOLE, SD387:ROLL, SD42:CRAYON                SD183, SD340, SD345, SD387, SD42
7: SD183  SD183:WIRE, SD345:FOX, SD42:CRAYON                                                    SD183, SD345, SD42
8: SD340  SD340:DUST, SD345:HOLE, SD387:ROLL                                                    SD340, SD345, SD387
9: SD39   SD39:MATURE, SD455:TOMATO                                                             SD39, SD455
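This merges only on the basis of a single SD__ term repeated across rows of G$sd. It does not account for rows that share several terms with one another, nor for a row that shares different terms with different rows.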
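A quick check on the Gp2 column of the result makes the gap visible: if the grouping were complete, no two result rows would still share an SD__ term. The base R sketch below counts the pairs of result rows that still overlap. The names gp2, sets and shares_term are hypothetical helpers for this illustration, and the strings are copied from the output above.

```r
# Gp2 values copied from the nine result rows
gp2 <- c("SD1, SD23, SD32, SD384, SD44, SD567",
         "SD1, SD23, SD567",
         "SD1, SD32, SD384, SD44",
         "SD12, SD315, SD340, SD345, SD387",
         "SD12, SD315, SD387",
         "SD183, SD340, SD345, SD387, SD42",
         "SD183, SD345, SD42",
         "SD340, SD345, SD387",
         "SD39, SD455")
sets <- strsplit(gp2, ", ", fixed = TRUE)

# TRUE when two sets of SD terms have at least one term in common
shares_term <- function(a, b) any(a %in% b)

# Test every pair of result rows and keep the pairs that still overlap
pairs <- combn(length(sets), 2)
still_overlap <- apply(pairs, 2, function(p) {
  shares_term(sets[[p[1]]], sets[[p[2]]])
})
overlapping <- pairs[, still_overlap, drop = FALSE]

ncol(overlapping)
```

Any count above zero means the per-ID merge stopped after one step, so rows that are linked only through a chain of shared terms were left in separate groups.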
Is there any way to do this in R? My full dataset has thousands of such rows.