string - Identifying multiple IDs in a complex string using R

Question

I have a data frame containing many strings and values, like this:

ID String                                                    Value
1  LocationID=123,321,345&TimeID=456,321,789&TypeID=12,32    100
2  LocationID=123,345&TimeID=456,321                         50
3  LocationID=123,321,345&TypeID=32                          120
...

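As you can see in the example, "," means "or", so LocationID=123,321,345 refers to the elements whose location ID is 123, 321, or 345. "Value" can be thought of as the number of entries that satisfy the string.

I want to write a program in R that computes the number of occurrences of each ID. That is, the output of the program should be: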

ID                Occurrence
LocationID = 123  270                          #(100+50+120)
LocationID = 321  220                          #(100+120)
...
TypeID = 12       100
...

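Can anyone give me some advice on how to accomplish this?

I find it very hard to deal with the "," and the IDs. Otherwise I could use a for loop, although I hate for loops...

One more thing: the IDs should be allowed to be empty or to contain characters, as shown below: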

ID String                                                    Value
1  LocationID=123,321,345&TimeID=456,321,789&TypeID=         100
2  LocationID=123,345&TimeID=&TypeID=A                       50
3  LocationID=123,321,345&TypeID=32                          120

score 5 · Accepted Answer

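Try this. lapply2 is like lapply, except that it rbinds the result afterwards. We split the String column on "&" and put the result into s. Then we compute a new data frame, dat2, with one row per ID name. For the sample data, row 1 has 3 ID names, row 2 has 2, and row 3 has 2, so dat2 has 3+2+2 = 7 rows. In a similar way, we explode dat2 to produce dat3, with one row per individual ID value. As part of this we use strapplyc (from the gsubfn package) to simplify extracting all the occurrences. Finally, we use aggregate to compute the result.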

library(gsubfn)

lapply2 <- function(...) do.call("rbind", lapply(...))

s <- strsplit(dat$String, "&")

dat2 <- lapply2(1:nrow(dat), function(i) 
     data.frame(
            String = I(s[[i]]), 
            Value = dat$Value[i]
     )
)

dat3 <- lapply2(1:nrow(dat2), function(i) 
     data.frame(
            String = sub("=.*", "", dat2$String[i]), 
            Occurrence = strapplyc(dat2$String[i], "\\d+")[[1]], 
            Value = dat2$Value[i]
     )
)

ag <- aggregate(Value ~ String + Occurrence, dat3, sum)

The result is:

> ag
      String Occurrence Value
1 LocationID        123   270
2 LocationID        321   220
3     TimeID        321   150
4 LocationID        345   270
5     TimeID        456   150
6     TimeID        789   100
7     TypeID         12   100
8     TypeID         32   220
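
The pattern "\\d+" above picks up only runs of digits, so it would not see the "A" in the asker's second example. The following is a minimal base R sketch of one way to handle empty and non-numeric IDs, not part of the original answer. The names dat_b, long_b and ag_b are hypothetical, and it assumes every part of the string has the form name=list.

```r
# Sample rows from the asker's second example (empty and character IDs)
dat_b <- data.frame(
  ID = 1:3,
  String = c("LocationID=123,321,345&TimeID=456,321,789&TypeID=",
             "LocationID=123,345&TimeID=&TypeID=A",
             "LocationID=123,321,345&TypeID=32"),
  Value = c(100, 50, 120),
  stringsAsFactors = FALSE
)

# One element per "name=list" part, with the row's Value repeated alongside
parts <- strsplit(dat_b$String, "&", fixed = TRUE)
part  <- unlist(parts)
value <- rep(dat_b$Value, lengths(parts))

# Text before the first "=" is the ID name; text after it is the ID list
key  <- sub("=.*", "", part)
vals <- strsplit(sub("^[^=]*=", "", part), ",", fixed = TRUE)

# strsplit() gives character(0) for an empty list, so empty IDs drop out here
long_b <- data.frame(
  String     = rep(key, lengths(vals)),
  Occurrence = unlist(vals),
  Value      = rep(value, lengths(vals)),
  stringsAsFactors = FALSE
)

ag_b <- aggregate(Value ~ String + Occurrence, long_b, sum)
```

Because the IDs are kept as character strings rather than converted to numbers, "A" and "32" can sit in the same column.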

score 2 · Accepted Answer

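G. Grothendieck's answer is much better, but since I had already started working on a solution, here it is. This sticks to base R and involves a long lapply. Assuming your data is named "mydata":

First, split the "String" column on the ampersand: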

temp1 <- strsplit(mydata$String, "&")

Second, here is a complicated anonymous function called inside lapply. I have commented the steps so you can see what is happening.

temp2 <- do.call(
  "rbind", 
  lapply(seq_along(temp1), function(x) {
    # Set the pattern we're going to look for
    pattern <- "(.*)=(.*)"
    # Extract names and values
    Name <- gsub(pattern, "\\1", temp1[[x]])
    Measure <- gsub(pattern, "\\2", temp1[[x]])
    # Split the Measure value, and create a data.frame
    Output <- lapply(strsplit(Measure, ","), function(x) 
      data.frame(as.numeric(x)))
    names(Output) <- Name             # Add the names back to the list
    Output <- do.call(rbind, Output)  # rbind the sub-lists
    # Move the rownames to a column
    Output$Param <- gsub("(.*)\\.[0-9]+", "\\1", rownames(Output))
    rownames(Output) <- NULL          # Clean up the rownames
    names(Output)[1] <- "Measure"     # Rename the measure variable
    # Make a nice dataframe with your original data too.
    data.frame(ID = mydata[x, "ID"], Output, Value = mydata[x, "Value"])
  }))

The result looks like this:

temp2
#    ID Measure      Param Value
# 1   1     123 LocationID   100
# 2   1     321 LocationID   100
# 3   1     345 LocationID   100
# 4   1     456     TimeID   100
# 5   1     321     TimeID   100
# 6   1     789     TimeID   100
# 7   1      12     TypeID   100
# 8   1      32     TypeID   100
# 9   2     123 LocationID    50
# 10  2     345 LocationID    50
# 11  2     456     TimeID    50
# 12  2     321     TimeID    50
# 13  3     123 LocationID   120
# 14  3     321 LocationID   120
# 15  3     345 LocationID   120
# 16  3      32     TypeID   120

So now we can easily use aggregate on the output to get this:

aggregate(Value ~ Param + Measure, temp2, sum)
#        Param Measure Value
# 1     TypeID      12   100
# 2     TypeID      32   220
# 3 LocationID     123   270
# 4 LocationID     321   220
# 5     TimeID     321   150
# 6 LocationID     345   270
# 7     TimeID     456   150
# 8     TimeID     789   100

For convenience, here is the dput of the first few rows of the data:

mydata <- structure(list(ID = 1:3, 
                         String = c("LocationID=123,321,345&TimeID=456,321,789&TypeID=12,32",
                                    "LocationID=123,345&TimeID=456,321", 
                                    "LocationID=123,321,345&TypeID=32"), 
                         Value = c(100L, 50L, 120L)), 
                    .Names = c("ID", "String", "Value"), 
                    row.names = c(NA, -3L), 
                    class = "data.frame")

score 1 · Accepted Answer

Try the strsplit function. You can tokenize your string like this:

strsplit("LocationID=123,321,345&TimeID=456,321,789&TypeID=12,32","&"); ## this will tokenize by splitting by &;

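Then use grep to check for the presence of LocationID, TimeID and TypeID, strsplit each match on '=' and then on ',', and append the resulting values to an auxiliary data frame.

Finally, call tapply.

Hope this helps as a rough outline.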
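
The outline above can be sketched roughly as follows. This is only one possible reading of it: the names aux and totals are hypothetical, the three ID names are hard-coded as the outline suggests, and it assumes every ID name that is present has at least one value.

```r
# Sample rows from the question
dat <- data.frame(
  String = c("LocationID=123,321,345&TimeID=456,321,789&TypeID=12,32",
             "LocationID=123,345&TimeID=456,321",
             "LocationID=123,321,345&TypeID=32"),
  Value = c(100, 50, 120),
  stringsAsFactors = FALSE
)

# Auxiliary frame: one row per (ID name, ID value) pair
aux <- data.frame(Key = character(0), Value = numeric(0),
                  stringsAsFactors = FALSE)

for (i in seq_len(nrow(dat))) {
  # Tokenize the row on "&"
  tokens <- strsplit(dat$String[i], "&", fixed = TRUE)[[1]]
  for (name in c("LocationID", "TimeID", "TypeID")) {
    # grep finds the token for this ID name, if the row has one
    hit <- grep(paste0("^", name, "="), tokens, value = TRUE)
    if (length(hit) == 0) next
    # Split on "=" to get the list, then on "," to get the single IDs
    id_list <- strsplit(hit[1], "=", fixed = TRUE)[[1]][2]
    ids <- strsplit(id_list, ",", fixed = TRUE)[[1]]
    aux <- rbind(aux, data.frame(Key = paste(name, ids, sep = " = "),
                                 Value = dat$Value[i],
                                 stringsAsFactors = FALSE))
  }
}

# Sum Value within each key
totals <- tapply(aux$Value, aux$Key, sum)
```

Growing a data frame with rbind inside a loop is slow on large inputs, so this is meant to show the steps rather than to be efficient.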

score 1 · Accepted Answer

You could do something like this:

dat <- read.table(text = 'ID String                                                    Value
1  LocationID=123,321,345&TimeID=456,321,789&TypeID=12,32    100
2  LocationID=123,345&TimeID=456,321                         50
3  LocationID=123,321,345&TypeID=32                          120',header= T, stringsAsFactors=F)
## split by &
ll <- unlist(strsplit(dat$String,'&'))
## create 2 lists of occurrences and id names
occs <- strsplit(gsub('(.*)ID=(.*)','\\2',ll),',')
ids <- gsub('(.*)ID=(.*)','\\1',ll)
## pair each id name with its own values by position; the names repeat,
## so looking values up by name would always return the first match
ll <- Map(function(id, occ) paste(id, occ, sep = '_'), ids, occs)
## flatten the list, then count with table
table(unlist(ll))

Location_123 Location_321 Location_345     Time_321     Time_456     Time_789      Type_12      Type_32 
           3            2            3            2            2            1            1            2
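
Note that table counts how many rows mention each ID, whereas the question asks for the sum of Value over those rows. A hedged sketch of a weighted variant using xtabs follows; it is not part of the original answer, and the names w, labels, weights and sums are hypothetical.

```r
# Sample rows from the question
dat <- data.frame(
  String = c("LocationID=123,321,345&TimeID=456,321,789&TypeID=12,32",
             "LocationID=123,345&TimeID=456,321",
             "LocationID=123,321,345&TypeID=32"),
  Value = c(100, 50, 120),
  stringsAsFactors = FALSE
)

# Split by "&", remembering the Value of the row each part came from
parts <- strsplit(dat$String, "&", fixed = TRUE)
ll <- unlist(parts)
w  <- rep(dat$Value, lengths(parts))

# ID name (without the "ID" suffix) and the list of values for each part
ids  <- sub("ID=.*", "", ll)
occs <- strsplit(sub(".*ID=", "", ll), ",", fixed = TRUE)

# One label and one weight per individual ID value
labels  <- paste(rep(ids, lengths(occs)), unlist(occs), sep = "_")
weights <- rep(w, lengths(occs))

# xtabs sums the left-hand side within each label
sums <- xtabs(Value ~ Label, data = data.frame(Label = labels, Value = weights))
```

With the sample data, Location_123 sums to 270 (100+50+120), which matches the total the question expects.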
