r - Parsing data spread across multiple strings in R

Question

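I am trying to write code that parses a single column containing several pieces of information. For example, suppose I have the following data frame named df: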

  ids             info
1 101       red;circle
2 103      circle;blue
3 122        red;green
4 102 circle;red;green
5 213             blue
6 170         red;blue

When I run table(df), I get the following:

    table(df)
         info
    ids   blue circle;blue circle;red;green red;blue red;circle
      101    0           0                0        0          1
      102    0           0                1        0          0
      103    0           1                0        0          0
      122    0           0                0        0          0
      170    0           0                0        1          0
      213    1           0                0        0          0
         info
    ids   red;green
      101         0
      102         0
      103         0
      122         1
      170         0
      213         0

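What I want to do is (1) split the info column into two columns, one for shape and one for color, and (2) label any ID that has more than one color as "Multicolored". So I wrote the following: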

df$shape <- as.character(df$info)
for (i in 1:dim(df)[1]){
  if (grepl("circle",df$info[i])==TRUE) {
    df$shape[i] <- "circle" 
  } else if (grepl("circle",df$info[i])==FALSE) {
    df$shape[i]<-NA}
}
for (i in 1:dim(df)[1]){
  if (grepl(";",df$info[i])==TRUE) {
    df$info[i] <- "Multicolored" 
  } else {df$info[i]<-df$info[i]}
}

This code gives me the output:

df
  ids         info  shape
1 101 Multicolored circle
2 103 Multicolored circle
3 122 Multicolored   <NA>
4 102 Multicolored circle
5 213         blue   <NA>
6 170 Multicolored   <NA>

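As written, my code treats a case such as 101 red;circle as multicolored when it is not: it is just red and a circle. What is the right way to parse this data, given that "circle" can appear at the beginning, middle, or end of the info column? Any and all suggestions are welcome, thanks!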

score 1 · Accepted Answer

For this type of problem, it may make sense to split the strings and then work with the resulting character vectors. For example,

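# Split each info string on ";" (as.character guards against a factor column)
mystrings <- strsplit(as.character(df$info),";")
# Return the single match from s, `none` if nothing matches, `multiple` for 2 or 3 matches
getStrings <- function(x,s,none=NA_character_,multiple="Multicolored")
   switch(sum(x%in%s)+1,none,x[x%in%s],multiple,multiple)
df$shape <- sapply(mystrings,FUN=getStrings,s=c("circle"))
df$color <- sapply(mystrings,FUN=getStrings,s=c("red","green","blue"))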

Personally, I find this approach easier than trying to use pure regular expressions and if statements.
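To see what the split-then-match approach yields on the sample data, here is a self-contained sketch. The data frame is rebuilt inline from the question, and stringsAsFactors = FALSE is an assumption on my part so that strsplit receives character input.

```r
# Rebuild the question's sample data (assumed to be character, not factor)
df <- data.frame(
  ids  = c(101, 103, 122, 102, 213, 170),
  info = c("red;circle", "circle;blue", "red;green",
           "circle;red;green", "blue", "red;blue"),
  stringsAsFactors = FALSE
)

mystrings <- strsplit(df$info, ";")

# sum(x %in% s) + 1 selects the switch branch:
#   1 -> no match, 2 -> exactly one match, 3 or 4 -> two or three matches
getStrings <- function(x, s, none = NA_character_, multiple = "Multicolored")
  switch(sum(x %in% s) + 1, none, x[x %in% s], multiple, multiple)

df$shape <- sapply(mystrings, FUN = getStrings, s = c("circle"))
df$color <- sapply(mystrings, FUN = getStrings, s = c("red", "green", "blue"))

print(df)
```

One thing to keep in mind: a numeric switch() returns NULL when the index exceeds the number of alternatives, so a row with four or more matching tokens would need an extra branch or a different construct.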

score 0 · Accepted Answer

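I like @farnsy's answer, but I also want to post my solution, which is basically similar but does not require you to specify the colors (it assumes that everything that is not a shape is a color).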

# Load the data
df <- read.table(textConnection('ids             info
1 101       red;circle
2 103      circle;blue
3 122        red;green
4 102 circle;red;green
5 213             blue
6 170         red;blue'),stringsAsFactors=FALSE)

# Split your column.
split.col <- strsplit(df$info,';')
# Specify which words are considered shapes.
shapes <- c('circle') # Could include more
# Find which rows had shapes.
df$shape <- sapply(split.col, function(x) x[match(shapes,x)[1]]) # Only select one shape
# The rest must be colours, count them.
num.colours <- sapply(split.col, function(x) length(setdiff(x, shapes)))
df$multicoloured <- num.colours > 1

df
#   ids             info  shape multicoloured
# 1 101       red;circle circle         FALSE
# 2 103      circle;blue circle         FALSE
# 3 122        red;green   <NA>          TRUE
# 4 102 circle;red;green circle          TRUE
# 5 213             blue   <NA>         FALSE
# 6 170         red;blue   <NA>          TRUE
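If you would rather have a single colour column with a "Multicolored" label, as the question asks, the same setdiff idea can be extended. This is only a sketch under the same assumption that every non-shape token is a colour; the colour column name and the inline data frame are my own additions for illustration.

```r
# Sample data from the question, rebuilt inline (character column assumed)
df <- data.frame(
  ids  = c(101, 103, 122, 102, 213, 170),
  info = c("red;circle", "circle;blue", "red;green",
           "circle;red;green", "blue", "red;blue"),
  stringsAsFactors = FALSE
)

shapes <- c("circle")                      # words treated as shapes
split.col <- strsplit(df$info, ";")

# Everything that is not a shape is assumed to be a colour
colours <- lapply(split.col, setdiff, shapes)

# Collapse each row's colours to one label:
#   none -> NA, exactly one -> that colour, more than one -> "Multicolored"
df$colour <- vapply(colours, function(x) {
  if (length(x) == 0) NA_character_
  else if (length(x) == 1) x
  else "Multicolored"
}, character(1))

print(df)
```

vapply is used here instead of sapply so that the result is guaranteed to be a character vector, even if a row has no colours at all.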

score 0 · Accepted Answer

You could also try:

pat1 <- paste0(c("red","blue", "green"), collapse="|")
# Strip colours and separators; whatever remains is the shape
shape1 <- gsub(paste(pat1, ";", sep="|"), "", df$info)
shape1[shape1==''] <- NA
# Collect the colour matches per row, collapsing 2+ colours to "Multicolored"
colours1 <- lapply(regmatches(df$info, gregexpr(pat1, df$info)),
                   function(x) if(length(x)>1) "Multicolored" else x)
df[,c("info", "shape")] <- as.data.frame(do.call(rbind, Map(`c`, colours1, shape1)),
                                         stringsAsFactors=FALSE)

 df
 #  ids         info  shape
 #1 101          red circle
 #2 103         blue circle
 #3 122 Multicolored   <NA>
 #4 102 Multicolored circle
 #5 213         blue   <NA>
 #6 170 Multicolored   <NA>
