
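I'm trying to write code that parses a single column containing several pieces of information. For example, suppose I have the following data frame named df: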

  ids             info
1 101       red;circle
2 103      circle;blue
3 122        red;green
4 102 circle;red;green
5 213             blue
6 170         red;blue

When I run table(df), I get the following:

    table(df)
         info
    ids   blue circle;blue circle;red;green red;blue red;circle
      101    0           0                0        0          1
      102    0           0                1        0          0
      103    0           1                0        0          0
      122    0           0                0        0          0
      170    0           0                0        1          0
      213    1           0                0        0          0
         info
    ids   red;green
      101         0
      102         0
      103         0
      122         1
      170         0
      213         0

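What I want to do is (1) split the info column into two columns, one for shape and one for color, and (2) label any ID that has more than one color as "Multicolored". So I wrote the following: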

df$shape <- as.character(df$info)
for (i in 1:dim(df)[1]){
  if (grepl("circle",df$info[i])==TRUE) {
    df$shape[i] <- "circle" 
  } else if (grepl("circle",df$info[i])==FALSE) {
    df$shape[i]<-NA}
}
for (i in 1:dim(df)[1]){
  if (grepl(";",df$info[i])==TRUE) {
    df$info[i] <- "Multicolored" 
  } else {df$info[i]<-df$info[i]}
}

This code gives me the output:

df
  ids         info  shape
1 101 Multicolored circle
2 103 Multicolored circle
3 122 Multicolored   <NA>
4 102 Multicolored circle
5 213         blue   <NA>
6 170 Multicolored   <NA>

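As written, my code treats a case like 101 red;circle as multicolored when it isn't; it is just red and a circle. What is the correct way to parse this data when "circle" can appear at the beginning, middle, or end of the info column? Any and all suggestions are welcome, thanks!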


3 Answers


For this type of problem, it may make sense to split the strings and then work with the resulting character vectors. For example,

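# Split each info string into its tokens (as.character guards against a factor column).
mystrings <- strsplit(as.character(df$info), ";")
# Return `none` for no match, the token itself for one match, `multiple` for two or three.
getStrings <- function(x, s, none = NA_character_, multiple = "Multicolored")
  switch(sum(x %in% s) + 1, none, x[x %in% s], multiple, multiple)
df$shape <- sapply(mystrings, FUN = getStrings, s = c("circle"))
df$color <- sapply(mystrings, FUN = getStrings, s = c("red", "green", "blue"))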

Personally, I find this approach easier than trying to use pure regular expressions and if statements.
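To check the approach above against the sample data from the question, here is a minimal, self-contained run. The data frame is rebuilt by hand (an assumption about how df was created, with info stored as character rather than factor), and the values in the trailing comments are what the switch() branches yield for each row. Treat it as a sketch, not a definitive implementation.

```r
# Rebuild the sample data from the question (info kept as character, not factor).
df <- data.frame(
  ids  = c(101, 103, 122, 102, 213, 170),
  info = c("red;circle", "circle;blue", "red;green",
           "circle;red;green", "blue", "red;blue"),
  stringsAsFactors = FALSE
)

# One character vector of tokens per row.
mystrings <- strsplit(df$info, ";")

# Returns `none` for zero matches, the matched token for exactly one,
# and `multiple` for two or three matches.
getStrings <- function(x, s, none = NA_character_, multiple = "Multicolored")
  switch(sum(x %in% s) + 1, none, x[x %in% s], multiple, multiple)

df$shape <- sapply(mystrings, FUN = getStrings, s = c("circle"))
df$color <- sapply(mystrings, FUN = getStrings, s = c("red", "green", "blue"))

df$shape  # "circle" "circle" NA "circle" NA NA
df$color  # "red" "blue" "Multicolored" "Multicolored" "blue" "Multicolored"
```

Note that switch() here only has four alternatives, so a row matching more than three entries of s would fall off the end and return NULL; with three colors that cannot happen, but a longer color list would need more `multiple` arguments or an explicit if/else.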

answered 2014-08-22T17:12:12.967

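I like @farnsy's answer, but I also want to post my solution, which is broadly similar but doesn't require you to list the colors (it assumes that everything that isn't a shape is a color).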

# Load the data
df <- read.table(textConnection('ids             info
1 101       red;circle
2 103      circle;blue
3 122        red;green
4 102 circle;red;green
5 213             blue
6 170         red;blue'),stringsAsFactors=FALSE)

# Split your column.
split.col <- strsplit(df$info,';')
# Specify which words are considered shapes.
shapes <- c('circle') # Could include more
# Find which rows had shapes.
df$shape <- sapply(split.col, function(x) x[x %in% shapes][1]) # Only select one shape (NA if none)
# The rest must be colours, count them.
num.colours <- sapply(split.col, function(x) length(setdiff(x, shapes)))
df$multicoloured <- num.colours > 1

df
#   ids             info  shape multicoloured
# 1 101       red;circle circle         FALSE
# 2 103      circle;blue circle         FALSE
# 3 122        red;green   <NA>          TRUE
# 4 102 circle;red;green circle          TRUE
# 5 213             blue   <NA>         FALSE
# 6 170         red;blue   <NA>          TRUE
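If a single color column with a "Multicolored" label is preferred over a logical flag, as the question asks, the same idea can be extended. The following is a minimal sketch under the same assumption that every token that is not a known shape is a color; the column name colour and the helper variable colours are names introduced here for illustration.

```r
# Rebuild the sample data from the question.
df <- data.frame(
  ids  = c(101, 103, 122, 102, 213, 170),
  info = c("red;circle", "circle;blue", "red;green",
           "circle;red;green", "blue", "red;blue"),
  stringsAsFactors = FALSE
)

split.col <- strsplit(df$info, ";")
shapes <- c("circle")  # Could include more

# Everything that is not a known shape is treated as a colour.
colours <- lapply(split.col, setdiff, shapes)

# Collapse each row: NA for no colour, the colour itself for one,
# and "Multicolored" for two or more.
df$colour <- vapply(colours, function(x) {
  if (length(x) == 0) NA_character_
  else if (length(x) == 1) x
  else "Multicolored"
}, character(1))

df$colour  # "red" "blue" "Multicolored" "Multicolored" "blue" "Multicolored"
```

vapply() is used instead of sapply() so that a row with an unexpected shape of result raises an error rather than silently changing the column type.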
answered 2014-08-22T17:26:14.850

You could also try:

pat1 <- paste0(c("red", "blue", "green"), collapse = "|")
shape1 <- gsub(paste(pat1, ";", sep = "|"), "", df$info)
shape1[shape1 == ""] <- NA
df[, c("info", "shape")] <- as.data.frame(do.call(rbind,
  Map(`c`,
      lapply(regmatches(df$info, gregexpr(pat1, df$info)),
             function(x) if (length(x) > 1) "Multicolored" else x),
      shape1)), stringsAsFactors = FALSE)

 df
 #  ids         info  shape
 #1 101          red circle
 #2 103         blue circle
 #3 122 Multicolored   <NA>
 #4 102 Multicolored circle
 #5 213         blue   <NA>
 #6 170 Multicolored   <NA>
answered 2014-08-22T18:04:24.630