r - Separating text into variables in R

Question

I have this in one column of a table:

paragemcard-resp+insufcardioresp
dpco+pneumonia
posopperfulceragastrica+ards
pos op hematoma #rim direito expontanea
miopatiaduchenne-erb+insuf.resp
dpco+dhca+#femur
posde#subtroncantéricaesqª+complicepidural
dpco+asma

I would like to separate them like this:

paragemcard-resp                            insufcardioresp
dpco                                        pneumonia
posopperfulceragastrica                     ards
pos op hematoma #rim direito expontanea
miopatiaduchenne-erb                        insuf.resp
dpco                                        dhca                   #femur
posde#subtroncantéricaesqª                  complicepidural
dpco                                        asma

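The problem is that the rows have different lengths. As you can see, row 3 has 2 variables and row 6 has 3.

I want to get these strings into the same columns for further analysis.

Thanks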

score 2 · Accepted Answer

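You can use read.table, but you should first use count.fields or a regular expression of some sort to determine the correct number of columns. Using Robert's "text" sample data: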

Cols <- max(sapply(gregexpr("+", text, fixed = TRUE), length))+1
## Cols <- max(count.fields(textConnection(text), sep = "+", comment.char = ""))

read.table(text = text, comment.char="", header = FALSE, 
           col.names=paste0("V", sequence(Cols)), 
           fill = TRUE, sep = "+")
#                                        V1              V2     V3
# 1                        paragemcard-resp insufcardioresp       
# 2                                    dpco       pneumonia       
# 3                 posopperfulceragastrica            ards       
# 4 pos op hematoma #rim direito expontanea                       
# 5                    miopatiaduchenne-erb      insuf.resp       
# 6                                    dpco            dhca #femur
# 7              posde#subtroncantéricaesqª complicepidural       
# 8                                    dpco            asma

Also possibly useful: the "stringi" library makes counting elements easy (as an alternative to the gregexpr step above).

library(stringi)
Cols <- max(stri_count_fixed(text, "+") + 1)

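Why is the "Cols" step needed? The read.table family decides how many columns to use from either (1) the maximum number of fields detected in the first 5 lines of data or (2) the length of the col.names argument, whichever is larger. In your sample, the row with the most fields is the sixth, so using read.csv or read.table directly would wrap the data incorrectly.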
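To make the wrapping problem concrete, here is a minimal sketch that reads the first six values from the question twice, once without col.names and once with it. The names naive and fixed are only illustrative, and Cols is computed here with strsplit and lengths rather than with the gregexpr call above.

```r
# First six values from the question; only the sixth has three fields.
text <- c("paragemcard-resp+insufcardioresp",
          "dpco+pneumonia",
          "posopperfulceragastrica+ards",
          "pos op hematoma #rim direito expontanea",
          "miopatiaduchenne-erb+insuf.resp",
          "dpco+dhca+#femur")

# Without col.names, the column count is guessed from the first five lines,
# none of which has more than two fields.
naive <- read.table(text = text, sep = "+", fill = TRUE,
                    header = FALSE, comment.char = "")

# Widest row, counted by splitting on a literal "+".
Cols <- max(lengths(strsplit(text, "+", fixed = TRUE)))

# With col.names sized to the widest row, every row fits on its own line.
fixed <- read.table(text = text, sep = "+", fill = TRUE,
                    header = FALSE, comment.char = "",
                    col.names = paste0("V", seq_len(Cols)))

ncol(naive)  # columns guessed from the first five lines
ncol(fixed)  # columns taken from col.names
```

Comparing the two data frames shows that the third field of the sixth row only gets its own column when col.names is supplied.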

score 2 · Accepted Answer

You can use strsplit:

text <- c("paragemcard-resp+insufcardioresp", "dpco+pneumonia",
          "posopperfulceragastrica+ards",
          "pos op hematoma #rim direito expontanea",
          "miopatiaduchenne-erb+insuf.resp", "dpco+dhca+#femur",
          "posde#subtroncantéricaesqª+complicepidural", "dpco+asma")

strings <- strsplit(text, "+", fixed = TRUE)
maxlen <- max(sapply(strings, length))
strings <- lapply(strings, function(s) { length(s) <- maxlen; s })
strings <- data.frame(matrix(unlist(strings), ncol = maxlen, byrow = TRUE))

The result looks like this:

                                          X1              X2     X3
   1                        paragemcard-resp insufcardioresp   <NA>
   2                                    dpco       pneumonia   <NA>
   3                 posopperfulceragastrica            ards   <NA>
   4 pos op hematoma #rim direito expontanea            <NA>   <NA>
   5                    miopatiaduchenne-erb      insuf.resp   <NA>
   6                                    dpco            dhca #femur
   7              posde#subtroncantéricaesqª complicepidural   <NA>
   8                                    dpco            asma   <NA>
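The padding step works because assigning a longer length to a vector fills the new positions with NA. A minimal sketch of that behaviour follows, along with a more compact variant that pads by indexing past the end of each vector. The sample values and the names parts, padded and result are only illustrative.

```r
# Three short examples: two fields, three fields, and no separator at all.
parts <- strsplit(c("dpco+asma", "dpco+dhca+#femur", "no separator"),
                  "+", fixed = TRUE)
maxlen <- max(lengths(parts))

# Assigning a longer length pads the vector with NA.
s <- parts[[1]]
length(s) <- maxlen

# Indexing past the end also yields NA, so "[" pads every element in one call.
# sapply returns one column per input string, hence the transpose.
padded <- t(sapply(parts, `[`, seq_len(maxlen)))
result <- as.data.frame(padded, stringsAsFactors = FALSE)
```

Both variants give one row per input string and NA wherever a row has fewer fields than the widest one.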
