r - Split strings into columns in R where each string has a potentially different number of column entries

Question

I have a data frame of the following form:

pages                         count
[page 1, page 2, page 3]      23
[page 2, page 4]              4
[page 1, page 3, page 4]      12

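What I need to do is split the first column at the commas and create enough new columns to cover the longest sequence. The result should be: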

First Page      Second Page  Third Page     Count
page 1          page 2       page 3         23
page 2          page 4       null           4
page 1          page 3       page 4         12

I'm fine if null is a zero-length string, and I can handle stripping the brackets myself.

score 4 · Accepted Answer

My "splitstackshape" package has functions for this kind of problem. In this case, the relevant function is concat.split, used as follows (with "myDat" from Ricardo's answer):

# Get rid of "[" and "]" from your "pages" variable
myDat$pages <- gsub("\\[|\\]", "", myDat$pages)
# Specify the source data.frame, the variable that needs to be split up
#   and whether to drop the original variable or not
library(splitstackshape)
concat.split(myDat, "pages", ",", drop = TRUE)
#   count pages_1 pages_2 pages_3
# 1    23  page 1  page 2  page 3
# 2     4  page 2  page 4        
# 3    12  page 1  page 3  page 4

score 3 · Accepted Answer

Sample data

myDat <- read.table(text=
  "pages|count
[page 1, page 2, page 3]|23
[page 2, page 4]|4
[page 1, page 3, page 4]|12", header=TRUE, sep="|")

We can pull pages out of myDat to work on it.

# if factors, convert to characters
pages <- as.character(myDat$pages)

# remove brackets.  Note the double-escape's in R
pages <- gsub("(\\[|\\])", "", pages)

# split on comma
pages <- strsplit(pages, ",")

# find the largest element
maxLen <- max(sapply(pages, length))

# fill in any blanks. The t() is to transpose the return from sapply
pages <- 
t(sapply(pages, function(x)
      # append to x, NA's.  Note that if (0 == (maxLen - length(x))), then no NA's are appended 
      c(x, rep(NA, maxLen - length(x)))
  ))

# add column names as necessary
colnames(pages) <- paste(c("First", "Second", "Third"), "Page")

# Put it all back together
data.frame(pages, Count=myDat$count)

Result

> data.frame(pages, Count=myDat$count)
  First.Page Second.Page Third.Page Count
1     page 1      page 2     page 3    23
2     page 2      page 4       <NA>     4
3     page 1      page 3     page 4    12
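The column names above are hard-coded for three columns, and splitting on "," alone leaves a leading space on every piece after the first. A minimal base R sketch that generalizes the same steps to any number of columns might look like the following. The helper name `split_pages` and its `prefix` argument are hypothetical, introduced here only for illustration; the data is the sample from this answer.

```r
# Sample data from the answer, rebuilt so the block is self-contained
myDat <- data.frame(
  pages = c("[page 1, page 2, page 3]", "[page 2, page 4]", "[page 1, page 3, page 4]"),
  count = c(23, 4, 12),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split a bracketed, comma-separated vector into padded columns
split_pages <- function(x, prefix = "pages_") {
  x <- gsub("\\[|\\]", "", as.character(x))   # strip the brackets
  parts <- strsplit(x, ",\\s*")               # split on a comma plus any following whitespace
  maxLen <- max(lengths(parts))               # length of the longest sequence
  padded <- lapply(parts, `length<-`, maxLen) # `length<-` pads shorter vectors with NA
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- paste0(prefix, seq_len(maxLen))
  out
}

pg <- split_pages(myDat$pages)
result <- cbind(pg, Count = myDat$count)

# If zero-length strings are preferred over NA, as the question allows:
pg_blank <- pg
pg_blank[is.na(pg_blank)] <- ""
```

Generating the names from `seq_len(maxLen)` avoids a length mismatch when the longest row has more or fewer than three entries.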

score 2 · Accepted Answer

read.table with fill=TRUE can fill them in. The names(DF2) <- line can be omitted if nice column names are not important. No packages are used.

# test data

Lines <- "pages                         count
[page 1, page 2, page 3]      23
[page 2, page 4]              4
[page 1, page 3, page 4]      12"

# code - replace text=Lines with something like "myfile.dat"

DF <- read.table(text = Lines, skip = 1, sep = "]", as.is = TRUE)
DF2 <- read.table(text = DF[[1]], sep = ",", fill = TRUE, as.is = TRUE)
names(DF2) <- paste0(read.table(text = Lines, nrow = 1, as.is = TRUE)[[1]], seq_along(DF2))
DF2$count <- DF[[2]]
DF2[[1]] <- sub(".", "", DF2[[1]]) # remove [

This gives:

> DF2
  pages1  pages2  pages3 count
1 page 1  page 2  page 3    23
2 page 2  page 4             4
3 page 1  page 3  page 4    12

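Note: This gives column headers of pages1, pages2, etc. If it is important to have the column headers exactly as shown in the question, replace that line with the following, which uses those headers when there are fewer than 20 page columns.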

ord <- c('First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth', 'Seventh',
 'Eighth', 'Ninth', 'Tenth', 'Eleventh', 'Twelfth', 'Thirteenth',
 'Fourteenth', 'Fifteenth', 'Sixteenth', 'Seventeenth', 'Eighteenth',
 'Nineteenth')
ix <- seq_along(DF2)
names(DF2) <- if (ncol(DF2) < 20) paste(ord[ix], "Page") else paste("Page", ix)
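One caveat with the fill = TRUE approach: unless col.names is supplied, read.table decides the number of columns from the first five lines of input, so a longer row that appears later can wrap onto extra rows. A hedged sketch of one way around this is to count the fields first with count.fields and pass that many column names explicitly. The vector `x` below is made-up illustration data, not from the question.

```r
# Made-up data: the longest row is the sixth, beyond the first five lines
x <- c("page 1, page 2", "page 2", "page 1", "page 3", "page 4",
       "page 1, page 2, page 3, page 4")

# Count the comma-separated fields on every line and take the maximum
con <- textConnection(x)
n <- max(count.fields(con, sep = ","))
close(con)

# Supplying n column names makes read.table allocate enough columns up front
DF <- read.table(text = x, sep = ",", fill = TRUE, as.is = TRUE,
                 strip.white = TRUE,
                 col.names = paste0("pages", seq_len(n)))
```

With the sample data in the question, which has only three rows, this extra step is not needed.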
