
I have a data frame of the following form:

pages                         count
[page 1, page 2, page 3]      23
[page 2, page 4]              4
[page 1, page 3, page 4]      12

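What I need to do is split the first column at the commas and create enough new columns to cover the longest sequence. The result should be: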

First Page      Second Page  Third Page     Count
page 1          page 2       page 3         23
page 2          page 4       null           4
page 1          page 3       page 4         12

It is fine if the null is a zero-length string, and I can handle stripping the brackets myself.

3 Answers

My "splitstackshape" package has a function for this kind of problem. In this case the relevant function is concat.split, used as follows (with "myDat" from Ricardo's answer):

# Get rid of "[" and "]" from your "pages" variable
myDat$pages <- gsub("\\[|\\]", "", myDat$pages)
# Specify the source data.frame, the variable that needs to be split up
#   and whether to drop the original variable or not
library(splitstackshape)
concat.split(myDat, "pages", ",", drop = TRUE)
#   count pages_1 pages_2 pages_3
# 1    23  page 1  page 2  page 3
# 2     4  page 2  page 4        
# 3    12  page 1  page 3  page 4
answered 2013-02-25T04:49:37.440

Sample data

myDat <- read.table(text=
  "pages|count
[page 1, page 2, page 3]|23
[page 2, page 4]|4
[page 1, page 3, page 4]|12", header=TRUE, sep="|") 

We can pull pages out of myDat to work on it.

# if factors, convert to characters
# (use the full column name rather than relying on partial matching of $)
pages <- as.character(myDat$pages)

# remove brackets.  Note the double-escape's in R
pages <- gsub("(\\[|\\])", "", pages)

# split on comma, also consuming any spaces after it so that
# the pieces do not keep a leading blank
pages <- strsplit(pages, ",\\s*")

# find the largest element
maxLen <- max(sapply(pages, length))

# fill in any blanks. The t() is to transpose the return from sapply
pages <- 
t(sapply(pages, function(x)
      # append to x, NA's.  Note that if (0 == (maxLen - length(x))), then no NA's are appended 
      c(x, rep(NA, maxLen - length(x)))
  ))

# add column names as necessary
colnames(pages) <- paste(c("First", "Second", "Third"), "Page")

# Put it all back together
data.frame(pages, Count=myDat$count)



Result

> data.frame(pages, Count=myDat$count)
  First.Page Second.Page Third.Page Count
1     page 1      page 2     page 3    23
2     page 2      page 4       <NA>     4
3     page 1      page 3     page 4    12
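
If this comes up more than once, the steps above can be wrapped in a small function. This is only a sketch: split_pages and the page_N column names are made up for illustration and are not part of the original answer, and it assumes R 3.2.0 or later for lengths().

```r
# Hypothetical helper (not from the original answer): split a bracketed,
# comma-separated character vector into a character matrix padded with NA.
split_pages <- function(x) {
  x <- gsub("\\[|\\]", "", as.character(x))     # drop the brackets
  parts <- strsplit(x, ",\\s*")                 # split on comma plus optional spaces
  max_len <- max(lengths(parts))                # length of the longest sequence
  padded <- lapply(parts, `length<-`, max_len)  # `length<-` pads with NA
  out <- do.call(rbind, padded)                 # one row per input element
  colnames(out) <- paste0("page_", seq_len(max_len))
  out
}

# Same sample data as above, built directly as a data frame
myDat <- data.frame(
  pages = c("[page 1, page 2, page 3]",
            "[page 2, page 4]",
            "[page 1, page 3, page 4]"),
  count = c(23, 4, 12),
  stringsAsFactors = FALSE
)

res <- data.frame(split_pages(myDat$pages), count = myDat$count)
print(res)
```

Padding with `length<-` avoids the explicit rep(NA, ...) call, and rbind keeps one row per input element even when the longest sequence has length one.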
answered 2013-02-24T21:08:20.767

read.table with fill=TRUE can fill them in. The names(DF2) <- line can be omitted if nice column names are not important. No packages are used.

# test data

Lines <- "pages                         count
[page 1, page 2, page 3]      23
[page 2, page 4]              4
[page 1, page 3, page 4]      12"

# code - replace text=Lines with something like "myfile.dat"

DF <- read.table(text = Lines, skip = 1, sep = "]", as.is = TRUE)
DF2 <- read.table(text = DF[[1]], sep = ",", fill = TRUE, as.is = TRUE)
names(DF2) <- paste0(read.table(text = Lines, nrow = 1, as.is = TRUE)[[1]], seq_along(DF2))
DF2$count <- DF[[2]]
DF2[[1]] <- sub(".", "", DF2[[1]]) # remove [

This gives:

> DF2
  pages1  pages2  pages3 count
1 page 1  page 2  page 3    23
2 page 2  page 4             4
3 page 1  page 3  page 4    12
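
To see what fill = TRUE does in isolation, here is a minimal sketch on two rows that already have their brackets removed. The names txt and padded are made up for this illustration, and strip.white = TRUE is an addition that the code above does not use; it trims the space left after each comma.

```r
# Two rows of unequal length, brackets already stripped (illustrative data)
txt <- c("page 1, page 2, page 3",
         "page 2, page 4")

# fill = TRUE pads short rows with blank fields instead of raising an error;
# strip.white = TRUE removes leading/trailing blanks from each field
padded <- read.table(text = txt, sep = ",", fill = TRUE,
                     as.is = TRUE, strip.white = TRUE)
print(padded)
```

One caveat: read.table decides the number of columns from the first five lines of input, so if a longer row can appear later in a large file, passing col.names explicitly is safer.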

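Note: This gives column headings pages1, pages2, etc. If it is important to have the column headings exactly as shown in the question, replace the names(DF2) <- line with the following, which uses those headings when there are fewer than 20 page columns.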

ord <- c('First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth', 'Seventh',
  'Eighth', 'Ninth', 'Tenth', 'Eleventh', 'Twelfth', 'Thirteenth',
  'Fourteenth', 'Fifteenth', 'Sixteenth', 'Seventeenth', 'Eighteenth',
  'Nineteenth')
ix <- seq_along(DF2)
names(DF2) <- if (ncol(DF2) < 20) paste(ord[ix], "Page") else paste("Page", ix)
answered 2013-02-25T03:17:35.517