
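Each line of my file has this format:

f1,f2,f3,a1,a2,a3,...,an

Here, f1, f2, and f3 are fixed fields separated by commas, but the fourth field is the whole of a1,a2,...,an, where n can vary.

How can I read this into R and conveniently store these variable-length a1 to an?

Thank you.

My file looks like this: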

3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie
...

3 Answers


It is not clear what you mean by "conveniently store". If a data frame would suit you, try this:

df <- read.table(text = "3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie",
sep = ",", na.strings = "", header = FALSE, fill = TRUE) 

names(df) <- c(paste0("f", 1:3), paste0("a", 1:(ncol(df) - 3))) 

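Edit following @Ananda Mahto's comment.
From ?read.table: "The number of data columns is determined by looking at the first five lines of input". Thus, if the row with the largest number of columns occurs somewhere after the first five lines, the solution above will fail.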

Example of failure

# create a file with max five columns in the first five lines,
# and six columns in the sixth row
cat("3, a, -4, news, finance",
"2, b, 1, politics",
"1, a, 0",
"2, c, 2, book,movie",
"1, a, 0",
"2, c, 2, book, movie, news",
file = "df",
sep = "\n")


# based on the first five rows, read.table determines that number of columns is five,
# and creates an incorrect data frame
df <- read.table(file = "df",
             sep = ",", na.strings = "", header = FALSE, fill = TRUE)
df

Solution

# This can be solved by first counting the maximum number of columns in the text file
# (stored as n_col so the name does not mask base::ncol)
n_col <- max(count.fields("df", sep = ","))

# then this count is used in the col.names argument
# to handle the unknown maximum number of columns after row 5.
df <- read.table(file = "df",
       sep = ",", na.strings = "", header = FALSE, fill = TRUE,
       col.names = paste0("f", seq_len(n_col)))

df

# change column names as above
names(df) <- c(paste0("f", 1:3), paste0("a", 1:(ncol(df) - 3))) 
df
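
If the goal is to keep the variable-length part per row rather than as padded columns, the wide data frame could be collapsed afterwards. The following is only a sketch under assumptions: it reuses the four sample rows from the question via `text =` instead of a file, and the names `n_col`, `a_cols` and `a_list` are illustrative, not part of the answer above.

```r
# Self-contained copy of the sample data from the question
txt <- "3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie"

# Count fields per line first, so a longer row after line five is not truncated
tc <- textConnection(txt)
n_col <- max(count.fields(tc, sep = ","))
close(tc)

df <- read.table(text = txt, sep = ",", header = FALSE, fill = TRUE,
                 na.strings = "", stringsAsFactors = FALSE,
                 col.names = c(paste0("f", 1:3),
                               paste0("a", seq_len(n_col - 3))))

# Names of the padded variable-length columns (a1, a2, ...)
a_cols <- grep("^a[0-9]+$", names(df), value = TRUE)

# Per row, keep only the a-fields that are actually present
a_list <- lapply(seq_len(nrow(df)), function(i) {
  x <- unlist(df[i, a_cols], use.names = FALSE)
  x[!is.na(x) & nzchar(x)]
})

lengths(a_list)  # number of variable fields in each row
```

Each element of `a_list` is then a character vector of whatever length that row had, which may be closer to "conveniently stored" than columns padded with NA.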
answered 2013-08-25T00:27:56.117

A place to start:

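dat <- readLines(file) ## file being the path to your file
## split every line into its comma-separated fields
dat_split <- strsplit(dat, ",", fixed = TRUE)
df <- data.frame(
  f1=sapply(dat_split, "[[", 1),
  f2=sapply(dat_split, "[[", 2),
  f3=sapply(dat_split, "[[", 3),
  ## collapse everything after the third field into one string (NA if absent)
  a=unlist( sapply(dat_split, function(x) {
    if (length(x) <= 3) { 
      return(NA_character_)
    } else {
      return(paste(x[4:length(x)], collapse=","))
    }
  }) )
)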

When you need to pull things out of a, you can split it as necessary.
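
As a rough sketch of that later split, assuming `a` holds the comma-joined string as above. The small data frame here is a hand-typed stand-in built from the question's sample rows so the snippet runs on its own, and `a_list` is just an illustrative name.

```r
# Hypothetical stand-in for the data frame built above
df <- data.frame(
  f1 = c(3, 2, 1, 2),
  f2 = c("a", "b", "a", "c"),
  f3 = c(-4, 1, 0, 2),
  a  = c("news,finance", "politics", NA, "book,movie"),
  stringsAsFactors = FALSE
)

# Split the collapsed field back into one character vector per row;
# rows without any variable fields (NA) become empty vectors
a_list <- lapply(df$a, function(x) {
  if (is.na(x)) character(0) else strsplit(x, ",", fixed = TRUE)[[1]]
})

lengths(a_list)  # number of variable fields per row
```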

answered 2013-08-25T00:07:11.867

#
# Read example data
#
txt <- "3,a,-4,news,finance\n2,b,1,politics\n1,a,0\n2,c,2,book,movie"
tc = textConnection(txt)
lines <- readLines(tc)
close(tc)
#
# Solution
#
lines_split <- strsplit(lines, split=",", fixed=TRUE)
ind <- 1:3
df <- as.data.frame(do.call("rbind", lapply(lines_split, "[", ind)))
df$V4 <- lapply(lines_split, "[", -ind) 
#
# Output
#
  V1 V2 V3            V4
1  3  a -4 news, finance
2  2  b  1      politics
3  1  a  0              
4  2  c  2   book, movie
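
A short sketch of how such a list column might be used afterwards. It rebuilds the same objects from the sample lines so it is self-contained; `n_extra` and `has_news` are illustrative names, and the "news" lookup is only a made-up query.

```r
lines <- c("3,a,-4,news,finance", "2,b,1,politics", "1,a,0", "2,c,2,book,movie")
lines_split <- strsplit(lines, split = ",", fixed = TRUE)
ind <- 1:3
df <- as.data.frame(do.call("rbind", lapply(lines_split, "[", ind)),
                    stringsAsFactors = FALSE)
df$V4 <- lapply(lines_split, "[", -ind)

# Number of variable fields in each row
n_extra <- lengths(df$V4)

# Rows whose variable fields include "news" (hypothetical query)
has_news <- vapply(df$V4, function(x) "news" %in% x, logical(1))
df[has_news, c("V1", "V2", "V3")]
```

Note that V1 to V3 are character columns here because they come from `strsplit`; applying `type.convert(..., as.is = TRUE)` to each of them is one option if numeric types are needed.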
answered 2013-08-25T07:18:20.617