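I have a file in which every line has this format:
f1,f2,f3,a1,a2,a3,...,an
Here f1, f2, and f3 are fixed fields separated by commas, but the fourth part, f4, stands for the whole group a1,a2,...,an, whose length n can vary from line to line.
How can I read this into R and conveniently store the variable-length fields a1 to an?
Thank you.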
My file looks like this:
3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie
...
It is not clear what you mean by "convenient to store". If a data frame suits you, try this:
df <- read.table(text = "3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie",
sep = ",", na.strings = "", header = FALSE, fill = TRUE)
names(df) <- c(paste0("f", 1:3), paste0("a", 1:(ncol(df) - 3)))
Edit following @Ananda Mahto's comment.
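From ?read.table: "The number of data columns is determined by looking at the first five lines of input". So if the line with the largest number of columns occurs somewhere after the first five lines, the solution above will fail.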
Failing example
# create a file with max five columns in the first five lines,
# and six columns in the sixth row
cat("3, a, -4, news, finance",
"2, b, 1, politics",
"1, a, 0",
"2, c, 2, book,movie",
"1, a, 0",
"2, c, 2, book, movie, news",
file = "df",
sep = "\n")
# based on the first five rows, read.table determines that number of columns is five,
# and creates an incorrect data frame
df <- read.table(file = "df",
sep = ",", na.strings = "", header = FALSE, fill = TRUE)
df
Solution
# This can be solved by first counting the maximum number of columns in the text file
ncol <- max(count.fields("df", sep = ","))
# then this count is used in the col.names argument
# to handle the unknown maximum number of columns after row 5.
df <- read.table(file = "df",
sep = ",", na.strings = "", header = FALSE, fill = TRUE,
col.names = paste0("f", seq_len(ncol)))
df
# change column names as above
names(df) <- c(paste0("f", 1:3), paste0("a", 1:(ncol(df) - 3)))
df
A place to start:
dat <- readLines(file) ## file being the path to your file
dat_split <- strsplit(dat, split = ",", fixed = TRUE) ## one character vector per line
df <- data.frame(
f1=sapply(dat_split, "[[", 1),
f2=sapply(dat_split, "[[", 2),
f3=sapply(dat_split, "[[", 3),
a=unlist( sapply(dat_split, function(x) {
if (length(x) <= 3) {
return(NA)
} else {
return(paste(x[4:length(x)], collapse=","))
}
}) )
)
When you need to pull things out of a, you can split it as needed.
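As a minimal sketch of that later split, assuming the combined column holds comma-separated strings with NA for rows that have no extra fields. The data frame here is a hypothetical stand-in typed in by hand from the sample rows, and the names a_list and n_a are illustrative only:

```r
# Hypothetical stand-in for the data frame built above (sample rows typed by hand)
df <- data.frame(
  f1 = c("3", "2", "1", "2"),
  f2 = c("a", "b", "a", "c"),
  f3 = c("-4", "1", "0", "2"),
  a  = c("news,finance", "politics", NA, "book,movie"),
  stringsAsFactors = FALSE
)

# Split the combined column back into one character vector per row;
# an NA entry stays a single NA in the resulting list
a_list <- strsplit(df$a, split = ",", fixed = TRUE)

# Number of variable fields per row, counting NA rows as zero
n_a <- ifelse(is.na(df$a), 0L, lengths(a_list))
```

Keeping the split result in a separate list leaves the data frame itself rectangular, which may be easier to write back to disk.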
#
# Read example data
#
txt <- "3,a,-4,news,finance\n2,b,1,politics\n1,a,0\n2,c,2,book,movie"
tc = textConnection(txt)
lines <- readLines(tc)
close(tc)
#
# Solution
#
lines_split <- strsplit(lines, split=",", fixed=TRUE)
ind <- 1:3
df <- as.data.frame(do.call("rbind", lapply(lines_split, "[", ind)))
df$V4 <- lapply(lines_split, "[", -ind)
#
# Output
#
V1 V2 V3 V4
1 3 a -4 news, finance
2 2 b 1 politics
3 1 a 0
4 2 c 2 book, movie
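One possible way to work with the list column afterwards, sketched under the assumption that a long format with one row per variable field is acceptable. The names n_extra and long are illustrative and not part of the answer above:

```r
# Rebuild the example so this snippet runs on its own
lines <- c("3,a,-4,news,finance", "2,b,1,politics", "1,a,0", "2,c,2,book,movie")
lines_split <- strsplit(lines, split = ",", fixed = TRUE)
ind <- 1:3
df <- as.data.frame(do.call("rbind", lapply(lines_split, "[", ind)),
                    stringsAsFactors = FALSE)
df$V4 <- lapply(lines_split, "[", -ind)

# Count the variable-length fields in each row
n_extra <- lengths(df$V4)

# Long format: repeat the fixed fields once per variable field.
# Rows without any variable field (here the third) drop out.
long <- data.frame(df[rep(seq_len(nrow(df)), n_extra), 1:3],
                   a = unlist(df$V4),
                   row.names = NULL,
                   stringsAsFactors = FALSE)
```

Whether to keep the list column or reshape to long format depends on what is done next: the list column keeps one row per input line, while the long format is easier to filter or tabulate by field value.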