53

我有一个稀疏数据集,其列数的长度不同,采用 csv 格式。这是文件文本的示例。

12223, University
12227, bridge, Sky
12828, Sunset
13801, Ground
14853, Tranceamerica
14854, San Francisco
15595, shibuya, Shrine
16126, fog, San Francisco
16520, California, ocean, summer, golden gate, beach, San Francisco

当我使用

read.csv("data.txt", header = F)

R 会将数据集解释为具有 3 列,因为大小是从前 5 行确定的。无论如何强制 r 将数据放在更多列中?

4

5 回答 5

71

Deep in the ?read.table documentation there is the following:

The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).

Therefore, let's define col.names to be length X (where X is the max number of fields in your dataset), and set fill = TRUE:

dat <- textConnection("12223, University
12227, bridge, Sky
12828, Sunset
13801, Ground
14853, Tranceamerica
14854, San Francisco
15595, shibuya, Shrine
16126, fog, San Francisco
16520, California, ocean, summer, golden gate, beach, San Francisco")

read.table(dat, header = FALSE, sep = ",", 
  col.names = paste0("V",seq_len(7)), fill = TRUE)

     V1             V2             V3      V4           V5     V6             V7
1 12223     University                                                          
2 12227         bridge            Sky                                           
3 12828         Sunset                                                          
4 13801         Ground                                                          
5 14853  Tranceamerica                                                          
6 14854  San Francisco                                                          
7 15595        shibuya         Shrine                                           
8 16126            fog  San Francisco                                           
9 16520     California          ocean  summer  golden gate  beach  San Francisco

If the maximum number of fields is unknown, you can use the nifty utility function count.fields (which I found in the read.table example code):

count.fields(dat, sep = ',')
# [1] 2 3 2 2 2 2 3 3 7
max(count.fields(dat, sep = ','))
# [1] 7

Possibly helpful related reading: Only read limited number of columns in R

于 2013-09-20T17:52:01.693 回答
7

您可以像这样读取数据:

dat <- textConnection("12223, University
12227, bridge, Sky
12828, Sunset
13801, Ground
14853, Tranceamerica
14854, San Francisco
15595, shibuya, Shrine
16126, fog, San Francisco
16520, California, ocean, summer, golden gate, beach, San Francisco")

dat <- readLines(dat)
dat <- strsplit(dat, ",")

这会产生一个列表。

于 2013-09-20T17:39:09.210 回答
3

这似乎确实有效(遵循@BlueMagister 的建议):

tt <- read.table("~/Downloads/tmp.csv", fill=TRUE, header=FALSE, 
          sep=",", colClasses=c("numeric", rep("character", 6)))
names(tt) <- paste("V", 1:7, sep="")

     V1             V2             V3      V4           V5     V6             V7
1 12223     University                                                          
2 12227         bridge            Sky                                           
3 12828         Sunset                                                          
4 13801         Ground                                                          
5 14853  Tranceamerica                                                          
6 14854  San Francisco                                                          
7 15595        shibuya         Shrine                                           
8 16126            fog  San Francisco                                           
9 16520     California          ocean  summer  golden gate  beach  San Francisco
于 2013-09-20T17:52:25.363 回答
2

我遇到了类似的挑战,但count.fields来自 Blue Magister 的回答不起作用,可能是因为字段内的逗号与sep=",". 此外,列数因文件而异。所以我只是定义了多余col.namesread.table(在我的情况下 100 就足够了),然后我过去常常which(!is.na())摆脱多余的列。

dat <- read.table("path/to/file.csv", col.names = paste("V",1:100), fill = T, sep = ",")
dat <- dat[,which(!is.na(dat[1,]))]
于 2020-02-27T09:11:33.717 回答
1

试试这个,它更有活力..

readVariableWidthFile <- function(filePath){
  con <-file(filePath)
  lines<- readLines(con)
  close(con)
  slines <- strsplit(lines,",")
  colCount <- max(unlist(lapply(slines, length)))

  FileContent <- read.csv(filePath,
                        header = FALSE,
                        col.names = paste0("V",seq_len(colCount)),
                        fill = TRUE)
  return(FileContent)
}
于 2020-01-21T17:21:17.550 回答