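I have a text file with a few million rows. Each row should have 10 variables. The file is comma-delimited, but every so often a variable itself contains a comma (for example, in row 3, "BLDG #5,#104" should be a single variable), so when I import it with read.csv() everything gets messed up. Here is an example: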

1,09/29/1951,F,N,22 MAIN STREET AVE,APT 3,SEATTLE,WA,98102-3053,00920670025
2,09/28/1950,F,N,13354 A STREET,APT 2,BURLINGTON,VT,10101,02510070025
3,10/18/1949,M,N,600 CENTRE STREET,BLDG #5,#104,SPRINGFIELD,IL,01010,02141650025 
4,10/18/1955,M,N,5 KELLY AVENUE,,CITY,XI,10101,02141650025

Any suggestions on how best to import this data?

2 Answers

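This uses count.fields to identify the lines that have extra fields, then uses sub to remove the comma between the two octothorpes ("#"). Given the size of the dataset, I'm guessing there will be more problems, and you should find count.fields very useful for tracking them down (see below):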

> Lines
[1] "1,09/29/1951,F,N,22 MAIN STREET AVE,APT 3,SEATTLE,WA,98102-3053,00920670025\n2,09/28/1950,F,N,13354 A STREET,APT 2,BURLINGTON,VT,10101,02510070025\n3,10/18/1949,M,N,600 CENTRE STREET,BLDG #5,#104,SPRINGFIELD,IL,01010,02141650025\n4,10/18/1955,M,N,5 KELLY AVENUE,,CITY,XI,10101,02141650025"
> myLines <- readLines(textConnection(Lines))
> myLines
[1] "1,09/29/1951,F,N,22 MAIN STREET AVE,APT 3,SEATTLE,WA,98102-3053,00920670025"     
[2] "2,09/28/1950,F,N,13354 A STREET,APT 2,BURLINGTON,VT,10101,02510070025"           
[3] "3,10/18/1949,M,N,600 CENTRE STREET,BLDG #5,#104,SPRINGFIELD,IL,01010,02141650025"
[4] "4,10/18/1955,M,N,5 KELLY AVENUE,,CITY,XI,10101,02141650025"                      
> myLines[ count.fields(textConnection(myLines),sep=",", comment.char="") >10] <- sub("(#\\d+)(\\,)#", "\\1 &", myLines[ count.fields(textConnection(myLines),sep=",", comment.char="") >10])
> myLines
[1] "1,09/29/1951,F,N,22 MAIN STREET AVE,APT 3,SEATTLE,WA,98102-3053,00920670025"     
[2] "2,09/28/1950,F,N,13354 A STREET,APT 2,BURLINGTON,VT,10101,02510070025"           
[3] "3,10/18/1949,M,N,600 CENTRE STREET,BLDG #5 &104,SPRINGFIELD,IL,01010,02141650025"
[4] "4,10/18/1955,M,N,5 KELLY AVENUE,,CITY,XI,10101,02141650025"                      
> read.csv(text=myLines, comment.char="",header=FALSE)
  V1         V2 V3 V4                 V5           V6          V7 V8         V9        V10
1  1 09/29/1951  F  N 22 MAIN STREET AVE        APT 3     SEATTLE WA 98102-3053  920670025
2  2 09/28/1950  F  N     13354 A STREET        APT 2  BURLINGTON VT      10101 2510070025
3  3 10/18/1949  M  N  600 CENTRE STREET BLDG #5 &104 SPRINGFIELD IL      01010 2141650025
4  4 10/18/1955  M  N     5 KELLY AVENUE                     CITY XI      10101 2141650025

I would suggest using table(count.fields(filename, sep=",", comment.char="")) to get a better estimate of the extent of the problem. I suspect you have only found the first of many.
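As a rough sketch of that diagnostic step, the snippet below runs count.fields on the four sample rows from the question and lists the lines whose field count is not 10. The names `lines`, `n_fields`, `field_counts` and `bad_lines` are illustrative only; on the real data you would pass the file path to count.fields instead of a text connection.

```r
# Sample rows from the question; with a real file, pass its path to count.fields().
lines <- c(
  "1,09/29/1951,F,N,22 MAIN STREET AVE,APT 3,SEATTLE,WA,98102-3053,00920670025",
  "2,09/28/1950,F,N,13354 A STREET,APT 2,BURLINGTON,VT,10101,02510070025",
  "3,10/18/1949,M,N,600 CENTRE STREET,BLDG #5,#104,SPRINGFIELD,IL,01010,02141650025",
  "4,10/18/1955,M,N,5 KELLY AVENUE,,CITY,XI,10101,02141650025"
)

# comment.char = "" matters here: by default "#" starts a comment,
# which would cut the "BLDG #5" line short.
con <- textConnection(lines)
n_fields <- count.fields(con, sep = ",", comment.char = "")
close(con)

# How many lines have each field count?
field_counts <- table(n_fields)
print(field_counts)

# Which lines do not have the expected 10 fields?
bad_lines <- which(n_fields != 10)
print(lines[bad_lines])
```

Inspecting `lines[bad_lines]` on the full file should show whether the extra commas always sit in the same field or whether several different repairs are needed.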

answered 2015-09-10T01:17:16.530

Why not just use what read.csv already offers to your advantage?

dat <- read.csv(text="1,09/29/1951,F,N,22 MAIN STREET AVE,APT 3,SEATTLE,WA,98102-3053,00920670025
2,09/28/1950,F,N,13354 A STREET,APT 2,BURLINGTON,VT,10101,02510070025
3,10/18/1949,M,N,600 CENTRE STREET,BLDG #5,#104,SPRINGFIELD,IL,01010,02141650025 
4,10/18/1955,M,N,5 KELLY AVENUE,,CITY,XI,10101,02141650025", 
           header=FALSE, stringsAsFactors=FALSE, comment.char="", fill=TRUE)

for (i in 1:nrow(dat)) {
  if (is.na(dat[i, "V11"])) {
    dat[i, 8:11] <- dat[i, 7:10]
    dat[i, "V7"] <- NA
  }
}

dat

##   V1         V2 V3 V4                 V5      V6   V7          V8 V9        V10        V11
## 1  1 09/29/1951  F  N 22 MAIN STREET AVE   APT 3 <NA>     SEATTLE WA 98102-3053  920670025
## 2  2 09/28/1950  F  N     13354 A STREET   APT 2 <NA>  BURLINGTON VT      10101 2510070025
## 3  3 10/18/1949  M  N  600 CENTRE STREET BLDG #5 #104 SPRINGFIELD IL       1010 2141650025
## 4  4 10/18/1955  M  N     5 KELLY AVENUE         <NA>        CITY XI      10101 2141650025

If you want to combine V6 and V7 afterwards, that is completely doable.
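One possible way to do that merge is sketched below. To keep the snippet self-contained, `dat` here is a small hypothetical stand-in holding only the V6 and V7 columns as they look after the loop; the separator "," is an assumption chosen to restore the original "BLDG #5,#104" text.

```r
# Hypothetical stand-in for the V6 and V7 columns of the repaired data frame.
dat <- data.frame(
  V6 = c("APT 3", "APT 2", "BLDG #5", ""),
  V7 = c(NA, NA, "#104", NA),
  stringsAsFactors = FALSE
)

# Append V7 to V6 only on rows where V7 actually holds a value.
has_extra <- !is.na(dat$V7)
dat$V6[has_extra] <- paste(dat$V6[has_extra], dat$V7[has_extra], sep = ",")

# V7 is no longer needed once its content has been folded into V6.
dat$V7 <- NULL
print(dat)
```

Working on whole columns with a logical index avoids a row-by-row loop, which should matter on a file with millions of rows.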

Doing this with a data.table flair would be more efficient (i.e., if someone posts an fread + pure data.table solution, it should get the "answer" checkmark).

answered 2015-09-10T01:05:05.933