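This uses count.fields to identify the lines that have extra fields, then uses sub
to remove the comma between the two octothorpes ("#"). Given the size of the dataset, I would guess there are more problems, and you should find count.fields
very useful for tracking them down (see below):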
> Lines
[1] "1,09/29/1951,F,N,22 MAIN STREET AVE,APT 3,SEATTLE,WA,98102-3053,00920670025\n2,09/28/1950,F,N,13354 A STREET,APT 2,BURLINGTON,VT,10101,02510070025\n3,10/18/1949,M,N,600 CENTRE STREET,BLDG #5,#104,SPRINGFIELD,IL,01010,02141650025\n4,10/18/1955,M,N,5 KELLY AVENUE,,CITY,XI,10101,02141650025"
> myLines <- readLines(textConnection(Lines))
> myLines
[1] "1,09/29/1951,F,N,22 MAIN STREET AVE,APT 3,SEATTLE,WA,98102-3053,00920670025"
[2] "2,09/28/1950,F,N,13354 A STREET,APT 2,BURLINGTON,VT,10101,02510070025"
[3] "3,10/18/1949,M,N,600 CENTRE STREET,BLDG #5,#104,SPRINGFIELD,IL,01010,02141650025"
[4] "4,10/18/1955,M,N,5 KELLY AVENUE,,CITY,XI,10101,02141650025"
> bad <- count.fields(textConnection(myLines), sep=",", comment.char="") > 10
> myLines[bad] <- sub("(#\\d+)(\\,)#", "\\1 &", myLines[bad])
> myLines
[1] "1,09/29/1951,F,N,22 MAIN STREET AVE,APT 3,SEATTLE,WA,98102-3053,00920670025"
[2] "2,09/28/1950,F,N,13354 A STREET,APT 2,BURLINGTON,VT,10101,02510070025"
[3] "3,10/18/1949,M,N,600 CENTRE STREET,BLDG #5 &104,SPRINGFIELD,IL,01010,02141650025"
[4] "4,10/18/1955,M,N,5 KELLY AVENUE,,CITY,XI,10101,02141650025"
> read.csv(text=myLines, comment.char="",header=FALSE)
  V1         V2 V3 V4                 V5           V6          V7 V8         V9        V10
1  1 09/29/1951  F  N 22 MAIN STREET AVE        APT 3     SEATTLE WA 98102-3053  920670025
2  2 09/28/1950  F  N     13354 A STREET        APT 2  BURLINGTON VT      10101 2510070025
3  3 10/18/1949  M  N  600 CENTRE STREET BLDG #5 &104 SPRINGFIELD IL      01010 2141650025
4  4 10/18/1955  M  N     5 KELLY AVENUE                     CITY XI      10101 2141650025
I suggest running table(count.fields(filename, sep=",", comment.char=""))
to get a better estimate of how widespread the problem is. I suspect you have only found the first of many.
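A minimal sketch of that check, reusing the four sample lines from above; for a real file you would pass its path as the first argument instead of a connection. The names nFields, fieldTable and badLines are illustrative only, and the expected count of 10 fields is an assumption taken from this sample.

```r
# Sample lines from the example above (line 3 has the stray comma).
myLines <- c(
  "1,09/29/1951,F,N,22 MAIN STREET AVE,APT 3,SEATTLE,WA,98102-3053,00920670025",
  "2,09/28/1950,F,N,13354 A STREET,APT 2,BURLINGTON,VT,10101,02510070025",
  "3,10/18/1949,M,N,600 CENTRE STREET,BLDG #5,#104,SPRINGFIELD,IL,01010,02141650025",
  "4,10/18/1955,M,N,5 KELLY AVENUE,,CITY,XI,10101,02141650025"
)

# Count comma-separated fields on every line. comment.char="" stops "#"
# from being treated as the start of a comment; blank.lines.skip=FALSE
# keeps the result aligned with the line numbers.
con <- textConnection(myLines)
nFields <- count.fields(con, sep = ",", comment.char = "",
                        blank.lines.skip = FALSE)
close(con)

# How many lines have each field count.
fieldTable <- table(nFields)
print(fieldTable)

# Line numbers whose field count differs from the expected 10 (assumed here).
badLines <- which(nFields != 10)
print(badLines)
```

Any field count other than the expected one in the table marks a group of lines worth inspecting, and badLines tells you where to look.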