I'm running the following code and getting this error.

> rat <- scan("sortedratings.csv",nlines=760,sep=",",what=rat.cols,multi.line=FALSE);                                                       
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :                                                                         
  line 755 did not have 8 elements                                                                                                                    
>    

These are the lines around the one causing all the trouble:

ubuntu@ip-10-28-6-239:/data/csv$ sed -n "750,760p" sortedratings.csv                                                                                  
"281656475","2.5.0","Jul 17, 2011","","","KK9876",4,0                                                                                                 
"281656475","2.5.0","Jul 17, 2011","","","Lyteskin45",4,0                                                                                             
"281656475","2.5.0","Jul 17, 2011","","","Mrs. Felton",5,0                                                                                            
"281656475","2.5.0","Jul 17, 2011","","","Nick Bartoszek",4,0                                                                                         
"281656475","2.5.0","Jul 17,2011","","","SANFRANPSYCHO",5,0                                                                                          
"281656475","2.5.0","Jul 17, 2011","","","Wxcgfduytrewjgf@!?$(:@&amp;&amp;$&amp;@\"",5,0                                                              
"281656475","2.5.0","Jul 18, 2011","","","Downs58",5,0                                                                                                
"281656475","2.5.0","Jul 18, 2011","","","kitty1019",5,0                                                                                              
"281656475","2.5.0","Jul 18, 2011","","","Rj&amp;e",4,0                                                                                               
"281656475","2.5.0","Jul 18, 2011","","","Robin Kinzer",5,0                                                                                           
"281656475","2.5.0","Jul 18, 2011","","","Roderick Palmer",5,0                                                                                        
ubuntu@ip-10-28-6-239:/data/csv$ s

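I've tried different fixes, but I can't figure out the right one. Any ideas?

I have no problem dropping the backslash, or replacing it with nothing or anything else.

Oh, I forgot to add: the file is 1.4GB, so I can't read the whole file in, or just run a replacement on it with sed, because it's too large for my system.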

1 Answer

From the "Details" section of ?scan (scan is what read.table, read.csv, etc. use under the hood):

 If ‘sep’ is non-default, the fields may be quoted in the style of
 ‘.csv’ files where separators inside quotes (‘''’ or ‘""’) are
 ignored and quotes may be put inside strings by doubling them.
 However, if ‘sep = "\n"’ it is assumed by default that one wants
 to read entire lines verbatim.

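So it looks like the "escaped" quote \" in that line is what's causing the trouble: R expects an escaped quote inside a CSV field to be a doubled quote "", not a backslash-quote \". On line 755 the backslash is taken literally, the following "" is read as an embedded quote, and the quoted field never closes, so the line ends up with fewer than 8 elements.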
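As a small illustration of the rule quoted from ?scan, here is a minimal sketch using a made-up one-line input (the string `doubled` is an assumption for demonstration, not a line from the actual file). It shows that a doubled quote inside a quoted field comes back as a single literal quote character:

```r
# Made-up record: the second field contains quotes escaped by doubling them.
doubled <- '"281656475","say ""hi""",5'

# sep = "," is non-default, so scan() applies the .csv quoting rules.
fields <- scan(text = doubled, what = list("", "", 0L),
               sep = ",", quiet = TRUE)

# The second field now holds: say "hi"
cat(fields[[2]], "\n")
```

A backslash gets no such treatment by default, which is why \" does not close or escape anything.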

I think your best option is to replace the backslash-escaped quotes with doubled quotes, either with Linux tools or in R (R example below):

txt <- readLines("tmp.txt")
txt <- gsub('\\\\"', '""', txt) # note the double backslashing: the regex needs \\
                                # to match one literal backslash, and each of those
                                # backslashes must itself be escaped in an R string
# if you `cat(txt, sep='\n')` you will see that the `\"` is now `""`

Then you can use read.csv or scan as before (note the textConnection(txt), which turns the character vector into a file-like object that scan can read from):

read.csv(textConnection(txt), ...)
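Put together, the whole round trip might look like the sketch below. The temporary file and its three records are invented for illustration (only the column layout mirrors the question), and `fixed = TRUE` is used here as an alternative to the regex form so that only one level of backslash escaping is needed:

```r
# Hypothetical sample file; the second record has a backslash-escaped quote.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  '"281656475","2.5.0","Jul 17, 2011","","","SANFRANPSYCHO",5,0',
  '"281656475","2.5.0","Jul 17, 2011","","","odd@name\\"",5,0',
  '"281656475","2.5.0","Jul 18, 2011","","","Downs58",5,0'
), tmp)

txt <- readLines(tmp)
txt <- gsub('\\"', '""', txt, fixed = TRUE)  # literal \" becomes ""

# textConnection() returns an already-open connection, so close it ourselves.
con <- textConnection(txt)
ratings <- read.csv(con, header = FALSE, stringsAsFactors = FALSE)
close(con)

ratings$V6  # the user-name column; the second entry ends in a literal quote
```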

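Edit/Addition

In reply to the OP's comment: the file is 1.4GB and reading it all into R at once is difficult, so how do you sanitize it?

Option 1

You seem to be on Linux, so you could use sed, which processes the file line by line rather than loading it all into memory: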

sed -i -e 's!\\"!""!g' myfile.txt

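(Depending on where your data comes from, you may be able to tweak the program that produces it so that it writes the format you need in the first place, but this is not always possible.)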

Option 2

If you'd rather not use Linux tools or want a pure R solution, use the n argument to readLines to read only a few lines at a time:

# create the file connection and open it for reading, see ?file
f <- file('tmp.txt', open = 'r')

# now read in 100 lines at a time, say
repeat {
    txt <- readLines(f, n = 100)
    if (length(txt) == 0) break  # no lines left, we are done
    # now do the sanitizing/coercing into a data frame, store.
    # ...
}
close(f)
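One way to fill in the "sanitize and store" step is to stream the cleaned lines straight into a second file, so that only one chunk is ever held in memory. This is only a sketch: `sanitize_quotes`, `chunk_size`, and the temporary files are hypothetical names and data introduced here, not part of the original problem.

```r
# Copy `infile` to `outfile` chunk by chunk, turning every \" into "".
sanitize_quotes <- function(infile, outfile, chunk_size = 10000L) {
  fin  <- file(infile,  open = "r")
  fout <- file(outfile, open = "w")
  on.exit({ close(fin); close(fout) })  # close both even if an error occurs
  repeat {
    txt <- readLines(fin, n = chunk_size)
    if (length(txt) == 0L) break        # end of file reached
    writeLines(gsub('\\"', '""', txt, fixed = TRUE), fout)
  }
  invisible(outfile)
}

# Tiny demonstration on made-up data, using a chunk size of 2 lines.
src <- tempfile(fileext = ".csv")
dst <- tempfile(fileext = ".csv")
writeLines(c('"a","x\\"",1', '"b","y",2', '"c","z\\"",3'), src)
sanitize_quotes(src, dst, chunk_size = 2L)
cleaned <- read.csv(dst, header = FALSE, stringsAsFactors = FALSE)
```

The cleaned file can then be read with the original scan call, again in pieces if needed.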
Answered 2012-11-20T02:02:34.440