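I am trying to read this file (3.8 MB) using its fixed-width structure, as described in the following link.
This command:
a <- read.fwf('~/ccsl.txt',c(2,30,6,2,30,8,10,11,6,8))
produces an error:
line 37 did not have 10 elements
After reproducing the problem with different values of the skip option, I found that the lines causing the problem all contain the "#" symbol.
Is there any way to get around it?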
As @jverzani already commented, this problem is probably caused by the fact that the # sign is often used as a character to signal a comment. Setting the comment.char input argument of read.fwf to something other than # could fix the problem. I'll leave my answer below as a more general case that you can use on any character that causes problems (e.g. the 's in the Dutch city name 's Gravenhage).
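A minimal sketch of the comment.char suggestion, using a made-up two-line file rather than the asker's ccsl.txt. The sample data, the widths and the variable names are assumptions for illustration only.

```r
# Hypothetical sample data: two fixed-width fields (widths 4 and 6).
lines <- c("AB  123456",
           "C#  654321")
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# With the default comment.char = "#", the text after the # on the second
# line is treated as a comment, so that line can come up short.
# tryCatch keeps the script running whatever the outcome is.
default_result <- tryCatch(
  read.fwf(tmp, widths = c(4, 6)),
  error = function(e) conditionMessage(e)
)

# Switching comment handling off makes # an ordinary character.
# The extra arguments are passed on to read.table.
fixed <- read.fwf(tmp, widths = c(4, 6), comment.char = "",
                  strip.white = TRUE, stringsAsFactors = FALSE)
print(fixed)
```

strip.white = TRUE is only there to drop the padding spaces from the character column; it is not needed for the fix itself.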
I've had this problem occur with other symbols. The approach I took was to simply replace the # by either nothing, or by a character which does not generate the error. In my case it was no problem to simply replace the character, but this might not be possible in your case.
So my approach would be to delete the symbol that generates the error, or replace it by another character. This can be done using a text editor (find and replace), in an R script, or using Linux tools such as grep and sed. If you want to do this in an R script, use scan or readLines to read the lines. Once the text is in memory, you can use sub (or gsub, which replaces every occurrence rather than only the first) to replace the character.
If you cannot replace the character permanently, I would try the following approach: replace the character by a placeholder character that does not generate an error, read the result into R using read.fwf, and finally replace the placeholder by the # character again.
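The replace, read, and restore steps could look roughly like the sketch below. The sample data, the widths and the choice of "@" as placeholder are assumptions; any single character that does not occur in the file would do. A same-length placeholder is used on purpose, because deleting the character would shift every later column in a fixed-width file.

```r
# Hypothetical sample data: two fixed-width fields (widths 4 and 6).
src <- tempfile(fileext = ".txt")
writeLines(c("AB  123456",
             "C#  654321"), src)

raw <- readLines(src)

# Assumed placeholder; check that it is not already used in the file.
placeholder <- "@"
stopifnot(!any(grepl(placeholder, raw, fixed = TRUE)))

# Replace every # with the placeholder (gsub, not sub, so that lines
# with several # characters are fully handled).
cleaned <- gsub("#", placeholder, raw, fixed = TRUE)

# Write the cleaned text to a second temporary file and read that.
tmp <- tempfile(fileext = ".txt")
writeLines(cleaned, tmp)
dat <- read.fwf(tmp, widths = c(4, 6),
                strip.white = TRUE, stringsAsFactors = FALSE)

# Put the # back in every character column.
is_chr <- vapply(dat, is.character, logical(1))
dat[is_chr] <- lapply(dat[is_chr], function(col) {
  gsub(placeholder, "#", col, fixed = TRUE)
})
print(dat)
```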
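Following up on the answer above: to have all characters read literally, use both comment.char="" and quote="" in the call to read.fwf (the latter handles the single-quote problem in @PaulHiemstra's Dutch proper noun). This is documented in ?read.table.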
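As a small illustration of combining the two arguments, here is a sketch on invented data that contains both an apostrophe and a # sign. The city list, widths and column names are assumptions, not taken from the asker's file.

```r
# Hypothetical data: a 15-character city field followed by a 2-character code.
# sprintf pads the city names so that the field widths are exact.
lines <- sprintf("%-15s%s",
                 c("'s Gravenhage", "Amsterdam"),
                 c("#1", "#2"))
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# comment.char = "" stops # from starting a comment;
# quote = "" stops ' and " from being treated as quoting characters.
dat <- read.fwf(tmp, widths = c(15, 2),
                col.names = c("city", "code"),
                comment.char = "", quote = "",
                strip.white = TRUE, stringsAsFactors = FALSE)
print(dat)
```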