
I am learning R by working with Lending Club's historical loan dataset. A representative subset of the data is here: https://gist.github.com/adetch/11b1c2b6eac0b6add23f

The problematic command:

problem <- read.csv("test.csv",na.strings=c("","<NA>"),colClasses=c("mths_since_last_major_derog"="integer"))

The error I get:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
scan() expected 'an integer', got '""'

I run into a similar problem with the following command:

problem <- read.csv("test.csv",na.strings=c("","<NA>"),colClasses=c("id"="integer"))

The error in this case:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
scan() expected 'an integer', got '"1077501"'

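So it seems to me that

  • R's integer class is not compatible with quoted values
  • and perhaps the na.strings conversion runs after the column classes are scanned, so the integer check fails on the empty string.

However, other columns whose values are wrapped in "", such as member_id and loan_amnt, are coerced to integer without complaint (and without any special intervention via colClasses!).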

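The immediate question:

  • How do I convert these fields (id, mths_since_last_major_derog) to integer rather than factor? (Note that there are many other fields that should be converted to factor.)

More importantly:

  • Where does my mental model of R's classes, class coercion, read.table/read.csv, etc. break down?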

1 Answer


Don't use colClasses. If you really need to coerce, read the file as-is and then use problem$id <- as.integer(problem$id)

But in this case (your test.csv) I think R is pretty good at loading the data.
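As a minimal sketch of the "read as-is, then coerce" approach: the inline `csv_text` below is a hypothetical stand-in for test.csv, using values from the example table in this answer and the quoting described in the question. `stringsAsFactors = TRUE` is set explicitly here as an assumption, to mimic the defaults of R versions before 4.0.

```r
# Hypothetical stand-in for test.csv: every field is wrapped in double quotes.
csv_text <- '"id","member_id","term"
"1077501","1296599","36 months"
"1077430","1314167","60 months"'

# No colClasses: every column is first read as character (quotes are stripped),
# then each column is converted to the most specific type that fits.
problem <- read.csv(text = csv_text, stringsAsFactors = TRUE)
str(problem)

# Explicit coercion, if it is still needed. Going through as.character() is the
# safe route: calling as.integer() directly on a factor returns level codes,
# not the original numbers.
problem$id <- as.integer(as.character(problem$id))
```

Here the quoted numeric columns should already come back as integer, while `term` becomes a factor, which is the split the question asks for.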

EDIT

Just to re-iterate. Imagine having a simple data table with 3 columns:

id,member_id,term
1077501,1296599,36 months
1077430,1314167,60 months
1077175,1313524,36 months
1076863,1277178,36 months

If you load the data via

d <- read.csv("c:/temp/R/data.csv")

then R will do its best to match the data types. If you really want to tell it upfront, use colClasses. But if you say something like

d <- read.csv("c:/temp/R/data.csv", colClasses = c("integer"))

then it will try to use the class integer for every column, because an unnamed colClasses vector is recycled across all columns.

The same problem occurs with

d <- read.csv("c:/temp/R/data.csv", colClasses = c("integer","character"))

R tries to load the 1st column as colClasses[1], i.e. integer: OK.

R tries to load the 2nd column as colClasses[2], i.e. character: OK.

For the 3rd column there is no colClasses[3], so R recycles colClasses and goes back to colClasses[1], i.e. integer. This fails, because R doesn't know how to coerce '36 months' to an integer value.
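The recycling behaviour can be reproduced with a short sketch. The inline `csv_text` reuses rows from the table above; the names `recycled` and `d` are only illustrative. The second call uses a named colClasses vector, which the read.table documentation describes as matched by column name, with unspecified columns left for R to guess.

```r
# First rows of the example table, unquoted.
csv_text <- "id,member_id,term
1077501,1296599,36 months
1077430,1314167,60 months"

# Unnamed and shorter than the number of columns: recycled, so the third
# column (term) is scanned as integer and the read fails.
recycled <- tryCatch(
  read.csv(text = csv_text, colClasses = c("integer", "character")),
  error = function(e) conditionMessage(e)
)
print(recycled)  # the captured error message

# Named: only the listed columns are forced; member_id is still auto-detected.
d <- read.csv(text = csv_text,
              colClasses = c(id = "integer", term = "character"))
str(d)
```

Naming the entries is usually the less surprising choice when only a few columns need a fixed type.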

EDIT2

After actually looking at the dataset, the problem is that your column does not contain any values and stores only "" (a quoted empty string). So you need to add "" to your na.strings and that will do the trick. You need to escape the quotes, so the actual string in R is "\"\"":

problem <- read.csv("c:/temp/R/test.csv",na.strings=c("\"\"","","<NA>"),colClasses=c("mths_since_last_major_derog"="integer"))
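A small reproduction of this fix, as a sketch: the inline `csv_text` is a hypothetical two-row stand-in for test.csv in which the problem column holds only quoted empty strings. The background, as far as the read.table documentation states it, is that quoting is only considered for columns read as character; a column forced to integer via colClasses therefore sees the raw field, quotes included, which also explains the `got '"1077501"'` error for id in the question.

```r
# Hypothetical stand-in: the second column contains only quoted empty strings.
csv_text <- '"id","mths_since_last_major_derog"
"1077501",""
"1077430",""'

# Without the escaped "" in na.strings, the integer column sees the two quote
# characters as its field and the read fails, as reported in the question.
failed <- tryCatch(
  read.csv(text = csv_text, na.strings = c("", "<NA>"),
           colClasses = c("mths_since_last_major_derog" = "integer")),
  error = function(e) conditionMessage(e)
)

# With the literal two-character string "" listed as an NA marker, it works.
fixed <- read.csv(text = csv_text, na.strings = c("\"\"", "", "<NA>"),
                  colClasses = c("mths_since_last_major_derog" = "integer"))
str(fixed)
```

Note that this only covers empty quoted fields. A quoted non-empty value in a column forced to integer would still fail, so for a column like id it is simpler to leave it out of colClasses and let R convert it.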
Answered 2015-09-23T16:18:08.347