r - Formatting data in R

Question

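Suppose I import a csv file into R to create an R dataset. The file contains numeric, character, date and percent values. How can I make sure the imported data has the same format as in the original file?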

In SAS, we usually have the option to format the data at import time. Here is an example:

data test ;  
           infile "c:\mydocument\raw.csv" 
           delimiter = ',' MISSOVER DSD lrecl=32767
           firstobs=2 ;

           input 
              varA         
              varB         : $50.
              varC        : date9.
              varD      : Percent5.2
              varE      : $20.
;
run;

Is there any option in R to do the same thing? It would be great if someone could point me to a reference!

Example based on the answer below:

Local<-read.csv("C:\\Users\\Raw.csv",colClasses = c("character","character","Date","character","character","character","character","character","character","character","numeric","numeric", "numeric","numeric"),row.names=1)

I used the code above, based on Dason's example, but I get the following error. Can you tell me why this error occurs? You have been very helpful.

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'a real', got '.'

Thank you. Regards.

score 4 · Accepted Answer

The colClasses argument to read.csv is what you want. From ?read.csv:

colClasses: character.  A vector of classes to be assumed for the
          columns.  Recycled as necessary, or if the character vector
          is named, unspecified values are taken to be ‘NA’.

          Possible values are ‘NA’ (the default, when ‘type.convert’ is
          used), ‘"NULL"’ (when the column is skipped), one of the
          atomic vector classes (logical, integer, numeric, complex,
          character, raw), or ‘"factor"’, ‘"Date"’ or ‘"POSIXct"’.
          Otherwise there needs to be an ‘as’ method (from package
          ‘methods’) for conversion from ‘"character"’ to the specified
          formal class.

          Note that ‘colClasses’ is specified per column (not per
          variable) and so includes the column of row names (if any).
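
The named form mentioned in the documentation can be convenient when only one or two columns need a fixed class. A minimal sketch, assuming a made-up three-column file (the column names id, code and value and the inline data are hypothetical, not from the question):

```r
# Hypothetical data: "code" has leading zeros that must be preserved.
csv_text <- "id,code,value\n1,007,2.5\n2,042,3.0"

# Only "code" is named; the other columns get NA, so read.csv
# infers their classes via type.convert as usual.
dat <- read.csv(text = csv_text, colClasses = c(code = "character"))
str(dat)
```

Here id and value are left to the default type detection, while code is kept as text so the leading zeros are not lost.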

Some example usage:

dat <- data.frame(num = 1:4, ch = letters[1:4])
write.csv(dat, file = "test.csv")
read.csv("test.csv", 
          colClasses = c(NA, "numeric", "character"),
          row.names = 1)
#  num ch
#1   1  a
#2   2  b
#3   3  c
#4   4  d
out <- read.csv("test.csv", 
                 colClasses = c(NA, "numeric", "character"),
                 row.names = 1)
str(out)
#'data.frame':  4 obs. of  2 variables:
# $ num: num  1 2 3 4
# $ ch : chr  "a" "b" "c" "d"
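
For the date and percent columns from the question, one possible approach is sketched below. It assumes dates are stored in ISO form (yyyy-mm-dd), which is what colClasses = "Date" parses by default, and that percents are stored as text such as "12.5%". The inline data and the reduced set of columns are hypothetical; they only borrow the variable names from the SAS example.

```r
# Hypothetical data modeled on the question's variables.
csv_text <- "varA,varB,varC,varD\n1,alpha,2020-01-15,12.5%\n2,beta,2020-02-01,7%"

# Read the percent column as character first; base R has no "percent" class.
dat <- read.csv(text = csv_text,
                colClasses = c("numeric", "character", "Date", "character"))

# Strip the literal "%" and rescale to a proportion.
dat$varD <- as.numeric(sub("%", "", dat$varD, fixed = TRUE)) / 100
str(dat)
```

If the dates are in another layout, such as the 15JAN2020 style that SAS date9. reads, the column would probably need to be read as character and converted afterwards with as.Date and an explicit format argument (for example "%d%b%Y", which depends on the locale's month abbreviations).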

score 1 · Accepted Answer

Regarding your error message, what is probably happening is that "." is used as a special marker in the file, most likely to show where there were NAs in the dataset. Since that column is declared as numeric, scan() cannot read "." as a number. You can use the na.strings argument to tell read.csv which strings should be treated as NA.
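
A minimal sketch of that suggestion, assuming the file marks missing numeric values with a lone "." (the inline data and column names here are hypothetical):

```r
# Hypothetical data: "." stands for a missing value in the numeric column.
csv_text <- "id,score\na,1.5\nb,.\nc,3"

# Declaring "." (along with the default "NA") as a missing-value marker
# lets the column be read as numeric without the scan() error.
dat <- read.csv(text = csv_text,
                colClasses = c("character", "numeric"),
                na.strings = c("NA", "."))
str(dat)
```

Without na.strings, the same call would be expected to stop with the scan() error shown in the question, because "." is not a valid number.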
