I have a .txt file and am using RStudio.

200416657210340 1665721 20040608 20090930 20060910 20070910 20080827 20090804
200416657210345 1665721 20040907 20090203 20070331 20080719                  
200416657210347 1665721 20040914 20091026 20070213 20080114 20090302         
200416657210352 1665721 20041111 20100315 20070123 20071205          20081202

I am trying to read the .txt file with read.fwf:

gripalisti <- read.fwf(file = "gripalisti.txt",
                         widths = c(15,8,9,9,9,9,9,9),
                         header = FALSE,
                         #stringsAsFactors = FALSE, 
                       col.names = c("einst","bu","faeding","forgun","burdur1",
                                     "burdur2","burdur3","burdur4"))

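This works, and the columns have the correct widths. However, "einst" and "bu" should be integer values, and the rest should be dates.

After the import, all values in the first column (the ID variable) look like this: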

2.003140e+14

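I have been looking for a way to change the imported columns to integer (or character?) values, but I have not found anything that does not result in an error. One example I tried after googling: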

gripalisti <- read.fwf(file = "gripalisti.txt",
                         widths = c(15,8,9,9,9,9,9,9),
                         header = FALSE,
                         #stringsAsFactors = FALSE, 
                       col.names = c("einst","bu","faeding","forgun","burdur1",
                                     "burdur2","burdur3","burdur4"),
                       colclasses = c("integer", "integer", "Date", "Date",
                                      "Date", "Date", "Date", "Date"))

This results in the error:

Error in read.table(file = FILE, header = header, sep = sep, row.names = row.names,  : 
  unused argument (colclasses = c("integer", "integer", "Date", "Date", "Date", "Date", "Date", "Date"))

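The dataset has more than 100,000 rows and many missing values, so other ways of importing it have not worked for me. The dataset is not tab-delimited.

Sorry if this is obvious, I am a very new R user.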

Edit:

Thanks for the help, I changed it to:

 colClasses = c("character", 

It looks good now.

2 Answers

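As suggested in the comments:

  1. The argument is colClasses=, not colclasses=. R argument names are case-sensitive, so the misspelled name is passed on to read.table, which rejects it as an unused argument.
  2. The first field cannot be stored as "integer" because its values exceed R's 32-bit integer range (maximum 2147483647). It must be either "numeric" or "character".
  3. (Additionally) The dates are not in the default %Y-%m-%d format, so you need to convert them after reading the data.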
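
To illustrate point 2, here is a minimal sketch using the first ID from the sample data. The variable names are made up for this illustration only.

```r
# Largest value an R integer can hold (32-bit signed): 2147483647
int_max <- .Machine$integer.max

# First ID from the sample data, kept as text
id_text <- "200416657210340"

# Coercing to integer does not work: R returns NA (normally with a warning)
id_int <- suppressWarnings(as.integer(id_text))

# A double can hold the value exactly because it is well below 2^53
id_num <- as.numeric(id_text)

is.na(id_int)                        # TRUE
id_num > int_max                     # TRUE
format(id_num, scientific = FALSE)   # "200416657210340"
```

For an ID that is never used in arithmetic, "character" is usually the safer choice, since it also preserves any leading zeros.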

Setup:

writeLines("200416657210340 1665721 20040608 20090930 20060910 20070910 20080827 20090804\n200416657210345 1665721 20040907 20090203 20070331 20080719                  \n200416657210347 1665721 20040914 20091026 20070213 20080114 20090302         \n200416657210352 1665721 20041111 20100315 20070123 20071205          20081202",
           con = "gripalisti.txt")

Run:

dat <- read.fwf("gripalisti.txt", widths = c(15,8,9,9,9,9,9,9), header = FALSE,
                col.names = c("einst","bu","faeding","forgun","burdur1", "burdur2","burdur3","burdur4"),
                colClasses = c("character", "integer", "character", "character", "character", "character", "character", "character"))
str(dat)
# 'data.frame': 4 obs. of  8 variables:
#  $ einst  : chr  "200416657210340" "200416657210345" "200416657210347" "200416657210352"
#  $ bu     : int  1665721 1665721 1665721 1665721
#  $ faeding: chr  " 20040608" " 20040907" " 20040914" " 20041111"
#  $ forgun : chr  " 20090930" " 20090203" " 20091026" " 20100315"
#  $ burdur1: chr  " 20060910" " 20070331" " 20070213" " 20070123"
#  $ burdur2: chr  " 20070910" " 20080719" " 20080114" " 20071205"
#  $ burdur3: chr  " 20080827" "         " " 20090302" "         "
#  $ burdur4: chr  " 20090804" "         " "         " " 20081202"

dat[,3:8] <- lapply(dat[,3:8], as.Date, format = "%Y%m%d")
dat
#             einst      bu    faeding     forgun    burdur1    burdur2    burdur3    burdur4
# 1 200416657210340 1665721 2004-06-08 2009-09-30 2006-09-10 2007-09-10 2008-08-27 2009-08-04
# 2 200416657210345 1665721 2004-09-07 2009-02-03 2007-03-31 2008-07-19       <NA>       <NA>
# 3 200416657210347 1665721 2004-09-14 2009-10-26 2007-02-13 2008-01-14 2009-03-02       <NA>
# 4 200416657210352 1665721 2004-11-11 2010-03-15 2007-01-23 2007-12-05       <NA> 2008-12-02

str(dat)
# 'data.frame': 4 obs. of  8 variables:
#  $ einst  : chr  "200416657210340" "200416657210345" "200416657210347" "200416657210352"
#  $ bu     : int  1665721 1665721 1665721 1665721
#  $ faeding: Date, format: "2004-06-08" "2004-09-07" "2004-09-14" "2004-11-11"
#  $ forgun : Date, format: "2009-09-30" "2009-02-03" "2009-10-26" "2010-03-15"
#  $ burdur1: Date, format: "2006-09-10" "2007-03-31" "2007-02-13" "2007-01-23"
#  $ burdur2: Date, format: "2007-09-10" "2008-07-19" "2008-01-14" "2007-12-05"
#  $ burdur3: Date, format: "2008-08-27" NA "2009-03-02" NA
#  $ burdur4: Date, format: "2009-08-04" NA NA "2008-12-02"
answered 2021-05-30T18:37:15.537

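The numbers in the first column are very large, so if you import that column as numeric, R displays it in scientific (exponential) notation by default. One way around this is to set scipen before reading the file. Use the code below: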

options(scipen = 999)

I think this should solve your problem.
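
As a small sketch of what the scipen setting does, the snippet below formats one ID from the sample data under both settings. It assumes R's default of 7 significant digits for printing, and the variable names are illustrative only. Note that scipen changes how the number is displayed, not how it is stored.

```r
x <- 200416657210340

# With the default penalty, the shorter scientific form is preferred
old <- options(scipen = 0)
default_text <- format(x)

# With a large penalty, fixed notation is preferred instead
options(scipen = 999)
fixed_text <- format(x)

# Restore the previous setting
options(old)

default_text   # scientific notation
fixed_text     # "200416657210340"
```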

Below is the code I ran. You will of course still need to work on the date columns. For that you can use a simple command such as as.Date(gripalisti$burdur1, format = "%Y%m%d")
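
A minimal sketch of that conversion on a few values shaped like the fixed-width fields in the question (leading space, and all blanks where a date is missing). Trimming the whitespace and mapping empty strings to NA first is an extra precaution added here, not something the original command does. The vector names are hypothetical.

```r
# Values as they might come out of a 9-character fixed-width field
raw <- c(" 20080827", "         ", " 20090302")

# Strip padding and treat empty fields as missing
clean <- trimws(raw)
clean[clean == ""] <- NA

# Parse the compact YYYYMMDD form into Date objects
dates <- as.Date(clean, format = "%Y%m%d")

format(dates)   # "2008-08-27" NA "2009-03-02"
```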

answered 2021-05-30T17:56:07.640