r - Specifying column types when importing xlsx data into R with package readxl

Question

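I am importing xlsx 2007 tables into R 3.2.1patched using package readxl 0.1.0 under Windows 7 64. The tables are on the order of 25,000 rows by 200 columns.

The function read_excel() works a treat. My only problem is with how it assigns column class (data type) to sparsely populated columns. For example, a given column may be NA for 20,000 rows and then take a character value on row 20,001. read_excel() appears to default to column type numeric when it scans the first n rows of a column and finds only NAs. The data causing the problem are characters in a column that was assigned numeric. When the error limit is reached, execution halts. I actually want the data in the sparse columns, so setting the error limit higher is not a solution.

I can identify the troublesome columns by reviewing the warnings thrown. And read_excel() has an option for asserting a column's data type by setting the argument col_types, according to the package docs:

Either NULL to guess from the spreadsheet, or a character vector containing blank, numeric, date or text.

But does this mean I have to construct a vector of length 200 that is populated in almost every position with blank, and with text in the handful of positions corresponding to the offending columns?

There is probably a way of doing this in a couple of lines of R code. Create a vector of the required length and fill it with blanks. Maybe another vector containing the numbers of the columns to be forced to text, and then ... read_excel()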
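A minimal sketch of the vector construction described above. It assumes readxl 1.0.0 or later, where col_types accepts "guess" for columns that should still be guessed; the column count, the column positions and the file name are hypothetical placeholders.

```r
n_cols    <- 200L                # assumed number of columns in the sheet
text_cols <- c(12L, 57L, 143L)   # hypothetical positions of the offending columns

# Start with "guess" everywhere, then force the sparse columns to text.
col_types <- rep("guess", n_cols)
col_types[text_cols] <- "text"

# The vector would then be passed to read_excel(), for example:
# dat <- readxl::read_excel("my_file.xlsx", col_types = col_types)
```

The vector must have exactly one entry per column in the sheet, so n_cols has to match the real sheet width.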

I would be grateful for any suggestions.

Thanks in advance.

score 12 · Accepted Answer

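New solution since readxl version 1.x:

The solution in the currently preferred answer no longer works with versions of readxl newer than 0.1.0, because the package-internal function it uses, readxl:::xlsx_col_types, no longer exists.

The new solution is to use the newly introduced parameter guess_max to increase the number of rows used to "guess" the appropriate data type of the columns: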

read_excel("My_Excel_file.xlsx", sheet = 1, guess_max = 1048576)

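The value 1,048,576 is the maximum number of rows currently supported by Excel; see the Excel specs: https://support.office.com/en-us/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3

PS: If you are worried about the performance of using all rows to guess the data type: read_excel seems to read the file only once and the guess is then done in memory, so the performance penalty is very small compared to the work saved.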

score 6 · Accepted Answer

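I have encountered a similar problem.

In my case, empty rows and columns were used as separators, and the sheets contained many tables (with different formats). So the {openxlsx} and {readxl} packages are not suited to this situation: openxlsx removes empty columns (and there is no parameter to change this behavior), while readxl works as you described, so some data may be lost.

As a result, I think that if you want to process large amounts of Excel data automatically, the best solution is to read the sheets unchanged, in "text" format, and then process the data.frames according to your own rules.

This function can read sheets without changing them (thanks to @jack-wasey):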

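# Read one sheet with every column as text.
# Relies on the internal readxl:::xlsx_col_types, so it needs readxl 0.1.0.
loadExcelSheet <- function(excel.file, sheet)
{
  require("readxl")
  sheets <- readxl::excel_sheets(excel.file)
  # The internal function takes a zero-based sheet index
  sheet.num <- match(sheet, sheets) - 1
  # Count the columns by guessing types from a single row
  num.columns <- length(readxl:::xlsx_col_types(excel.file, sheet = sheet.num,
                                                nskip = 0, n = 1))

  return.sheet <- readxl::read_excel(excel.file, sheet = sheet,
                                     col_types = rep("text", num.columns),
                                     col_names = FALSE)
  return.sheet
}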
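The "read as text, then process according to your rules" idea can also be sketched without internal functions. This assumes readxl 1.0.0 or later, where a single col_types value is recycled across all columns. The helper names read_sheet_as_text and convert_text_columns are made up for illustration, and utils::type.convert stands in for whatever conversion rules you actually need.

```r
# Read every cell of a sheet as text (assumes readxl >= 1.0.0).
# Hypothetical helper; defined here but not called in this sketch.
read_sheet_as_text <- function(path, sheet = 1) {
  readxl::read_excel(path, sheet = sheet,
                     col_types = "text", col_names = FALSE)
}

# Convert all-text columns afterwards, one column at a time.
# type.convert() is only a placeholder for your own rules.
convert_text_columns <- function(df) {
  df[] <- lapply(df, function(col) utils::type.convert(col, as.is = TRUE))
  df
}

# Toy data standing in for a sheet that was read as text.
raw <- data.frame(id    = c("1", "2", "3"),
                  label = c(NA, NA, "late"),
                  stringsAsFactors = FALSE)
converted <- convert_text_columns(raw)
sapply(converted, class)
```

Because nothing is guessed at read time, a value that first appears late in a sparse column is kept as text rather than being dropped with a warning.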

score 6 · Accepted Answer

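It depends on whether your data are sparse in different places in different columns, and on how sparse they are. I found that scanning more rows did not improve the parsing: most were still blank and were interpreted as text, even if they later turned into dates and so on.

One workaround is to build the first data row of your Excel table so that it contains representative data for every column, and use that row to guess the column types. I don't like this because I want to leave the original data intact.

Another workaround, if you have complete rows somewhere in the spreadsheet, is to use nskip instead of n. This sets the starting point for the column guessing. Say data row 117 has a full set of data: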

readxl:::xlsx_col_types(path = "a.xlsx", nskip = 116, n = 1)

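Note that you can call the function directly, without having to edit it in the namespace.

You can then use the resulting vector of column types to call read_excel: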

col_types <- readxl:::xlsx_col_types(path = "a.xlsx", nskip = 116, n = 1)
dat <- readxl::read_excel(path = "a.xlsx", col_types = col_types)

Then you can manually update any columns it still gets wrong.

score 2 · Accepted Answer

Reading the source, it looks like column types are guessed by the functions xls_col_types or xlsx_col_types, which are implemented in Rcpp but have these defaults:

xls_col_types <- function(path, na, sheet = 0L, nskip = 0L, n = 100L, has_col_names = FALSE) {
    .Call('readxl_xls_col_types', PACKAGE = 'readxl', path, na, sheet, nskip, n, has_col_names)
}

xlsx_col_types <- function(path, sheet = 0L, na = "", nskip = 0L, n = 100L) {
    .Call('readxl_xlsx_col_types', PACKAGE = 'readxl', path, sheet, na, nskip, n)
}

My C++ is very rusty, but it looks like n = 100L is the argument that says how many rows to read.

As these are non-exported functions, paste in:

fixInNamespace("xls_col_types", "readxl")
fixInNamespace("xlsx_col_types", "readxl")

In the pop-up window, change n = 100L to a larger number. Then rerun your file import.

score 2 · Accepted Answer

Looking at the source, we see that there is an Rcpp call that returns the guessed column types:

xlsx_col_types <- function(path, sheet = 0L, na = "", nskip = 0L, n = 100L) {
    .Call('readxl_xlsx_col_types', PACKAGE = 'readxl', path, sheet, na, nskip, n)
}

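You can see that by default, nskip = 0L, n = 100L checks the first 100 rows to guess the column type. You can change nskip to ignore header text and increase n (at the cost of a slower runtime) by doing: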

col_types <-  .Call( 'readxl_xlsx_col_types', PACKAGE = 'readxl', 
                     path = file_loc, sheet = 0L, na = "", 
                     nskip = 1L, n = 10000L )

# if a column type is "blank", no values yet encountered -- increase n or just guess "text"
col_types[col_types=="blank"] <- "text"

raw <- read_excel(path = file_loc, col_types = col_types)

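Without looking at the .Rcpp, it is not clear to me whether nskip = 0L skips the header row (the zeroth row in C++ counting) or skips no rows. I avoided the ambiguity by using nskip = 1L, since skipping one row of my dataset does not affect the overall column type guesses.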
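To see why the number of scanned rows matters for a sparse column, here is a toy model in plain R. The function guess_type is a made-up stand-in and is not readxl's actual guessing algorithm; it only mimics the idea of inspecting the first n values and reporting "blank" when none of them are filled in.

```r
# Toy stand-in for a column type guesser; NOT readxl's real implementation.
guess_type <- function(x, n = 100L) {
  head_vals <- head(x, n)
  seen <- head_vals[!is.na(head_vals)]
  if (length(seen) == 0L) return("blank")   # nothing encountered yet
  is_num <- !is.na(suppressWarnings(as.numeric(seen)))
  if (all(is_num)) "numeric" else "text"
}

# A column that is empty for 20,000 rows and then holds one character value.
sparse <- c(rep(NA_character_, 20000), "late value")

guess_type(sparse, n = 100L)     # only sees empty cells
guess_type(sparse, n = 25000L)   # scans far enough to reach the value
```

With the small n the guesser never reaches the late value, which is the situation the "blank" to "text" replacement above works around.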

score 1 · Accepted Answer

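The internal functions for guessing column types can be set to scan any number of rows. But read_excel() does not implement that (yet?).

The solution below is just a rewrite of the original function read_excel() with an extra argument n_max that defaults to all rows. For lack of imagination, this extended function is named read_excel2.

Just replace read_excel with read_excel2 to evaluate column types using all rows.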

# Inspiration: https://github.com/hadley/readxl/blob/master/R/read_excel.R 
# Rewrote read_excel() to read_excel2() with additional argument 'n_max' for number
# of rows to evaluate in function readxl:::xls_col_types and
# readxl:::xlsx_col_types()
# This is probably an unstable solution, since it calls internal functions from readxl.
# May or may not survive next update of readxl. Seems to work in version 0.1.0
library(readxl)

read_excel2 <- function(path, sheet = 1, col_names = TRUE, col_types = NULL,
                       na = "", skip = 0, n_max = 1050000L) {

  path <- readxl:::check_file(path)
  ext <- tolower(tools::file_ext(path))

  switch(readxl:::excel_format(path),
         xls =  read_xls2(path, sheet, col_names, col_types, na, skip, n_max),
         xlsx = read_xlsx2(path, sheet, col_names, col_types, na, skip, n_max)
  )
}
read_xls2 <- function(path, sheet = 1, col_names = TRUE, col_types = NULL,
                     na = "", skip = 0, n_max = 1050000L) {

  sheet <- readxl:::standardise_sheet(sheet, readxl:::xls_sheets(path))

  has_col_names <- isTRUE(col_names)
  if (has_col_names) {
    col_names <- readxl:::xls_col_names(path, sheet, nskip = skip)
  } else if (readxl:::isFALSE(col_names)) {
    col_names <- paste0("X", seq_along(readxl:::xls_col_names(path, sheet)))
  }

  if (is.null(col_types)) {
    col_types <- readxl:::xls_col_types(
      path, sheet, na = na, nskip = skip, has_col_names = has_col_names, n = n_max
    )
  }

  readxl:::xls_cols(path, sheet, col_names = col_names, col_types = col_types, 
                    na = na, nskip = skip + has_col_names)
}

read_xlsx2 <- function(path, sheet = 1L, col_names = TRUE, col_types = NULL,
                       na = "", skip = 0, n_max = 1050000L) {
  path <- readxl:::check_file(path)
  sheet <-
    readxl:::standardise_sheet(sheet, readxl:::xlsx_sheets(path))

  if (is.null(col_types)) {
    col_types <-
      readxl:::xlsx_col_types(
        path = path, sheet = sheet, na = na, nskip = skip + isTRUE(col_names), n = n_max
      )
  }

  readxl:::read_xlsx_(path, sheet, col_names = col_names, col_types = col_types, na = na,
             nskip = skip)
}

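You may take a bad performance hit because of this extended guessing. I have not tried it on really big data sets yet, only on data small enough to verify that the function works.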
