r - How can I read a table from a loosely structured text file into a data frame in R?

Question

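Take a look at the "Estimated Global Trend daily values" file on this NOAA webpage. It is a .txt file with 50 header lines (identified by leading #s) followed by several thousand lines of tabular data. The link to download the file is embedded in the code below.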

How can I read this file so that I end up with a data frame (or tibble) with appropriate column names and data?

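All of the text-to-data functions I know get stymied by those header lines. Here is what I just tried, adapted from this SO Q&A. My idea was to read the file into a list of lines, drop the lines that start with # from the list, and then do.call(rbind, ...) the rest. The download part at the top works fine, but when I run the function, I get back an empty list.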

temp <- paste0(tempfile(), ".txt")
download.file("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_trend_gl.txt",
              destfile = temp, mode = "wb")

processFile = function(filepath) {
  dat_list <- list()
  con = file(filepath, "r")
  while ( TRUE ) {
    line = readLines(con, n = 1)
    if ( length(line) == 0 ) {
      break
    }
    append(dat_list, line)
  }

  close(con)

  return(dat_list)

}

dat_list <- processFile(temp)

score 2 · Accepted Answer

Here is one possible option:

processFile = function(filepath, header=TRUE, ...) {

  lines <- readLines(filepath)
  comments <- which(grepl("^#", lines))
  header_row <- gsub("^#","",lines[tail(comments,1)])
  data <- read.table(text=c(header_row, lines[-comments]), header=header, ...)

  return(data)

}

processFile(temp)

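The idea is that we read in all the lines, find the ones that start with "#", and ignore them, except for the last one, which is used as the header. We remove the "#" from the header (otherwise it would be treated as a comment) and then pass everything to read.table to parse the data.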
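The same idea can be tried without downloading anything. The following is a minimal sketch on a few made-up lines that imitate the layout described in the question; `sample_lines` and its values are hypothetical, not taken from the NOAA file. It uses a logical mask rather than negative indices, because `lines[-which(...)]` returns an empty vector when no line matches.

```r
# Hypothetical sample imitating the file layout: '#' comment lines,
# the last of which holds the column names, followed by data rows.
sample_lines <- c(
  "# Some descriptive text",
  "# More descriptive text",
  "#  year month day cycle trend",
  "  2010 1 1 388.19 387.64",
  "  2010 1 2 388.21 387.65"
)

# Logical mask of comment lines
is_comment <- grepl("^#", sample_lines)

# The last comment line carries the column names; strip its leading '#'
header_row <- sub("^#", "", sample_lines[max(which(is_comment))])

# Parse the cleaned header plus the non-comment lines as one table
dat <- read.table(text = c(header_row, sample_lines[!is_comment]),
                  header = TRUE)
str(dat)
```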

score 2 · Accepted Answer

Here are a few options that bypass your function; you can mix and match them.
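As an aside on why the function in the question returns an empty list: `append()` returns a new list rather than modifying its argument, so the result has to be assigned back. A minimal sketch with a made-up value:

```r
dat_list <- list()

append(dat_list, "a line")              # return value is discarded
n_before <- length(dat_list)            # dat_list itself is unchanged

dat_list <- append(dat_list, "a line")  # assign the result back
n_after <- length(dat_list)
```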

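In the simplest (though unlikely) scenario, where you already know the column names, you can use read.table and enter the column names manually. The default option comment.char = "#" means those comment lines are omitted.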

read.table(temp, col.names = c("year", "month", "day", "cycle", "trend"))
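To see the default `comment.char = "#"` at work without the download, here is a small sketch on made-up text; `sample_text` and its values are assumptions for illustration only.

```r
# Hypothetical text: two comment lines followed by two data rows
sample_text <- "# header comment
#  year month day cycle trend
2010 1 1 388.19 387.64
2010 1 2 388.21 387.65"

# Lines starting with '#' are dropped because comment.char defaults to "#"
dat <- read.table(text = sample_text,
                  col.names = c("year", "month", "day", "cycle", "trend"))
dat
```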

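More likely, you don't know the column names, but you can get them by working out how many comment lines there are and then reading only the last of those lines. That saves you from reading more of the file than you need; this file is small enough that it shouldn't make much difference, but with a larger file it might. I do the counting by calling out to the command line, only because that's the way I know how. Also note that I saved the text file at a simpler path (co2.txt); you could instead paste the command together with the temp variable.

Again, the comments are omitted by default.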

n_comments <- as.numeric(system("grep '^# ' co2.txt | wc -l", intern = TRUE))
hdrs <- scan(temp, skip = n_comments - 1, nlines = 1, what = "character")[-1]
read.table(temp, col.names = hdrs)
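If calling out to `grep` is not an option (for example on a system without it), the count can also be done in R. The sketch below assumes that all comment lines sit at the top of the file, as the question describes; `count_leading_comments` is a hypothetical helper, and the sample file is made up.

```r
# Write a hypothetical sample file with the same layout
path <- tempfile(fileext = ".txt")
writeLines(c("# Some descriptive text",
             "# More descriptive text",
             "#  year month day cycle trend",
             "2010 1 1 388.19 387.64",
             "2010 1 2 388.21 387.65"), path)

# Count comment lines at the top, stopping at the first data line,
# so the rest of the file is never read
count_leading_comments <- function(path) {
  con <- file(path, "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0 || !startsWith(line, "#")) break
    n <- n + 1L
  }
  n
}

n_comments <- count_leading_comments(path)

# Read only the last comment line; [-1] drops the "#" token
hdrs <- scan(path, skip = n_comments - 1, nlines = 1,
             what = "character", quiet = TRUE)[-1]
dat <- read.table(path, col.names = hdrs)
```

Note that dropping the first token with `[-1]` relies on whitespace between the "#" and the first column name.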

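Or, with dplyr and stringr, read all the lines, separate out the comments to extract the column names, then filter out the comment lines and split the rest into fields, assigning the column names you just extracted. Again, with a much larger file this could become cumbersome.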

library(dplyr)

lines <- data.frame(text = readLines(temp), stringsAsFactors = FALSE)
comments <- lines %>%
  filter(stringr::str_detect(text, "^#"))

hdrs <- strsplit(comments[nrow(comments), 1], "\\s+")[[1]][-1]

lines %>%
  filter(!stringr::str_detect(text, "^#")) %>%
  mutate(text = trimws(text)) %>%
  tidyr::separate(text, into = hdrs, sep = "\\s+") %>%
  mutate_all(as.numeric)
