r - Reading a text file with multiple spaces as delimiter in R

Question

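I have a large data set with about 94 columns and 3 million rows. The file uses both single and multiple spaces as delimiters between columns. I need to read some of the columns from this file in R. To do this I tried read.table() with the options shown in the code below: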

### Define the columns to read from the file: read the first 5 columns, skip the next 24, read the next 5, and skip the last 60

    col_classes = c(rep("character",2), rep("numeric", 3), rep("NULL",24), rep("numeric", 5), rep("NULL", 60))   

### Read the first 100 rows of the data

    data <- read.table(file, sep = " ",header = F, nrows = 100, na.strings ="", stringsAsFactors= F)

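Because the file has multiple spaces as the delimiter between some of the columns, the method above does not work. Is there a way to read this file efficiently?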

score 108 · Accepted Answer

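You need to change your delimiter. " " means exactly one space character. "" means whitespace of any length is treated as the delimiter.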

    data <- read.table(file, sep = "", header = FALSE, nrows = 100,
                       na.strings = "", stringsAsFactors = FALSE)

From the manual:

If sep = "" (the default for read.table) the separator is 'white space', that is one or more spaces, tabs, newlines or carriage returns.
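To tie this back to the question, here is a minimal sketch that combines sep = "" with colClasses to skip unwanted columns. The sample file and its four columns are made up for illustration; the real file would need a 94-element colClasses vector like the one in the question.

```r
# Hypothetical sample: three rows, four columns, separated by one or more spaces.
path <- tempfile(fileext = ".txt")
writeLines(c("a1 x1   1.5  10",
             "a2   x2 2.5    20",
             "a3 x3     3.5 30"), path)

# "NULL" in colClasses skips a column: keep columns 1, 3 and 4, drop column 2.
col_classes <- c("character", "NULL", "numeric", "numeric")

# sep = "" treats any run of whitespace as a single delimiter.
data <- read.table(path, sep = "", header = FALSE, nrows = 100,
                   colClasses = col_classes, stringsAsFactors = FALSE)

str(data)
unlink(path)
```

Skipped columns are dropped at read time, so only the columns you keep end up in the resulting data frame.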

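Also, with a large data file you may want to consider data.table::fread to quickly read the data straight into a data.table. I was using this function myself this morning. It is still experimental, but I find it works very well indeed.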
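A rough sketch of the fread route, assuming the data.table package is installed. The sample file is hypothetical, and the select argument is used here as a stand-in for the "NULL" entries in colClasses. As far as I know, fread treats a run of spaces as a single separator when the separator is a space, but it is worth checking the result on a few rows of the real file first.

```r
# Assumes the data.table package is installed; the sample file is hypothetical.
library(data.table)

path <- tempfile(fileext = ".txt")
writeLines(c("a1 x1   1.5  10",
             "a2   x2 2.5    20",
             "a3 x3     3.5 30"), path)

# sep = " " is passed explicitly here; fread can usually detect it on its own.
# select keeps only the listed columns (1, 3 and 4) and drops the rest.
dt <- fread(path, sep = " ", header = FALSE, select = c(1, 3, 4))

print(dt)
unlink(path)
```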

score 8 · Accepted Answer

If you would rather use the tidyverse (or, more specifically, the readr package), you can use read_table instead.

    read_table(file, col_names = TRUE, col_types = NULL,
      locale = default_locale(), na = "NA", skip = 0, n_max = Inf,
      guess_max = min(n_max, 1000), progress = show_progress(), comment = "")

And see this in the description:

read_table() and read_table2() are designed to read the type of textual data where
each column is separated by one (or more) columns of space.

score 3 · Accepted Answer

If your fields have a fixed width, you should consider using read.fwf(), which may handle missing values better.
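A small sketch of why fixed-width reading can help with missing values, assuming made-up field widths of 3, 6 and 4 characters. A blank field would shift the columns under whitespace splitting, whereas read.fwf() cuts each line at fixed positions and reads the blank numeric field as NA.

```r
# Hypothetical fixed-width sample: fields are 3, 6 and 4 characters wide.
# The second field of row 2 is blank, i.e. a missing value.
rows <- sprintf("%-3s%6s%4s",
                c("a1", "a2", "a3"),
                c("1.5", "", "3.5"),
                c("10", "20", "30"))
path <- tempfile(fileext = ".txt")
writeLines(rows, path)

# widths gives the character width of each field;
# strip.white removes the padding around character fields.
fixed <- read.fwf(path, widths = c(3, 6, 4),
                  colClasses = c("character", "numeric", "numeric"),
                  strip.white = TRUE)

print(fixed)
unlink(path)
```

Note that read.fwf() is generally slower than read.table(), so for a file with millions of rows it may be worth trying on a subset first.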
