r - Problem importing txt files in R

Question

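The txt files themselves seem to be quite poorly formatted, and once they are in the R environment I cannot get the data into a workable format (i.e. a data frame with one entry per column).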

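The base R function read.delim imports my text files as a single column (it ignores the delimiter, and I am not sure whether that delimiter is a tab or spaces). I tried: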

indivs<-lapply(files, read.delim, sep="\t", header=T, na.strings = "NA")

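This gives the undesired result described above: a single column in which all the values, separated by spaces or tabs, form one long string.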

I also tried:

indivs<-lapply(files, read.delim, sep=" ", header=T, na.strings = "NA")

which throws:

Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  more columns than column names

So I figured that the first option at least gets the files into R, and I can work from there...

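The data come from GPS trackers, which are prone to bad readings. A bad reading causes a mismatch between the number of columns and the number of headers, because the tracker does not write as many values when it cannot determine its position (see also the data structure at the end). It produces entries like this: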

356   356  NotEnoughSats   0/2  19/12/12  13:40:11

A complete entry looks like this:

357   357          Valid   5/6  19/12/12  13:50:11  19/12/12  13:48:33.831    -97.169    -23.44309    151.91783        35.04    10.8         0.9             0.0        0.00

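My idea is to work with the files as I have actually managed to import them and combine dplyr::filter with grepl to remove the rows with bad readings. That should leave the right number of header names and entries, so that read.delim can then parse the data correctly.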

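I am working with a list of data frames from different trackers, so I am hoping someone can find a way to apply the fix with lapply or something similar, e.g. (does not run):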

    cleaned.txts<-lapply(indivs, function(x){
  x%>%
    filter(grepl("Valid ", .))
})

Here is a sample from one of the data frames:

    > head(indivs[1])
[[1]]
    Index.........Status..Sats..RTC.date..RTC.time..FIX.date......FIX.time...Delta.s......Latitude....Longitude..Altitude.m.....HDOP........eRes..Temperature.C...Voltage.V.
1       1  NotEnoughSats   0/0  19/12/10  02:30:06                            -81.140                                                                        0.0        0.00
2       2  NotEnoughSats   0/0  19/12/10  02:40:06                            -81.160                                                                        0.0        0.00
3       3  NotEnoughSats   0/2  19/12/10  02:50:08                            -81.180                                                                        0.0        0.00
4       4  NotEnoughSats   0/2  19/12/10  03:00:08                            -81.200                                                                        0.0        0.00
5       5  NotEnoughSats   0/1  19/12/10  03:10:08                            -81.220                                                                        0.0        0.00
6       6  NotEnoughSats   0/0  19/12/10  03:20:06                            -81.240                                                                        0.0        0.00
7       7  NotEnoughSats   0/2  19/12/10  03:30:08                            -81.260                                                                        0.0        0.00
8       8          Valid   3/3  19/12/10  03:40:11  19/12/10  03:38:49.720    -81.280    -23.44205    151.91308        30.00     2.9         0.0             0.0        0.00
9       9  NotEnoughSats   0/1  19/12/10  03:50:08                            -81.300                                                                        0.0        0.00
10     10  NotEnoughSats   0/0  19/12/10  04:00:06                            -81.320                                                                        0.0        0.00
11     11  NotEnoughSats   0/0  19/12/10  04:10:06                            -81.340                                                                        0.0        0.00
12     12  NotEnoughSats   0/2  19/12/10  04:20:08                            -81.360                                                                        0.0        0.00
13     13  NotEnoughSats   0/2  19/12/10  04:30:08                            -81.380                                                                        0.0        0.00
14     14  NotEnoughSats   0/1  19/12/10  04:40:08                            -81.400                                                                        0.0        0.00
15     15  NotEnoughSats   0/1  19/12/10  04:50:08                            -81.420                                                                        0.0        0.00
16     16  NotEnoughSats   0/1  19/12/10  05:00:08                            -81.440                                                                        0.0        0.00
17     17  NotEnoughSats   0/2  19/12/10  05:10:08                            -81.460                                                                        0.0        0.00
18     18  NotEnoughSats   0/2  19/12/10  05:20:08                            -81.480                                                                        0.0        0.00
19     19  NotEnoughSats   0/1  19/12/10  05:30:08                            -81.500                                                                        0.0        0.00
20     20  NotEnoughSats   0/1  19/12/10  05:40:08                            -81.520                                                                        0.0        0.00
21     21  NotEnoughSats   0/1  19/12/10  05:50:08                            -81.540                                                                        0.0        0.00
22     22          Valid   5/5  19/12/10  06:00:11  19/12/10  05:58:49.467    -81.533    -23.44350    151.91756        58.28     1.5         0.8             0.0        0.00
23     23  NotEnoughSats   0/1  19/12/10  06:10:08                            -81.580                                                                        0.0        0.00
24     24          Valid   3/3  19/12/10  06:20:11  19/12/10  06:18:49.400    -81.600    -23.43780    151.92362        58.35   219.5         0.0             0.0        0.00
25     25  NotEnoughSats   0/1  19/12/10  06:30:08                            -81.720                                                                        0.0        0.00

score 1 · Accepted Answer

// Update:
The OP's problem was solved simply with:

indivs <- lapply(files, read.table, sep="", header=T, fill=T)
indiv2 <- lapply(indivs, filter, Status=="Valid")
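To see why sep="" together with fill=T helps, here is a minimal sketch on a made-up sample. The shortened header, the name tracker_a and the two rows are assumptions for illustration only, and the filtering uses base R subsetting as a stand-in for dplyr's filter so that the sketch runs without extra packages.

```r
# Hypothetical sample in the same layout as the tracker files:
# whitespace-separated, with a short row when there is no GPS fix.
sample_txts <- list(
  tracker_a = "Index Status Sats RTC.date RTC.time FIX.date FIX.time Delta Latitude Longitude
1          NotEnoughSats   0/0  19/12/10  02:30:06  -81.140
8          Valid   3/3  19/12/10  03:40:11  19/12/10  03:38:49.720  -81.280  -23.44205  151.91308"
)

# sep = "" treats any run of spaces or tabs as a single delimiter;
# fill = TRUE pads short rows with blanks/NA instead of raising an error.
indivs <- lapply(sample_txts, function(txt) {
  read.table(text = txt, sep = "", header = TRUE, fill = TRUE)
})

# Short rows are filled from the left, so their values do not line up with
# the header. Keeping only the complete "Valid" rows avoids that problem.
indiv2 <- lapply(indivs, function(x) x[x$Status == "Valid", ])

print(indiv2$tracker_a)
```

With sep=" " every single space counts as a delimiter, so the padding between values produces many empty fields, which would explain the "more columns than column names" error in the question.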

Original answer:

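Does this work?
It worked for me, assuming that the extra columns are added at the end when your tracker has a GPS reading. Otherwise, you may want to fix the colnames in each file before map_dfr.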

library(dplyr)
library(purrr)

# only get the content of your files
files_content <- file_ls %>%
    map_dfr(~suppressWarnings(read.table(., sep='', header=F, skip=1, fill=T, na.strings = '')))

# only get the headers and keep the longest one
files_headers <- file_ls %>%
    map(~read.table(., sep='', header=F, nrows=1, na.strings = '')) %>%
    .[[which.max(sapply(., length))]]

# rename the columns with that header
files_final <- files_content %>%
    rename_with(.fn = ~as.character(files_headers[.x]), .cols = names(files_headers))
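The "keep the longest header, then rename by position" idea can be sketched in base R as follows. The headers and the small content data frame are invented for illustration and are not taken from the real files.

```r
# Hypothetical headers, as they might be read from three different files.
headers <- list(
  c("Index", "Status", "Sats"),
  c("Index", "Status", "Sats", "Latitude", "Longitude"),
  c("Index", "Status")
)

# lengths() gives the number of names in each header; which.max() returns
# the position of the first longest one.
longest <- headers[[which.max(lengths(headers))]]

# Content read with header = FALSE gets the default names V1, V2, ...
content <- data.frame(
  V1 = c(8, 22),
  V2 = c("Valid", "Valid"),
  V3 = c("3/3", "5/5"),
  V4 = c(-23.44205, -23.44350),
  V5 = c(151.91308, 151.91756)
)

# Overwrite the default names by position with the longest header.
names(content) <- longest[seq_along(content)]
print(content)
```

Renaming by position is only safe when every file lists its columns in the same order, which is the assumption stated above.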

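// Update:
Given the issue with records spread over multiple lines, here is a rework. This time, the code reads each file line by line, then assigns each line a real line id (real_line_i in the code) based on whether Valid or NotEnoughSats is found in it. We then glue together the lines that were oddly split in the file, and only then parse them.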

library(readr)
library(tibble)
library(tidyr)
library(stringr)
library(dplyr)
library(purrr)

files_headers <- file_ls %>%
  map(~read.table(., sep='', header=F, nrows=1, na.strings = '')) %>%
  .[[which.max(sapply(., length))]] %>%
  as.character()

files_final <- file_ls %>%
  map_dfr(
    ~ tibble(
        line_raw = read_lines(., skip = 1)
      ) %>%
      mutate(
        validity = str_extract(line_raw, 'NotEnoughSats|Valid')
      ) %>%
      group_by(real_line_i = cumsum(!is.na(validity))) %>%
      summarise(
        parsed_line = paste(line_raw, collapse = ' ') %>%
          map(
            ~ strsplit(., split = '\\s+') %>%
              unlist() %>%
              setNames(., files_headers[seq_along(.)]) %>%
              as_tibble_row(.name_repair = 'universal')
          ),
        .groups = 'drop'
      ) %>%
      unnest(parsed_line)
  )
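The grouping trick in the rework can be looked at in isolation. Below is a minimal base R sketch on invented lines in which the second record is split across two physical lines. The sample values are assumptions, and split with vapply stands in for the group_by and summarise step.

```r
# Hypothetical raw lines: the "Valid" record is broken over two lines.
raw_lines <- c(
  "1 NotEnoughSats 0/0 19/12/10 02:30:06 -81.140",
  "8 Valid 3/3 19/12/10 03:40:11 19/12/10",
  "03:38:49.720 -81.280 -23.44205 151.91308",
  "9 NotEnoughSats 0/1 19/12/10 03:50:08 -81.300"
)

# A line containing a status keyword starts a new record. cumsum() turns
# those starts into a running id that continuation lines inherit.
starts_record <- grepl("NotEnoughSats|Valid", raw_lines)
record_id <- cumsum(starts_record)

# Glue the pieces of each record back together with a single space.
glued <- vapply(split(raw_lines, record_id), paste, character(1), collapse = " ")

# Only now split each complete record into its whitespace-separated fields.
fields <- strsplit(trimws(glued), "\\s+")
print(lengths(fields))
```

This relies on every record containing exactly one of the two status keywords, and on continuation lines containing neither.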
