r - R: Reading a fwf file where the last columns are missing

Question

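I am trying to parse a fixed-width .txt file using readr's read_fwf. There are about 1.5 million observations, and roughly 550 of them are missing the last 25 of the 60 variables. This omission leads to imperfect parsing of the final variable that these observations do have ("description" in the example below), and leaves the data frame without these partially populated columns.

For example,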

df_baseline <- read_fwf(file = file, fwf_widths(fwf_widths, fwf_names), 
                         col_types = col_types, trim_ws = T) %>% 
   mutate_all(na_if, "")
Warning: 1148 parsing failures.
row         col   expected     actual file
300495 description 240 chars  102        '/path/to/my/file/filename.txt'
300495 NA          59 columns 31 columns '/path/to/my/file/filename.txt' 
500245 description 240 chars  56         '/path/to/my/file/filename.txt' 
500245 NA          59 columns 31 columns '/path/to/my/file/filename.txt' 
500333 description 240 chars  33         '/path/to/my/file/filename.txt' 
See problems(...) for more details.

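col_types is a string of 60 'c' characters, so all columns are read in as character. fwf_widths and fwf_names are the appropriate specifications of the intended column widths and column names.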
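As a minimal sketch of how such a specification could be built: the column names below are hypothetical placeholders, since the question does not give the actual layout.

```r
# Hypothetical layout: 60 columns, all read as character.
# The real names and widths are not given in the question.
n_cols <- 60
col_types <- strrep("c", n_cols)              # one "c" (character) per column
fwf_names <- paste0("var", seq_len(n_cols))   # placeholder column names
```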

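I know that by having missing values in the last columns of the df, I am violating the "fixed width" nature of the document.

Is there a way to 1) make read_fwf keep these partially populated rows? 2) If not, how can I read this txt file, given that 99% of it can be parsed as a normal FWF?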

score 1 · Accepted Answer

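You can use data.table::fread(). It detects the column layout automatically (in this example the columns are separated by whitespace, which is what fread picks up as the delimiter), and the option fill=TRUE should give you what you want: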

#abc.txt
#a   b   c   d
#1   2   3   4
#1   2   3   4
#2   3
#1   4   3   2
library(data.table)
fread('abc.txt',fill = T)
#    a b  c  d
# 1: 1 2  3  4
# 2: 1 2  3  4
# 3: 2 3 NA NA
# 4: 1 4  3  2
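To try this without creating abc.txt by hand, the sample can be written to a temporary file first. This is only a sketch under the assumption that data.table is installed; the temporary path is illustrative.

```r
library(data.table)

# Write the sample shown above to a temporary file (illustrative path)
tmp <- tempfile(fileext = ".txt")
writeLines(c("a   b   c   d",
             "1   2   3   4",
             "1   2   3   4",
             "2   3",
             "1   4   3   2"), tmp)

# fill = TRUE pads the short third data row with NA instead of failing
dt <- fread(tmp, fill = TRUE)
dt
```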

score 0 · Accepted Answer

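The question is ambiguous, which makes it hard to answer directly or precisely, but the fwf file ABCD.txt illustrates three cases the OP might be asking about: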

# ABCD.txt
# 1ABCD
# 2AB
# 3AB D
# 4ABD
# 5ABCD
#

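Rows 1 and 5 have no missing values and parse without any problem.

Rows 2 and 3 (the first is truncated after three values; the second has an empty placeholder in the fourth column) can also be parsed by read_fwf without problems, although there will be warnings (like those the OP quoted) about the truncation in row 2 (and in row 4, which we deal with below):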

library(readr)

widths <- c(1,1,1,1,1)
path <- "ABCD.txt"

abc <- read_fwf(
  file = path,
  fwf_widths(widths),
  col_types = "ccccc"
  )

abc

Output:

Warning: 3 parsing failures.
row col  expected    actual       file
  2  X4 1 chars   0         'ABCD.txt'
  2  -- 5 columns 4 columns 'ABCD.txt'
  4  X5 1 chars   0         'ABCD.txt'

# A tibble: 5 x 5
  X1    X2    X3    X4    X5   
  <chr> <chr> <chr> <chr> <chr>
1 1     A     B     C     D    
2 2     A     B     NA    NA   
3 3     A     B     NA    D    
4 4     A     B     D     NA   
5 5     A     B     C     D

Note that read_fwf fills the truncated rows with missing values (NA).
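To find the truncated rows without relying on the parsing warnings, one option is to compare each line's length with the total record width. This is a minimal base R sketch, assuming ASCII content; it recreates the five-line sample file in a temporary location.

```r
# Recreate the sample file from this answer in a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c("1ABCD", "2AB", "3AB D", "4ABD", "5ABCD"), path)

widths <- c(1, 1, 1, 1, 1)
lines <- readLines(path)

# A line shorter than the full record width has been truncated
truncated <- nchar(lines) < sum(widths)
which(truncated)   # rows 2 and 4
```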

read.fwf with fill = TRUE also works, although it is slower and does not throw any warnings:

abc <- read.fwf(
  path,
  widths =  widths,
  colClasses = "character",
  na.strings = c(" ","NA"),
  fill = TRUE
  )

abc

Output:

  V1 V2 V3   V4   V5
1  1  A  B    C    D
2  2  A  B <NA> <NA>
3  3  A  B <NA>    D
4  4  A  B    D <NA>
5  5  A  B    C    D
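The same NA-filled result can also be reached by padding short lines to the full record width before splitting them. This is a base R sketch, not what read_fwf or read.fwf do internally; it assumes ASCII content and recreates the sample file in a temporary location.

```r
# Recreate the sample file from this answer in a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c("1ABCD", "2AB", "3AB D", "4ABD", "5ABCD"), path)

widths <- c(1, 1, 1, 1, 1)
lines <- readLines(path)

# Right-pad every line with spaces up to the full record width
padded <- sprintf(paste0("%-", sum(widths), "s"), lines)

# Start and end positions of each field, derived from the widths
starts <- cumsum(c(1, head(widths, -1)))
ends <- cumsum(widths)

# Cut each field out of the padded lines, one column at a time
fields <- lapply(seq_along(widths),
                 function(i) substring(padded, starts[i], ends[i]))
names(fields) <- paste0("V", seq_along(widths))
df <- as.data.frame(fields, stringsAsFactors = FALSE)

# Treat blank fields as missing
df[] <- lapply(df, function(x) {
  x <- trimws(x)
  x[x == ""] <- NA
  x
})
df
```

As with read.fwf, the D in row 4 still lands in the fourth column, because padding only adds spaces at the end of the line.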

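However, if we know that the value D belongs in the fifth column, neither approach parses row 4 "correctly". (That said, read_fwf and read.fwf have no way of knowing this, so strictly speaking there is no parsing error.)

There are several ways to handle this, but if the problem is consistent throughout the fwf file (for example, the last 25 of the 60 variables are missing in all such cases, as described in the question), one solution is to use dplyr to move the misplaced D value from the fourth column to the fifth (or, in the OP's case, from column 35 to column 60):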

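library(dplyr)

abc <- abc %>%
  mutate(
    # Rows where the last value landed one column too early
    shifted = is.na(V5) & !is.na(V4),
    # Move the value into V5, then blank out V4 for those rows only
    V5 = if_else(shifted, V4, V5),
    V4 = if_else(shifted, NA_character_, V4)
    ) %>%
  select(-shifted)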

abc

Output:

  V1 V2 V3   V4   V5
1  1  A  B    C    D
2  2  A  B <NA> <NA>
3  3  A  B <NA>    D
4  4  A  B <NA>    D
5  5  A  B    C    D
