r - What is wrong with this dataset?

Question

I am learning R and experimenting with this dataset: http://ww2.amstat.org/publications/jse/datasets/airport.dat.txt

Unfortunately, using

ap <- read.table("http://ww2.amstat.org/publications/jse/datasets/airport.dat.txt")

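gives wrong results. The file is a "free-format input file" as described here (http://data.princeton.edu/R/readingData.html). Following the example on that page, my simple code should work, but it does not: it produces broken lines and wrong entries. What is going wrong?

Thank you.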

score 1 · Accepted Answer

You have to use read.fwf and specify widths, like this:

read.fwf("http://ww2.amstat.org/publications/jse/datasets/airport.dat.txt",
 widths=c(21,21,7,7,9,10,15))

                       V1                    V2      V3     V4       V5        V6        V7
1   HARTSFIELD INTL       ATLANTA                285693 288803 22665665 165668.76  93039.48
2   BALTO/WASH INTL       BALTIMORE               73300  74048  4420425  18041.52  19722.93
3   LOGAN INTL            BOSTON                 114153 115524  9549585 127815.09  29785.72
4   DOUGLAS MUNI          CHARLOTTE              120210 121798  7076954  36242.84  15399.46
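
As a rough illustration of why read.table struggles here while read.fwf does not: read.table splits on whitespace by default, and airport names such as "HARTSFIELD INTL" contain spaces, so rows end up with different numbers of fields. The sketch below rebuilds two of the rows shown in this thread as fixed-width text, assuming the layout matches the widths given above, rather than downloading the file. The names rows, n_fields and ap_small are only illustrative.

```r
# Rebuild two rows as fixed-width text. The layout is an assumption based on
# the widths used in the answer: 21, 21, 7, 7, 9, 10, 15.
rows <- sprintf("%-21s%-21s%7d%7d%9d%10.2f%15.2f",
                c("HARTSFIELD INTL", "MIDWAY"),
                c("ATLANTA", "CHICAGO"),
                c(285693L, 64465L),
                c(288803L, 66389L),
                c(22665665L, 3547040L),
                c(165668.76, 4494.78),
                c(93039.48, 4485.58))

# Whitespace splitting: the two-word airport name yields one extra field.
con <- textConnection(rows)
n_fields <- count.fields(con)
close(con)
print(n_fields)

# Fixed-width reading: every row is cut at the same character positions.
con <- textConnection(rows)
ap_small <- read.fwf(con, widths = c(21, 21, 7, 7, 9, 10, 15),
                     strip.white = TRUE)
close(con)
print(ap_small)
```

If count.fields reports different counts for different lines, the file is not cleanly whitespace-delimited, and a fixed-width reader is usually the safer choice.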

score 0 · Accepted Answer

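Reading fixed-width files is always a challenge, because the user has to work out the width of each column. For this kind of task I use functions from the readr package, which make the process easier.

The main function for reading fixed-width files is read_fwf. There is also a function called fwf_empty that helps the user "guess" the column widths, although it does not always identify them correctly. Here is an example.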

# Load package
library(readr)

# URL of the data file
filepath <- "http://ww2.amstat.org/publications/jse/datasets/airport.dat.txt"

# Guess based on position of empty columns
col_pos <- fwf_empty(filepath)

# Read the data
dat <- read_fwf(filepath, col_positions = col_pos)

# Check the data frame
head(dat) 

# A tibble: 6 × 6
               X1                           X2     X3       X4        X5        X6
            <chr>                        <chr>  <int>    <int>     <dbl>     <dbl>
1 HARTSFIELD INTL ATLANTA               285693 288803 22665665 165668.76  93039.48
2 BALTO/WASH INTL BALTIMORE              73300  74048  4420425  18041.52  19722.93
3      LOGAN INTL BOSTON                114153 115524  9549585 127815.09  29785.72
4    DOUGLAS MUNI CHARLOTTE             120210 121798  7076954  36242.84  15399.46
5          MIDWAY CHICAGO                64465  66389  3547040   4494.78   4485.58
6     O'HARE INTL CHICAGO               322430 332338 25636383 300463.80 140359.38

fwf_empty identifies all of the columns correctly except columns 2 and 3, which it treats as a single column. So some extra work is needed.
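
To make the failure mode concrete, here is a small base R sketch of guessing by empty columns: a character position counts as a separator only if it is blank in every row. This is not readr's actual implementation, and the sample rows are hypothetical (they are not taken from the airport file). They are built so that a long city name in one row and a 7-digit number in another leave no position between columns 2 and 3 that is blank in all rows.

```r
# Hypothetical fixed-width rows: name (12 chars), city (10 chars), count (7 chars).
rows <- sprintf("%-12s%-10s%7d",
                c("ALPHA INTL", "BETAFIELD"),
                c("LONGCITYNM", "TOWN"),
                c(123L, 1234567L))

# One row per line, one matrix column per character position.
chars <- do.call(rbind, strsplit(rows, ""))

# A position separates fields only if it is blank in every row.
blank <- apply(chars == " ", 2, all)
print(which(blank))

# Runs of non-blank positions are the guessed fields.
runs <- rle(blank)
n_guessed <- sum(!runs$values)
print(n_guessed)
```

Only two fields are found here even though the rows were built from three, because the city and count columns touch when taken over all rows. Whether this is exactly what happens in the airport file has not been checked.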

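The output of fwf_empty is a list of 4 elements holding the identified begin and end positions, the skip value, and the column names. We have to update the begin and end positions to account for columns 2 and 3 being separate.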

# Extract the begin position
Begin <- col_pos$begin

# Extract the end position
End <- col_pos$end

# Update the position information
Begin <- c(Begin[1:2], 43, Begin[3:6])
End <- c(End[1], 42, End[2:6])

# Update col_pos
col_pos$begin <- Begin
col_pos$end <- End
col_pos$col_names <- paste0("X", 1:7)

Now we read the data again.

dat2 <- read_fwf(filepath, col_positions = col_pos)
head(dat2)

# A tibble: 6 × 7
               X1        X2     X3     X4       X5        X6        X7
            <chr>     <chr>  <int>  <int>    <int>     <dbl>     <dbl>
1 HARTSFIELD INTL   ATLANTA 285693 288803 22665665 165668.76  93039.48
2 BALTO/WASH INTL BALTIMORE  73300  74048  4420425  18041.52  19722.93
3      LOGAN INTL    BOSTON 114153 115524  9549585 127815.09  29785.72
4    DOUGLAS MUNI CHARLOTTE 120210 121798  7076954  36242.84  15399.46
5          MIDWAY   CHICAGO  64465  66389  3547040   4494.78   4485.58
6     O'HARE INTL   CHICAGO 322430 332338 25636383 300463.80 140359.38

This time read_fwf reads the file successfully.
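
The position update can also be checked with plain arithmetic in base R. The sketch below assumes the widths from the read.fwf answer (21, 21, 7, 7, 9, 10, 15) and assumes that begin is a 0-based offset while end marks the last character of the field; that is my understanding of the list readr uses, but treat it as an assumption. The variable names are illustrative.

```r
# Column widths taken from the read.fwf answer in this thread.
widths <- c(21, 21, 7, 7, 9, 10, 15)

# Assumed convention: begin is 0-based, end is the last character (1-based).
end   <- cumsum(widths)
begin <- end - widths

# Simulate a guess that merged columns 2 and 3 into one field.
merged_begin <- begin[-3]
merged_end   <- end[-2]

# Split the merged field again at the boundary implied by the widths.
split_at    <- 42
fixed_begin <- append(merged_begin, split_at, after = 2)
fixed_end   <- append(merged_end,   split_at, after = 1)

# Cut one reconstructed line with the repaired positions.
# substring() is 1-based and inclusive, hence begin + 1.
line <- sprintf("%-21s%-21s%7d%7d%9d%10.2f%15.2f",
                "HARTSFIELD INTL", "ATLANTA",
                285693L, 288803L, 22665665L, 165668.76, 93039.48)
fields <- trimws(substring(line, fixed_begin + 1, fixed_end))
print(fields)
```

The answer above uses 43 rather than 42 as the new begin. Under the assumed convention that skips one character of column 3, which only matters if a value fills the full 7-character column.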
