html - Reading a fixed-width-format text table from an HTML page

Question

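I am trying to use R to read data from a table like the one at http://www.fec.gov/pubrec/fe1996/hraz.htm, but I have not been able to make any progress. I realize that I need to use XML and RCurl for this, but despite the many examples on the web that deal with similar problems, I cannot solve this one.

The first problem is that the table only looks like a table when viewed; it is not coded as one. Treating it as an xml document, I can access the "data" in the table, but because I want to get several tables, I don't think this is the most elegant solution.

Treating it as an html document might work better, but I am relatively unfamiliar with xpathApply and don't know how to get at the actual "data" in the table, since it is not enclosed in anything (i.e. i-/i or b-/b).

I have had some success with xml files in the past, but this is my first attempt at doing something similar with html files. These files in particular seem to have less structure than the other examples I have seen.

Any help is much appreciated.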

score 1 · Accepted Answer

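Assuming you can read the html output into a text file (the equivalent of copy+pasting from your web browser), this should get you a good chunk of the way there: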

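# x is the output from the website, held as a single character string
# (lines joined with "\n"); strsplit(x, ...)[[1]] below only uses the first element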


library(stringr)
library(data.table)

# First, remove commas from numbers (easiest to do at beginning)
x <- gsub(",([0-9])", "\\1", x)

# split the data by District
districts <- strsplit(x, "DISTRICT *")[[1]]

# separate out the header info
headerInfo <- districts[[1]]
districts <- tail(districts, -1)


# grab the straggling district number, use it as a name and remove it 

    # end of first line
    eofl <- str_locate(districts, "\n")[,2]

    # trim white space and assign as name
    names(districts) <- str_trim(substr(districts, 1, eofl))

    # remove first line
    districts <- substr(districts, eofl+1, nchar(districts))

# remove the trailing '-------' and trim white space
    districts <- str_trim(str_replace_all(districts, "---*", ""))

# Adjust delimiter (this is the tricky part)

    ## two or more spaces are a separator
    districts <- str_replace_all(districts, "  +", "\t")

    ## lines that are total tallies are missing two columns. 
    ##   thus, need to add two extra delims. After the first and third columns

        # this function pads the missing delimiters in one district's block of text
        padDelims <- function(section, splton) {
          # split into lines
          section <- strsplit(section, splton)[[1]]
          # identify lines starting with totals
          LinesToFix <- str_detect(section, "^Total")
          # pad appropriate columns
          section[LinesToFix] <- sub("(.+)\t(.+)\t(.*)?", "\\1\t\t\\2\t\t\\3", section[LinesToFix])

          # any rows missing delims, pad at end
          counts <- str_count(section, "\t")
          toadd  <- max(counts) - counts
          section[ ] <- mapply(function(s, p) if (p==0) return (s) else paste0(s, paste0(rep("\t", p), collapse="")), section, toadd) 

          # paste it back together and return
          paste(section, collapse=splton)
        }

    districts <- lapply(districts, padDelims, splton="\n")

    # reading the table and simultaneously adding the district column
    districtTables <- 
       lapply(names(districts), function(d) 
         data.table(read.table(text=districts[[d]], sep="\t"), district=d) )
    # ... or without adding district number: 
    ##       lapply(districts, function(d) data.table(read.table(text=d, sep="\t")))

    # flatten it 
    votes <- do.call(rbind, districtTables)
    setnames(votes, c("Candidate", "Party", "PrimVotes.Abs", "PrimVotes.Perc", "GeneralVotes.Abs", "GeneralVotes.Perc", "District") )
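
The code above takes `x` as given. A minimal sketch of one way to build it from a saved copy of the page text follows. The file name `page_file` and the few sample lines written to it are placeholders for illustration only, not the real layout of the FEC page. The point is that `readLines()` returns one element per line, while `strsplit(x, "DISTRICT *")[[1]]` only looks at the first element, so the lines need to be joined into one string first.

```r
# Hypothetical stand-in for the text saved from the browser; the content
# written here is illustrative and much simpler than the real page.
page_file <- tempfile(fileext = ".txt")
writeLines(c("HEADER LINE",
             "DISTRICT 1",
             "Salmon, Matt    R    33,672",
             "DISTRICT 2",
             "Pastor, Ed    D    29,969"),
           page_file)

# readLines() gives a character vector with one element per line.
page_lines <- readLines(page_file)

# Join the lines into a single string so that strsplit(x, ...)[[1]]
# sees the whole page rather than only its first line.
x <- paste(page_lines, collapse = "\n")

length(x)   # 1
```

`readLines()` can also be given a URL instead of a file path. In that case the result is the raw HTML source, so any tags around the table text would presumably have to be stripped before the splitting steps apply.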

Sample table:

 votes

                        Candidate      Party PrimVotes.Abs PrimVotes.Perc GeneralVotes.Abs GeneralVotes.Perc District
 1:                  Salmon, Matt          R         33672         100.00        135634.00             60.18        1
 2:            Total Party Votes:                    33672             NA               NA                NA        1
 3:                                                     NA             NA               NA                NA        1
 4:                     Cox, John     W(D)/D          1942         100.00         89738.00             39.82        1
 5:            Total Party Votes:                     1942             NA               NA                NA        1
 6:                                                     NA             NA               NA                NA        1
 7:         Total District Votes:                    35614             NA        225372.00                NA        1
 8:                    Pastor, Ed          D         29969         100.00         81982.00             65.01        2
 9:            Total Party Votes:                    29969             NA               NA                NA        2
10:                                                     NA             NA               NA                NA        2
...
51:                Hayworth, J.D.          R         32554         100.00        121431.00             47.57        6
52:            Total Party Votes:                    32554             NA               NA                NA        6
53:                                                     NA             NA               NA                NA        6
54:                  Owens, Steve          D         35137         100.00        118957.00             46.60        6
55:            Total Party Votes:                    35137             NA               NA                NA        6
56:                                                     NA             NA               NA                NA        6
57:              Anderson, Robert        LBT           148         100.00         14899.00              5.84        6
58:                                                     NA             NA               NA                NA        6
59:         Total District Votes:                    67839             NA        255287.00                NA        6
60:                                                     NA             NA               NA                NA        6
61:            Total State Votes:                   368185             NA       1356446.00                NA        6
                        Candidate      Party PrimVotes.Abs PrimVotes.Perc GeneralVotes.Abs GeneralVotes.Perc District
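
The result still contains the "Total ..." tally rows and the blank separator rows. If only candidate rows are wanted, one possible follow-up is to filter them out. The sketch below uses a small hand-built data frame, `votes_demo` (a hypothetical name), with a few values copied from the sample table, so that it runs on its own. The same logical index should carry over to `votes`, assuming the `Candidate` column is read as text.

```r
# Toy stand-in for a few rows of `votes`; values copied from the sample table.
votes_demo <- data.frame(
  Candidate     = c("Salmon, Matt", "Total Party Votes:", "",
                    "Cox, John", "Total District Votes:"),
  Party         = c("R", "", "", "W(D)/D", ""),
  PrimVotes.Abs = c(33672, 33672, NA, 1942, 35614),
  District      = c("1", "1", "1", "1", "1"),
  stringsAsFactors = FALSE
)

# Flag rows that are tallies ("Total ...") or blank separators.
is_summary <- grepl("^Total", votes_demo$Candidate) |
              trimws(votes_demo$Candidate) == ""

# Keep only the candidate rows.
candidates <- votes_demo[!is_summary, ]
candidates$Candidate   # "Salmon, Matt" "Cox, John"
```

Keeping the tally rows in a separate object instead of discarding them would allow a check that the candidate votes add up to the reported totals.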
