javascript - Using R to fill in a field on an online form and scrape the resulting JavaScript-created table

Question

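I'm trying to get R to complete the "Search by postcode" field on this webpage, http://cti.voa.gov.uk/cti/, with predefined text (e.g. BN1 1NA), advance to the next page, and scrape the resulting 4-column table, which, depending on the postcode, can span multiple pages. To make it more complex, the "Improvement indicator" is not a text field but an image file (as seen when searching with postcode BN1 3HP). I would like this column to contain 0 or 1, depending on whether the image is present.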

Ultimately, I'm after a tidy data frame that mirrors the 4 columns on screen.

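I have tried to adapt the suggestions from this question to do what I describe above, but with no luck, and to be honest I'm out of my depth trying to decipher it.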

I realise R may not be the best tool for what I need to do, but it is all I have available. Any help would be greatly appreciated.

score 5 · Accepted Answer

I'm not sure what the VOA website's T&Cs say about scraping, but this code will do the job:

library("httr")
library("rvest")
post_code <- "B1 1"
# Submit the search form directly, mimicking the POST request the browser sends
resp <- POST("http://cti.voa.gov.uk/cti/InitS.asp?lcn=0",
             encode = "form",
             body = list(btnPush = 1,
                         txtPageNum = 0,
                         txtPostCode = post_code,
                         txtRedirectTo = "InitS.asp",
                         txtStartKey = 0))
# Parse the returned HTML and pull the results table into a data frame
resp_cont <- read_html(resp)
council_table <- resp_cont %>%
  html_node(".scl_complex table") %>%
  html_table

Firebug has an excellent "Net" panel where the POST headers can be inspected. Most modern browsers also have something similar built in.
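
The question also asks for the "Improvement indicator" column to hold 0 or 1 depending on whether an image is present. html_table only extracts cell text, so the image has to be detected separately. The following is a minimal sketch of one way to do that with rvest; the sample markup, the column position (third column) and the variable names are assumptions made for illustration, not the real page's structure.

```r
library(rvest)

# Hypothetical stand-in for the results table; the real page's markup may differ.
sample_html <- '<table>
  <tr><th>Address</th><th>Band</th><th>Improvement indicator</th><th>Ref</th></tr>
  <tr><td>1 Example St</td><td>A</td><td><img src="tick.gif"></td><td>123</td></tr>
  <tr><td>2 Example St</td><td>B</td><td></td><td>456</td></tr>
</table>'

page <- read_html(sample_html)

# All table rows except the header row
rows <- html_nodes(page, "tr")[-1]

# TRUE for each row whose third cell contains an <img> element
has_image <- vapply(
  rows,
  function(row) length(html_nodes(row, "td:nth-child(3) img")) > 0,
  logical(1)
)
improvement <- as.integer(has_image)

# Read the table text, then overwrite the image column with the 0/1 flags
result <- html_table(html_node(page, "table"))
result[["Improvement indicator"]] <- improvement
```

With a live response, the same idea would be applied to the parsed page in place of sample_html, after checking which column actually holds the image.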

score 4 · Accepted Answer

I used RSelenium to scrape the council tax list for an Exeter postcode:

library(RSelenium)
library(XML)  # provides htmlParse, xpathSApply and readHTMLTable
input = 'EX4 2NU'
appURL <- "http://cti.voa.gov.uk/cti/"
# Start a local Selenium server. Note: startServer() is defunct in recent
# RSelenium releases, where rsDriver() is used instead.
RSelenium::startServer()
remDr <- remoteDriver()
remDr$open()
Sys.sleep(5)
remDr$navigate(appURL)
# Type the postcode into the search box and submit it with Enter
search.form <- remDr$findElement(using = "xpath", "//*[@id='txtPostCode']")
search.form$sendKeysToElement(list(input, key = "enter"))
# Parse the first page of results
doc <- remDr$getPageSource()
tbl = xpathSApply(htmlParse(doc[[1]]),'//tbody')
temp1 = readHTMLTable(tbl[[1]],header=F)

# Keep clicking "next" while the link exists, appending each page's rows
v = length(xpathSApply(htmlParse(doc[[1]]),'//a[@class="next"]'))
while (v != 0) {
    nextpage <- remDr$findElement(using = "xpath", "//*[@class = 'next']")
    nextpage$clickElement()
    doc <- remDr$getPageSource()
    tbl = xpathSApply(htmlParse(doc[[1]]),'//tbody')
    temp2 = readHTMLTable(tbl[[1]],header=F)
    temp1 = rbind(temp1,temp2)
    v = length(xpathSApply(htmlParse(doc[[1]]),'//a[@class="next"]'))
}
finaltable = temp1

Hope this helps. With this approach you can scrape data that spans multiple pages.
