
For a while now I have been using R and the RCurl package to automatically download information from web pages; I normally use simple functions like getURL(), getForm() and postForm(). I usually just find the HTML form parameters and their possible values and fill them in. However, I have come across a web page which I think cannot be downloaded with those functions, because I cannot find any parameters in the URL. I believe this is because the page is driven by JavaScript, and I don't know how to deal with that. I am a mathematician with vast experience using R but only a very basic knowledge of HTML and no knowledge at all of JavaScript.
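For context, my usual workflow looks roughly like the sketch below (the URL and field names here are made up, purely to illustrate how I normally fill in the form parameters found in a page's HTML):

library(RCurl)

# Hypothetical example of what I normally do: read the form's field names
# from the page source and pass them as named arguments to postForm()/getForm().
res <- postForm("http://example.com/consulta",
                fecha = "20130502",
                plazo = "28")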

I don't necessarily need to use R directly; I could use other software and then import the results into R. I found a Mozilla tool called MozRepl but was unable to make it work. I would appreciate it if someone with more experience could help me towards a solution, whether with different software or with the appropriate commands in R or MozRepl. If it is not possible to download the information directly into an R variable, saving it to a text file would also be fine.

The information I want to download is produced by selecting a date at the following URL and then clicking the button labeled "Consultar TIIE". A table is then displayed with the columns "Posturas", "Montos" and "Participantes".

http://www.banxico.org.mx/tiieban/leeArgumentos.faces?BMXC_plazo=28&BMXC_semanas=4

I am doing this because my final objective is to put the information together into a data frame.
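Ideally I would end up with something like the sketch below, where fetchTIIE() is a hypothetical placeholder for whatever solution retrieves the table for one date (that retrieval is exactly the part I do not know how to write):

dates <- c("20130502", "20130503", "20130506")

# Combine the table returned for each date into a single data frame;
# fetchTIIE() is hypothetical and stands for the missing download step.
tiie <- do.call(rbind, lapply(dates, function(d) {
  tab <- fetchTIIE(d)
  cbind(Fecha = d, tab)
}))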


2 Answers


The JavaScript is not the problem here. The JavaScript simply builds the calendar so you can pick the date that gets submitted with the form. There are, however, plenty of other issues.

On the server side, they appear to try to detect attempts to pull the data without a browser. In addition, once the form is submitted correctly they issue a redirect, which causes further problems.

require(RCurl)
require(XML)

appDate <- "20130502"
rURL    <- "http://www.banxico.org.mx/tiieban/leeArgumentos.faces?BMXC_plazo=28&BMXC_semanas=4"
usera   <- "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:21.0) Gecko/20100101 Firefox/21.0"

## Curl handle that keeps cookies, follows the redirect issued after the POST
## (followlocation/postredir) and sends browser-like headers so the server
## does not reject the request.
curl <- getCurlHandle(cookiefile = "", verbose = TRUE, useragent = usera,
                      followlocation = TRUE, autoreferer = TRUE, postredir = 2,
                      httpheader = c(Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                                     "Accept-Encoding" = "gzip, deflate",
                                     "Accept-Language" = "en-US,en;q=0.5",
                                     Connection = "keep-alive"),
                      referer = "http://www.banxico.org.mx/tiieban/leeArgumentos.faces")

## Initial GET to establish the session (cookies) before posting the form.
txt <- getURLContent(rURL, curl = curl, verbose = TRUE)

## Form parameters; the names must already be URL-encoded (":" becomes "%3A").
fParams <- structure(c(appDate, "Consultar+TIIE", "leeArgumentos"),
                     .Names = c("leeArgumentos%3Afecha", "leeArgumentos%3Asubmit", "leeArgumentos"))

res  <- postForm(rURL, .params = fParams, style = "post", curl = curl, binary = TRUE)
xRes <- htmlParse(rawToChar(res))

## The quotes of interest are in the third table of the returned page.
readHTMLTable(getNodeSet(xRes, "//*/table")[[3]])

  Posturas Montos                      Participantes
1   4.3100    350 Banco Credit Suisse (México), S.A.
2   4.3245    350                 Banco Inbursa S.A.
3   4.3200    350                   Banco Invex S.A.
4   4.3375    350     Banco Mercantil del Norte S.A.
5   4.3350    350      Banco Nacional de México S.A.
6   4.3250    350                   HSBC México S.A.
7   4.3300    350          ScotiaBank Inverlat, S.A.

There is a lot going on here. The form parameters need to be URL-encoded: leeArgumentos:fecha becomes leeArgumentos%3Afecha, for example. The server also seems to check the user agent, the referer string and various other headers.
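If you prefer not to encode the parameter names by hand, base R's URLencode() produces the same result (a small illustration, not part of the solution above):

# URLencode with reserved = TRUE escapes ":" as "%3A"
URLencode("leeArgumentos:fecha", reserved = TRUE)
# [1] "leeArgumentos%3Afecha"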

Answered 2013-06-18T14:37:06.733

This does look like a JavaScript issue rather than something directly related to web scraping in R.

There are several ways to approach this; you could look at the suggestions in "Scraping Javascript generated data" and "Language for web scraping JAVASCRIPT content".

The example you point to seems to run a custom script, show_calendar2, defined here: http://www.banxico.org.mx/tiieban/scripts/ts_picker2.js

Answered 2013-06-17T17:14:02.413