r - How can I scrape this data?

Question

I want to scrape the statistics from this page:

url <- "http://www.pgatour.com/players/player.20098.stuart-appleby.html/statistics"

Specifically, I want the data in the table below Stuart's headshot, the one titled "Stuart Appleby - 2015 STATS PGA TOUR".

I tried using rvest together with SelectorGadget (http://selectorgadget.com/).

url_html <- url %>% html()
url_html %>% 
        html_nodes(xpath = '//*[(@id = "playerStats")]//td')

'should' get me the table without, for example, the row at the top that says "Recap - Rank - Additional Stats".

url_html <- url %>% html()
url_html %>% 
    html_nodes(xpath = '//*[(@id = "playerStats")] | //th//*[(@id = "playerStats")]//td')

'should' get me the table including that "Recap - Rank - Add'l Stats" row.

Neither works.

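Obviously I'm a complete newbie at web scraping. When I click "View Source" on that page, the data contained in the table isn't there.

In the source, at the point where I think the table should start, there is this code: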

<script id="playerStatsTourTemplate" type="text/x-jquery-tmpl">
    {{each(t, tour) tours}}
        {{if pgatour.players.shouldProcessTour(tour.tourCodeLC)}}
        <div class="statistics-head">
            <h2 class="title">Stuart&nbsp;Appleby - <b>${year} STATS 
.
.
.

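So it seems the table is stored somewhere (JSON? jQuery? JavaScript? Do those terms even apply here?) that the html() function can't reach. Is there a way to use rvest to get this data? Is there an equivalent of rvest for scraping data stored this way?

Thanks.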

score 2 · Accepted Answer

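I would probably use the GET request that the page itself makes to fetch the raw data from their API, and then parse that...

content(a) gives you a list representation... basically the output of fromJSON()
or
as(a, "character") gives you the raw JSON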

library("httr")
a <- GET("http://www.pgatour.com/data/players/20098/2014stat.json")
content(a)
as(a, "character")
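Once the JSON is in hand, it usually needs to be reshaped into a table. The following is a minimal sketch of that step using jsonlite. The inline `raw_json` string is a made-up stand-in for the API response, and its field names (`stats`, `name`, `value`, `rank`) are assumptions for illustration only; the real payload's structure may differ and should be inspected first, for example with `str(content(a))`.

```r
library(jsonlite)

# Hypothetical sample payload; the field names and values are placeholders,
# not the actual structure returned by the PGA Tour endpoint.
raw_json <- '{
  "player": "Example Player",
  "stats": [
    {"name": "Stat A", "value": "1.5", "rank": "10"},
    {"name": "Stat B", "value": "2.5", "rank": "20"}
  ]
}'

# fromJSON() simplifies an array of objects into a data frame by default.
parsed <- fromJSON(raw_json)
stats <- parsed$stats

# In this sample the numbers arrive as strings, so convert them explicitly.
stats$value <- as.numeric(stats$value)
stats$rank <- as.integer(stats$rank)

str(stats)
```

With a live response, the JSON text can be obtained with `content(a, as = "text")` and passed to `fromJSON()` in place of `raw_json`.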

score 1 · Accepted Answer

Take a look at this.

An open-source project on GitHub that scrapes PGA data: https://github.com/zachwill/golf/blob/master/pga.py
