xml - Scrape a web page and the links on it, and build a table with R

Question

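Hello, I'm new to using R to scrape data from the Internet and, unfortunately, know very little about HTML and XML. I'm trying to scrape each story link on the following parent page: http://www.who.int/csr/don/archive/year/2013/en/index.html. I don't care about any of the other links on the parent page, but I need to create a table with one row per story URL and columns for the corresponding URL, the story title, the date (it is always at the beginning of the first sentence after the story title), and then the rest of the page's text (which can be several paragraphs).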

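I've tried to adapt the code from "Scraping a wiki page for the 'Periodic table' and all the links" (and several related threads), but have run into difficulties. Any advice or pointers would be greatly appreciated. Here's what I've tried so far (with "?????" marking the places where I'm stuck):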

rm(list=ls())
library(XML)
library(plyr) 

url = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url)

links = getNodeSet(doc, ?????)

df = ldply(doc, function(x) {
  text = xmlValue(x)
  if (text=='') text=NULL

  symbol = xmlGetAttr(x, '?????')
  link = xmlGetAttr(x, 'href')
  if (!is.null(text) & !is.null(symbol) & !is.null(link))
    data.frame(symbol, text, link)
} )

df = head(df, ?????)

score 7 · Accepted Answer

Given an XPath expression, you can use xpathSApply (the lapply equivalent) to search your document.

library(XML)
url = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url)
# One row per story link; on the archive page the link text is the date
dat <- data.frame(
  dates = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue),
  hrefs = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr, 'href'),
  story = xpathSApply(doc, '//*[@class="link_info"]/text()', xmlValue),
  stringsAsFactors = FALSE)
dat

##               dates                                                hrefs
## 1      26 June 2013             /entity/csr/don/2013_06_26/en/index.html
## 2      23 June 2013             /entity/csr/don/2013_06_23/en/index.html
## 3      22 June 2013             /entity/csr/don/2013_06_22/en/index.html
## 4      17 June 2013             /entity/csr/don/2013_06_17/en/index.html

##                                                                                    story
## 1                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 2                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 3                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 4                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
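
To see what those three XPath expressions select without hitting the network, here is a minimal offline sketch. The HTML string is a hypothetical stand-in that only mimics the structure the expressions assume (a list with class "auto_archive" whose items hold a link plus a "link_info" element); the real page's markup may differ, and the name `toy` is only illustrative.

```r
library(XML)

# Hypothetical markup, written to match what the XPath expressions expect
html <- '
<html><body>
<ul class="auto_archive">
  <li><a href="/entity/csr/don/2013_06_26/en/index.html">26 June 2013</a>
      <span class="link_info">MERS-CoV - update</span></li>
  <li><a href="/entity/csr/don/2013_06_23/en/index.html">23 June 2013</a>
      <span class="link_info">MERS-CoV - update</span></li>
</ul>
</body></html>'

# asText = TRUE tells htmlParse the argument is content, not a file or URL
doc <- htmlParse(html, asText = TRUE)

# Link text, link target, and the text node inside each link_info element
dates <- xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue)
hrefs <- xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr, 'href')
story <- xpathSApply(doc, '//*[@class="link_info"]/text()', xmlValue)

toy <- data.frame(dates, hrefs, story, stringsAsFactors = FALSE)
```

Because each expression is evaluated separately, the columns only line up if every list item has exactly one link and one "link_info" element; if the vectors differ in length, data.frame will stop with an error rather than misalign rows silently.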

Edit: adding the text of each story

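# Fetch each story page and keep the text of its main content block
dat$text = unlist(lapply(dat$hrefs, function(x) {
  # hrefs are site-relative: swap the '/entity' prefix for the host name
  url.story <- gsub('/entity', 'http://www.who.int', x)
  xpathSApply(htmlParse(url.story), '//*[@id="primary"]', xmlValue)
}))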
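
One thing to watch in the step above: unlist() only yields one value per story if the XPath matches exactly one node on every page. The sketch below is one possible way to guard against that, not part of the original answer. The helpers `to_story_url` and `collapse_text` are hypothetical names introduced here for illustration.

```r
# Hypothetical helper: turn a site-relative href into an absolute URL.
# Anchoring the pattern with '^' only rewrites a leading '/entity'.
to_story_url <- function(href, host = 'http://www.who.int') {
  sub('^/entity', host, href)
}

# Hypothetical helper: collapse zero or more matched text pieces into
# exactly one string, so each story contributes one table cell.
collapse_text <- function(pieces) {
  if (length(pieces) == 0) return(NA_character_)
  paste(trimws(unlist(pieces)), collapse = '\n')
}

# Small offline check with made-up inputs
example_url <- to_story_url('/entity/csr/don/2013_06_26/en/index.html')
example_txt <- collapse_text(c('  first block ', 'second block'))
```

With these in place, the per-story step could be written as `dat$text <- vapply(dat$hrefs, function(h) collapse_text(xpathSApply(htmlParse(to_story_url(h)), '//*[@id="primary"]', xmlValue)), character(1))`, which always returns exactly one string (or NA) per row.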
