r - R: XPath expression returns links outside of the selected element

Question

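I am using R to scrape the links from the main table on that page, using XPath syntax. The main table is the third table on the page, and I want only the links that point to magazine articles.

My code is as follows: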

require(XML)
(x = htmlParse("http://www.numerama.com/magazine/recherche/125/hadopi/date"))
(y = xpathApply(x, "//table")[[3]])
(z = xpathApply(y, "//table//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href"))
(links = unique(z))

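If you look at the output, the final links come from the sidebar rather than from the main table, even though I selected the main table in the third line by making the object y contain only the third table.

What am I doing wrong? What is the correct or more efficient way to write this with XPath?

Note: written by an XPath novice.

Answered (really quickly), thanks very much! My solution is below.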

extract <- function(x) {
    message(x)
    html = htmlParse(paste0("http://www.numerama.com/magazine/recherche/", x, "/hadopi/date"))
    html = xpathApply(html, "//table")[[3]]
    html = xpathApply(html, ".//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")
    html = gsub("#ac_newscomment", "", html)
    html = unique(html)
}

d = lapply(1:125, extract)
d = unlist(d)
write.table(d, "numerama.hadopi.news.txt", row.names = FALSE)

This saves all the links to news items on this website that carry the keyword "Hadopi".

score 4 · Accepted Answer

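You need to start the pattern with "." if you want to restrict the search to the current node. A leading "/" goes back to the start of the document (even if the root node is not in y).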

xpathSApply(y, ".//a/@href" )
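The difference can be reproduced offline on a small document. The following is only a minimal sketch: the inline HTML and the /side/1 and /main/1 links are made up for illustration and are not taken from the page in the question.

```r
library(XML)

# Hypothetical two-table page: a "sidebar" table followed by a "main" table.
html <- '<html><body>
  <table><tr><td><a href="/side/1">side</a></td></tr></table>
  <table><tr><td><a href="/main/1">main</a></td></tr></table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Keep only the second table, as y keeps only the third one in the question.
y <- xpathApply(doc, "//table")[[2]]

# "//a" starts again from the document root, so sidebar links come back too.
abs_links <- unname(xpathSApply(y, "//a/@href"))

# ".//a" searches only the descendants of the node y.
rel_links <- unname(xpathSApply(y, ".//a/@href"))

print(abs_links)
print(rel_links)
```

Here abs_links should hold the links from both tables, while rel_links should hold only the link inside the selected table.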

Alternatively, you can extract the third table directly with XPath:

xpathApply(x, "//table[3]//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")
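One caveat is worth checking on the real page: in XPath, //table[3] matches any table that is the third table child of its own parent, whereas (//table)[3] matches the third table in document order, which is what xpathApply(x, "//table")[[3]] returns. The two agree only when the tables are siblings. A minimal sketch of the difference, assuming a made-up layout in which each table sits in its own div (the id values t1 to t3 are hypothetical):

```r
library(XML)

# Hypothetical layout: three tables, each wrapped in its own <div>.
html <- '<html><body>
  <div><table id="t1"><tr><td>a</td></tr></table></div>
  <div><table id="t2"><tr><td>b</td></tr></table></div>
  <div><table id="t3"><tr><td>c</td></tr></table></div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Position is counted per parent: no <div> has a third <table> child.
per_parent <- xpathSApply(doc, "//table[3]", xmlGetAttr, "id")

# Parentheses apply the position to the whole node set, in document order.
whole_doc <- xpathSApply(doc, "(//table)[3]", xmlGetAttr, "id")

print(length(per_parent))
print(whole_doc)
```

If the tables on the target page are nested in separate containers like this, the parenthesised form followed by the rest of the path, for example (//table)[3]//a/@href, is the safer way to express "the third table on the page".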
