r - Scraping the first two columns of a web table using xml2

Question

I have been struggling with the XML packages in R, and I need some help using xml2 to scrape some well-formatted tables.

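The URL for the first page of tables I want to scrape is here. On some pages I want the second and third tables, but on others I want the first and second. The common thread is that I want every table whose caption tag contains the text "that meet" scraped and stored in one list, and every table whose caption tag contains the text "that do not meet any" stored in a second list. I really don't know how to do this. The code I am working with is below. I imagine there must be some way to make a regular expression the condition for selecting a whole table. Hopefully the code works.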

library(xml2)

# Define the URLs for batches 1 to 12
urls <- lapply(seq(1, 12, 1), function(x) paste('http://www.chemicalsubstanceschimiques.gc.ca/challenge-defi/batch-lot-', x, '/index-eng.php', sep=''))
# Scrape each page
batches <- lapply(urls, function(x) read_html(x))
# Return the tables from each page
batches_tables <- lapply(batches, function(x) xml_find_all(x, './/table'))
# Get the tables from the first page
out <- batches_tables[[1]]
# Inspect
out[[1]] # do not want this table
out[[2]] # want this table pasted in one list, caption='that meet'
out[[3]] # want this table pasted in a second list, caption='that do not meet'

score 0 · Accepted Answer

Target the caption tag using contains(), then move up to the parent tag:

library(xml2)
library(rvest)

URL <- "http://www.chemicalsubstanceschimiques.gc.ca/challenge-defi/batch-lot-1/index-eng.php#s1"
pg <- read_html(URL)

html_nodes(pg, xpath=".//table/caption[contains(., 'that meet')]/..")
## {xml_nodeset (1)}
## [1] <table class="fontSize80">&#13;\n          <caption>&#13;\n          ...

html_nodes(pg, xpath=".//table/caption[contains(., 'that do not meet')]/..")
## {xml_nodeset (1)}
## [1] <table class="fontSize85">&#13;\n          <caption>&#13;\n          ...
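
To get from the matched table nodes to the first two columns mentioned in the title, one option is to pass them to html_table() and subset the result. The sketch below is only an illustration under stated assumptions: the inline HTML is a made-up stand-in for a batch page (its captions and cell values are invented), and tables_by_caption() is a hypothetical helper name, not part of xml2 or rvest. Note that XPath 1.0 has no regular expressions, so contains() does a plain substring match.

```r
library(xml2)
library(rvest)

# Made-up stand-in for one batch page (assumption: the real pages have a
# similar layout, i.e. data tables that each carry a <caption>).
page <- read_html('
<html><body>
  <table><tr><td>layout table without a caption</td></tr></table>
  <table>
    <caption>Substances that meet the criteria</caption>
    <tr><th>ID</th><th>Name</th><th>Notes</th></tr>
    <tr><td>A-1</td><td>Substance A</td><td>n/a</td></tr>
    <tr><td>A-2</td><td>Substance B</td><td>n/a</td></tr>
  </table>
  <table>
    <caption>Substances that do not meet any of the criteria</caption>
    <tr><th>ID</th><th>Name</th><th>Notes</th></tr>
    <tr><td>B-1</td><td>Substance C</td><td>n/a</td></tr>
  </table>
</body></html>')

# Hypothetical helper: find every table whose caption contains `text`,
# convert each to a data frame, and keep only the first two columns.
# `text` must not contain a single quote, since it is pasted into the XPath.
tables_by_caption <- function(pg, text) {
  xpath <- sprintf(".//table/caption[contains(., '%s')]/..", text)
  nodes <- html_nodes(pg, xpath = xpath)
  lapply(html_table(nodes), function(tbl) tbl[, 1:2])
}

meet     <- tables_by_caption(page, "that meet")
not_meet <- tables_by_caption(page, "that do not meet")
```

Because "that meet" is not a substring of "that do not meet", the two lists do not overlap. With the batches list from the question, the same helper could be applied to every page, for example lapply(batches, tables_by_caption, text = "that meet").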
