html - How to retrieve multiple tables from a web page using R

Question

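I want to use R to extract all of the vaccine tables, together with the description to the left of each table and the descriptions inside it.

Here is the link to the web page

This is what the first table on the page looks like:

I tried the XML package, without success. I used: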

vup<-readHTMLTable("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro", which=5)

I get an error:


Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
In addition: Warning message:
XML content does not seem to be XML: ''

How can I do this?

score 1 · Accepted Answer

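This web page does not use HTML tables, which is why you get the error. The page's markup is quite complex, with multiple subsections and hidden text, so the nodes of interest have to be located individually.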
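
One quick way to confirm this kind of layout is to count the `<table>` elements in the parsed page. The sketch below does not fetch the real page; it parses a small made-up HTML string (the class names `chart_row`, `name` and `stage` here are stand-ins, not taken from the site) that is laid out with `<div>` elements the same way.

```r
library(rvest)

# Hypothetical stand-in for a page built from <div> elements instead of <table>
html <- '<html><body>
  <div class="chart_row"><div class="name">A</div><div class="stage">Phase 1</div></div>
  <div class="chart_row"><div class="name">B</div><div class="stage">Phase 2</div></div>
</body></html>'

page <- read_html(html)

# Count <table> elements; zero means a table reader has nothing to parse
n_tables <- length(html_nodes(page, "table"))

# Count the <div> "rows" that hold the data instead
n_rows <- length(html_nodes(page, "div.chart_row"))

print(n_tables)
print(n_rows)
```

If the table count is zero, as it is for this snippet, the data has to be collected node by node.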

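I prefer the "rvest" and "xml2" packages for their simpler, more direct syntax.
This is not a complete solution, but it should get you moving in the right direction.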

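library(rvest)
library(xml2)   # xml_parent() and xml_length() come from xml2
library(dplyr)

# read the page (same URL as in the question)
page <- read_html("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro")

# find the top of the vaccine section
parentvaccine <- page %>% html_node(xpath="//div[@id='vaccines_intro']") %>% xml_parent()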

# find the vaccine rows
vaccines <- parentvaccine %>% html_nodes(xpath = ".//div[@class='chart_row for_vaccines']")

#find info on each one
company <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_developer w-richtext']") %>% html_text()
product <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_vaccines w-richtext']") %>% html_text()
phase <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_stage']") %>% html_text()
misc <- vaccines %>% html_node(xpath = ".//div[@class='chart_row-expanded for_vaccines']") %>% html_text()


# determine the vaccine type of each row
# get the vaccine type headings
vaccinetypes <- parentvaccine %>% html_nodes(xpath = './/div[@class="chart-section for_vaccines"]') %>% 
   html_node('div.is_h3') %>% html_text()
# determine the number of vaccines in each category (one count per list)
lengthvector <- parentvaccine %>% html_nodes(xpath = './/div[@role="list"]') %>% xml_length()
# repeat each type once per vaccine in its category
# (assumes one role="list" div per category, in the same order as the headings)
VaccineType <- rep(vaccinetypes, times = lengthvector)

answer <- data.frame(VaccineType,  company, product, phase)
head(answer)
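
When category labels have to be lined up with rows, as in the VaccineType step above, it helps to keep in mind how `rep` treats `times` and `each`. A small sketch with made-up labels and counts (nothing here comes from the page):

```r
types  <- c("Type A", "Type B", "Type C")   # hypothetical category labels
counts <- c(2, 1, 3)                        # hypothetical number of rows per category

# times = <vector> repeats each label by its own count,
# so the result has sum(counts) elements, one per row
labels <- rep(types, times = counts)
print(labels)

# each = <single number> repeats every label the same number of times
uniform <- rep(types, each = 2)
print(uniform)
```

The length of the expanded label vector should equal the number of rows found, which is worth checking before calling `data.frame`.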

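To write code like this, you need to read through the page's HTML source and identify the right nodes and the attributes that uniquely identify the information you want.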
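
Part of that inspection can be done from R itself. As a rough sketch, tabulating the `class` attribute of every `<div>` shows which classes repeat once per row and are therefore good selector candidates. The HTML string and the names in it ("Acme", "Globex", the simplified class names) are made up for illustration.

```r
library(rvest)

# Hypothetical snippet with one <div> per row and one nested <div> per field
html <- '<html><body>
  <div class="chart_row"><div class="is_developer">Acme</div><div class="is_stage">Phase 1</div></div>
  <div class="chart_row"><div class="is_developer">Globex</div><div class="is_stage">Phase 2</div></div>
</body></html>'
page <- read_html(html)

# Tabulate the class attribute of every <div> to see which classes repeat
classes <- page %>% html_nodes("div") %>% html_attr("class")
class_counts <- table(classes)
print(class_counts)

# Use one of the discovered classes to pull the matching text
developers <- page %>% html_nodes("div.is_developer") %>% html_text()
print(developers)
```

On a real page the browser's developer tools are usually still needed to see how the sections nest, but a tally like this narrows down which attributes to target.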
