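I want to use R to extract all of the vaccine tables, including the description shown on the left of each table and the descriptions inside the table.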

This is the link to the web page: https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro

This is what the first table on the page looks like:

[screenshot of the first table on the page]

I tried the XML package, but without success. I used:

vup<-readHTMLTable("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro", which=5)

I get an error:


Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
In addition: Warning message:
XML content does not seem to be XML: '' 

How can I do this?

1 Answer

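This web page does not use tables, which is the cause of your error. Because of the many subsections and hidden text, the page's markup is quite complex, so each node of interest has to be located individually.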
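One way to check this kind of diagnosis is to count the `<table>` elements in the parsed document. The sketch below runs on a small inline HTML fragment rather than the live site, so it works offline. The fragment is invented for illustration and only loosely mimics a div-based layout; the real page's markup may differ.

```r
library(rvest)

# Hypothetical fragment: content is laid out with <div> elements, not <table>
html <- '<html><body>
  <div id="vaccines_intro"></div>
  <div class="chart_row for_vaccines">
    <div class="is_h5-2 is_stage">Phase I</div>
  </div>
</body></html>'
doc <- read_html(html)

# Count the <table> nodes; with none present, table readers have nothing to parse
n_tables <- length(html_nodes(doc, "table"))
print(n_tables)

# The same content is still reachable through the class attribute of the div
stage <- doc %>%
  html_node(xpath = "//div[@class='is_h5-2 is_stage']") %>%
  html_text()
print(stage)
```

If the count is zero, a table-oriented function is the wrong tool and the nodes have to be selected by their attributes instead.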

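I prefer the "rvest" and "xml2" packages for their simpler, more direct syntax.
This is not a complete solution, but it should get you moving in the right direction.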

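library(rvest)
library(xml2)   # provides xml_parent() and xml_length()
library(dplyr)

# read and parse the page (URL taken from the question)
page <- read_html("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro")

# find the top of the vaccine section
parentvaccine <- page %>% html_node(xpath="//div[@id='vaccines_intro']") %>% xml_parent()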

# find the vaccine rows
vaccines <- parentvaccine %>% html_nodes(xpath = ".//div[@class='chart_row for_vaccines']")

#find info on each one
company <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_developer w-richtext']") %>% html_text()
product <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_vaccines w-richtext']") %>% html_text()
phase <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_stage']") %>% html_text()
misc <- vaccines %>% html_node(xpath = ".//div[@class='chart_row-expanded for_vaccines']") %>% html_text()


# determine the vaccine type for each row
vaccinetypes <- parentvaccine %>% html_nodes(xpath = './/div[@class="chart-section for_vaccines"]') %>% 
   html_node('div.is_h3') %>% html_text()
# count the vaccines in each category (one count per role="list" container;
# this assumes one such container per category, in the same order)
lengthvector <- parentvaccine %>% html_nodes(xpath = './/div[@role="list"]') %>% xml_length()
# repeat each type once per vaccine in its category
VaccineType <- rep(vaccinetypes, times=lengthvector)

answer <- data.frame(VaccineType,  company, product, phase)
head(answer)

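Writing this code requires reading the page's HTML source and identifying the right nodes, and the attributes that uniquely mark them, for the information you want.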
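As a small offline illustration of that workflow, the sketch below selects nodes by their class attributes and labels each row with its category. The HTML fragment is hypothetical: it borrows the class names used in the code above, but the nesting (each category section containing one `role="list"` container of rows) is an assumption, not something confirmed against the real page.

```r
library(rvest)
library(xml2)

# Hypothetical markup: two categories, each with one role="list" container of rows
html <- '<html><body><div id="wrap">
  <div class="chart-section for_vaccines"><div class="is_h3">RNA</div>
    <div role="list">
      <div class="chart_row for_vaccines"><div class="is_h5-2 is_stage">Phase III</div></div>
      <div class="chart_row for_vaccines"><div class="is_h5-2 is_stage">Phase I</div></div>
    </div>
  </div>
  <div class="chart-section for_vaccines"><div class="is_h3">Protein subunit</div>
    <div role="list">
      <div class="chart_row for_vaccines"><div class="is_h5-2 is_stage">Phase II</div></div>
    </div>
  </div>
</div></body></html>'
doc <- read_html(html)

# Category names, one per section
types <- doc %>%
  html_nodes(xpath = "//div[@class='chart-section for_vaccines']") %>%
  html_node("div.is_h3") %>%
  html_text()

# xml_length() counts the child elements of each list container (rows per category)
counts <- doc %>% html_nodes(xpath = "//div[@role='list']") %>% xml_length()

# One stage value per row, in document order
phase <- doc %>%
  html_nodes(xpath = "//div[@class='chart_row for_vaccines']") %>%
  html_node(xpath = ".//div[@class='is_h5-2 is_stage']") %>%
  html_text()

# Repeat each category name once per row it contains
result <- data.frame(type = rep(types, times = counts), phase = phase)
print(result)
```

Using `times` with a vector of per-category counts keeps the labels aligned with the rows, provided the containers appear in the same order as the category headings.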

Answered 2020-12-12T17:32:26.377