html - Extracting text from a dynamic web page with R

Question

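I am writing a data-preparation tutorial using the data from this article: https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#

None of the text is hard-coded; everything is dynamic, and I don't know where to start. I have tried a few things with the rvest and xml2 packages, but I can't even tell whether I'm making progress.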

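I used copy/paste and regular expressions in Notepad++ to get a table structure like this:

Target	Attack
AAA News	Fake News
AAA News	Fake News
AAA News	A total disgrace
...	...
Mr. ZZZ	A real nut job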

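But I would like to show how to do everything programmatically (no copy/paste).

My main question is this: is it possible with reasonable effort? If so, any pointers on how to get started?

PS: I know this might be a duplicate; I just can't tell of which question, since there are completely different approaches out there.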

score 2 · Accepted Answer

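I have used up my free article allotment at the New York Times for this month, but here is some guidance. It looks like the web page uses several scripts to create and display the page.

If you open your browser's developer tools and look at the Network tab, you will find 2 CSV files: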

tweets-full.csv is located here: https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-full.csv
tweets-reduced.csv is located here: https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-reduced.csv

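It looks like the reduced file produces the table referenced above, while tweets-full contains the complete tweets. You can download these files directly, read them with read.csv(), and process the information as needed.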
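A minimal sketch of that idea, assuming the two URLs above are still live. The helper name read_nyt_csv is made up for illustration, and the column names of the files are not stated in this answer, so inspect them with str() before relying on any of them.

```r
# Base URL copied from the two links in this answer.
base_url <- "https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a"

# Hypothetical helper: read one of the CSV files by its file name.
# read.csv() accepts a URL as well as a local path.
read_nyt_csv <- function(file_name, base = base_url) {
  read.csv(paste(base, file_name, sep = "/"),
           stringsAsFactors = FALSE, encoding = "UTF-8")
}

# Example (needs network access):
# tweets_reduced <- read_nyt_csv("tweets-reduced.csv")
# str(tweets_reduced)  # check which columns the file actually contains
```

Keeping the base URL in one place makes it easy to switch between tweets-reduced.csv and tweets-full.csv.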

Before scraping any web page, be sure to read its terms of service.

score 1 · Accepted Answer

Here is a programmatic approach using RSelenium and rvest:

library(RSelenium)
library(rvest)
library(tidyverse)
#chromever must be compatible with the Chrome version installed on your machine
driver <- rsDriver(browser = "chrome", port = 4234L, chromever = "87.0.4280.87")
client <- driver[["client"]]
client$navigate("https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#")
page.source <- client$getPageSource()[[1]]

#Extract nodes for each letter using XPath
Letters <- read_html(page.source) %>%
  html_nodes(xpath = '//*[@id="mem-wall"]/div[2]/div') 

#Extract Entities using CSS
Entities <- map(Letters, ~ html_nodes(.x, css = 'div.g-entity-name') %>%
                  html_text)

#Extract quotes using CSS
Quotes <- map(Letters, ~ html_nodes(.x, css = 'div.g-twitter-quote-container') %>%
                            map(html_nodes, css = 'div.g-twitter-quote-c') %>%
                            map(html_text))

#Bind the entities and quotes together. There are two letters that are blank, so fall back to NA
map2_dfr(Entities, Quotes,
         ~ map2_dfr(.x, .y, ~ {
           if (length(.x) > 0 && length(.y) > 0) {
             data.frame(Entity = .x, Insult = .y)
           } else {
             data.frame(Entity = NA, Insult = NA)
           }
         })) -> Result

#Strip out the quotes
Result %>%
  mutate(Insult = str_replace_all(Insult,"(^“)|([ .,!?]?”)","") %>% str_trim) -> Result

#Take a look at the result
Result %>%
  slice_sample(n=10)
                   Entity                                                              Insult
1             Mitt Romney                                       failed presidential candidate
2         Hillary Clinton                                                             Crooked
3  The “mainstream” media                                                           Fake News
4               Democrats                                             on a fishing expedition
5           Pete Ricketts                                             illegal late night coup
6  The “mainstream” media                                                   anti-Trump haters
7     The Washington Post do nothing but write bad stories even on very positive achievements
8               Democrats                                                                weak
9             Marco Rubio                                                         Lightweight
10     The Steele Dossier                                                      a Fake Dossier
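To see what the quote-stripping step does in isolation, here is a small sketch that applies the same pattern to two made-up strings (they are not taken from the scraped data). The curly quotes are written as Unicode escapes so the snippet does not depend on the file encoding.

```r
library(stringr)

# Same pattern as in the answer: a leading opening curly quote, or a closing
# curly quote with an optional space or punctuation mark in front of it.
quote_pattern <- "(^\u201c)|([ .,!?]?\u201d)"

# Made-up sample strings for illustration only.
samples <- c("\u201cFake News!\u201d", "\u201cCrooked\u201d ")

cleaned <- str_trim(str_replace_all(samples, quote_pattern, ""))
print(cleaned)
```

Note that the pattern also drops a punctuation mark that sits directly before the closing quote.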

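The XPath was obtained by inspecting the page in Chrome's developer tools (F12): hover over the elements until the right one is highlighted, then right-click it and choose Copy > Copy XPath.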
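If you want to check a selector without starting a browser, you can try it on a small HTML string first. The snippet below is a sketch on hand-written HTML that only borrows the id and class names used in the code above; the real page structure is deeper and may differ.

```r
library(rvest)

# Hypothetical HTML fragment, written by hand for illustration.
toy_html <- '
<div id="mem-wall">
  <div class="g-entity-name">AAA News</div>
  <div class="g-twitter-quote-container">
    <div class="g-twitter-quote-c">Fake News</div>
  </div>
</div>'

page <- read_html(toy_html)

# XPath: the direct div children of the element with id "mem-wall"
children <- html_nodes(page, xpath = '//*[@id="mem-wall"]/div')

# CSS: select by class name, then pull out the text
entity <- page %>% html_nodes(css = "div.g-entity-name") %>% html_text()
quote  <- page %>% html_nodes(css = "div.g-twitter-quote-c") %>% html_text()

data.frame(Entity = entity, Insult = quote)
```

Once a selector returns what you expect here, the same call can be pointed at the page source returned by RSelenium.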
