r - Launch a web browser and copy information with R

Question

I am trying to find a way to copy and paste the title and abstract from a PubMed page.

I started with

browseURL("https://pubmed.ncbi.nlm.nih.gov/19592249") ## final numbers are the PMID

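Now I can't find a way to get the title and abstract as text. I have to do this for multiple PMIDs, so I need to automate it. It would also be useful to simply copy everything on that page, so that I could then keep only what I need. Is it possible to do this? Thanks!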

score 0 · Accepted Answer

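I would also use a function and rvest. However, I would pass the pid as the function argument, use html_node because only one node needs to be matched, and use the faster CSS selectors. String cleaning is done with the stringr package: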

library(rvest)
library(stringr)
library(dplyr)

get_abstract <- function(pid) {
  # Download and parse the PubMed page for this PMID
  page <- read_html(paste0('https://pubmed.ncbi.nlm.nih.gov/', pid))

  # html_node() returns only the first match; str_squish() collapses whitespace
  df <- tibble(
    title = page %>% html_node('.heading-title') %>% html_text() %>% str_squish(),
    abstract = page %>% html_node('#enc-abstract') %>% html_text() %>% str_squish()
  )
  return(df)
}

get_abstract('19592249')
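
Since the question mentions several PMIDs, here is a rough sketch of how the same idea could be applied to a vector of them. The names parse_abstract and get_abstracts, the extra pid column, and the delay argument are my own assumptions for illustration and are not part of rvest. The one-second pause is just a cautious default I picked, not a documented requirement.

```r
library(rvest)
library(stringr)
library(dplyr)

# Hypothetical helper: turn one already-parsed page into a one-row tibble.
# Keeping the parsing separate from the download makes it easy to try out
# on a saved or hand-written HTML snippet.
parse_abstract <- function(page, pid) {
  tibble(
    pid = pid,
    title = page %>% html_node('.heading-title') %>% html_text() %>% str_squish(),
    abstract = page %>% html_node('#enc-abstract') %>% html_text() %>% str_squish()
  )
}

# Hypothetical helper: download each PMID in turn and stack the rows.
# `delay` (seconds between requests) is an assumed, arbitrary default.
get_abstracts <- function(pids, delay = 1) {
  rows <- lapply(pids, function(pid) {
    Sys.sleep(delay)
    page <- read_html(paste0('https://pubmed.ncbi.nlm.nih.gov/', pid))
    parse_abstract(page, pid)
  })
  bind_rows(rows)
}
```

Calling get_abstracts(c('19592249', '22281223')) should then give one row per PMID. If a page has no matching node, html_node returns a missing node and the corresponding field ends up as NA rather than raising an error.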

score 0 · Accepted Answer

I think what you want to do is look up articles of interest on PubMed?

Here is one way to do that using the rvest package:

#Required libraries.
library(magrittr)
library(rvest)

#Function.
getpubmed <- function(url){
  
  dat <- rvest::read_html(url)
  
  pid <- dat %>% html_elements(xpath = '//*[@title="PubMed ID"]') %>% html_text2() %>% unique()
  ptitle <- dat %>% html_elements(xpath = '//*[@class="heading-title"]') %>% html_text2() %>% unique()
  pabs <- dat %>% html_elements(xpath = '//*[@id="enc-abstract"]') %>% html_text2()
  
  return(data.frame(pubmed_id = pid, title = ptitle, abs = pabs, stringsAsFactors = FALSE))
  
}

#Test run.
urls <- c("https://pubmed.ncbi.nlm.nih.gov/19592249", "https://pubmed.ncbi.nlm.nih.gov/22281223/")

df <- do.call("rbind", lapply(urls, getpubmed))

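The code should be self-explanatory. (For brevity, I have not included the contents of df here.) The function getpubmed does no error handling or anything like that, but it is a start. By supplying a vector of URLs to the do.call("rbind", lapply(urls, getpubmed)) construct, you get back a data.frame with columns for the PubMed ID, title, and abstract.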
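
As for the missing error handling, one possible approach is to wrap each call in tryCatch so that a single failing URL does not abort the whole run. This is only a sketch in base R. The name safe_rbind is hypothetical, and skipping failed URLs with a message is one design choice among several (you might prefer to record them instead).

```r
# Hypothetical wrapper: apply `f` to every URL, skip the ones that raise
# an error, and row-bind whatever succeeded.
safe_rbind <- function(urls, f) {
  rows <- lapply(urls, function(u) {
    tryCatch(
      f(u),
      error = function(e) {
        # Report the failure and return NULL so this URL is dropped below
        message("Skipping ", u, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  # Remove the failed entries before binding
  rows <- Filter(Negate(is.null), rows)
  do.call("rbind", rows)
}
```

With the function above, df <- safe_rbind(urls, getpubmed) would replace the do.call line. This also covers pages without an abstract: there pabs has length zero, so the data.frame() call inside getpubmed raises an error, and that URL is skipped instead of stopping everything.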

Another option is to explore the easyPubMed package.
