python - How to download PDFs of scientific papers using R or Python via a Google Scholar query

Question

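I have a list of titles of scientific papers that I need to download. I would like to write a loop that downloads their PDF files from the web, but I can't find a way to do it.

Here is the step-by-step outline I have come up with so far (answers in R or Python are welcome):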

# Create list with paper titles (example with 4 papers from different journals)
titles <- c("Effect of interfacial properties on polymer–nanocrystal thermoelectric transport",
            "Reducing social and environmental impacts of urban freight transport: A review of some major cities",
            "Using Lorenz curves to assess public transport equity",
            "Green infrastructure: The effects of urban rail transit on air quality")

#Loop step1 - Query paper title in Google Scholar to get URL of journal webpage containing the paper
#Loop step2 - Download the PDF from the journal webpage and save in your computer

for (i in titles) {
  # pseudocode: look up title i on Google (Scholar) and keep the journal URL
  journal_URL <- query i in google (scholar)
  download.file(url = journal_URL, pattern = "pdf",
                destfile = paste0(i, ".pdf"))
}
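For Python users, the loop above could be sketched roughly as follows. This is only a sketch under assumptions: `resolve_pdf_url` is a hypothetical placeholder for step 1 (whatever lookup ends up being used), and `safe_filename` is a helper introduced here because the titles contain characters such as `:` and `–` that are awkward or invalid in file names on some systems.

```python
import re
import urllib.request
from pathlib import Path


def safe_filename(title, max_len=100):
    """Turn a paper title into a file name that is safe on common file systems."""
    # Replace each run of characters other than letters, digits, "_" and "-" with "_".
    cleaned = re.sub(r"[^\w\-]+", "_", title).strip("_")
    return cleaned[:max_len] + ".pdf"


def download_all(titles, resolve_pdf_url, out_dir="papers"):
    """Try to download one PDF per title; return {title: saved path or None}."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved = {}
    for title in titles:
        # Step 1 (hypothetical callback): returns a direct PDF URL, or None if not found.
        url = resolve_pdf_url(title)
        if url is None:
            saved[title] = None
            continue
        # Step 2: fetch the file and save it under a sanitized name.
        dest = Path(out_dir) / safe_filename(title)
        urllib.request.urlretrieve(url, str(dest))
        saved[title] = dest
    return saved
```

Passing the lookup in as a function keeps the download loop independent of whether step 1 ends up using Google Scholar, Google, or some other index.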

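Complications:

Loop step 1 - The first hit on Google Scholar should be the original URL of the paper. However, I have heard that Google Scholar is a bit picky about bots, so an alternative would be to query Google and take the first URL (hoping that it is the correct one).

Loop step 2 - Some papers are gated, so I imagine it would be necessary to include authentication information (user=__, passwd=__). However, if I am on my university network, this authentication should happen automatically, right?

P.S. I only need to download the PDFs. I am not interested in bibliometric information (e.g. citation records, h-index). For bibliometric data, there is some guidance here (R users) and here (python users).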
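Whether campus access really is automatic depends on how each publisher recognises the institution, so it may be safer to verify every download than to assume it worked. A gated request frequently comes back as an HTML login or landing page saved under a `.pdf` name. A minimal sketch of a check, assuming the response body is already in memory as bytes (both function names are made up for this example):

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header."""
    # PDF files begin with the marker "%PDF-"; tolerate leading whitespace.
    return data[:1024].lstrip().startswith(b"%PDF-")


def save_if_pdf(data: bytes, dest: str) -> bool:
    """Write the payload to dest only when it looks like a real PDF."""
    if not looks_like_pdf(data):
        return False
    with open(dest, "wb") as fh:
        fh.write(data)
    return True
```

Titles for which `save_if_pdf` returns `False` can be collected and retried by hand or through a library proxy.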

score 5 · Accepted Answer

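Crossref has a program in which publishers can provide metadata with links to full-text versions of their articles. Unfortunately, publishers like Wiley, Elsevier, and Springer may give the links, but you then need extra permissions to actually retrieve the content. Fun, right? Anyway, some do work. For example, the following works for your second title: search Crossref, get the URLs for the full text (if provided), then grab the XML (better than PDF, IMHO).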
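For Python users, a similar lookup can be sketched against the Crossref REST API directly. The sketch assumes the public `https://api.crossref.org/works` endpoint with its `query.bibliographic` and `rows` parameters, and a JSON response in which each item under `message.items` may carry a `link` list whose entries have `URL` and `content-type` fields. The function names are illustrative and do not come from any library.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(title, rows=1):
    """Build a Crossref works query that searches by bibliographic text."""
    params = {"query.bibliographic": title, "rows": rows}
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def fulltext_links(payload):
    """Return (DOI, {content type: URL}) for the first hit in a parsed response."""
    items = payload.get("message", {}).get("items", [])
    if not items:
        return None, {}
    first = items[0]
    # Publishers are not required to register links, so the list may be missing.
    links = {
        entry.get("content-type", "unspecified"): entry["URL"]
        for entry in first.get("link", [])
        if "URL" in entry
    }
    return first.get("DOI"), links


def search_crossref(title):
    """Query Crossref over the network and extract the full-text links."""
    with urllib.request.urlopen(build_query_url(title), timeout=30) as resp:
        return fulltext_links(json.load(resp))
```

As noted above, a link being listed does not mean the content behind it can be retrieved without extra permissions.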

titles <- c("Effect of interfacial properties on polymer–nanocrystal thermoelectric transport",
            "Reducing social and environmental impacts of urban freight transport: A review of some major cities",
            "Using Lorenz curves to assess public transport equity",
            "Green infrastructure: The effects of urban rail transit on air quality")

library("rcrossref")
# Search Crossref for the second title and keep the DOI of the first hit
out <- cr_search(titles[2])
doi <- sub("http://dx.doi.org/", "", out$doi[1])
# List the full-text links registered for that DOI
(links <- cr_ft_links(doi, "all"))
#> $xml
#> <url> http://api.elsevier.com/content/article/PII:S1877042812005551?httpAccept=text/xml
#>
#> $plain
#> <url> http://api.elsevier.com/content/article/PII:S1877042812005551?httpAccept=text/plain

# Fetch the XML full text and pull out the first author node
xml <- cr_ft_text(links, "xml")
library("XML")
xpathApply(xml, "//ce:author")[[1]]
#> <ce:author>
#>    <ce:degrees>Prof</ce:degrees>
#>    <ce:given-name>Eiichi</ce:given-name>
#>    <ce:surname>Taniguchi</ce:surname>
#> </ce:author>
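The author lookup in the last step can also be done in Python with the standard library. This is a sketch under assumptions: the namespace URI bound to the `ce` prefix is a guess that should be checked against the root element of the real document, and `SAMPLE` is a made-up wrapper around the author fragment shown above, not actual publisher output.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "ce" prefix; verify it against the real XML.
NS = {"ce": "http://www.elsevier.com/xml/common/dtd"}

# Hypothetical stand-in document built around the fragment shown above.
SAMPLE = """<doc xmlns:ce="http://www.elsevier.com/xml/common/dtd">
  <ce:author>
    <ce:degrees>Prof</ce:degrees>
    <ce:given-name>Eiichi</ce:given-name>
    <ce:surname>Taniguchi</ce:surname>
  </ce:author>
</doc>"""


def author_names(xml_text, ns=NS):
    """Return "given surname" strings for every ce:author element."""
    root = ET.fromstring(xml_text)
    names = []
    for author in root.iterfind(".//ce:author", ns):
        given = author.findtext("ce:given-name", default="", namespaces=ns)
        surname = author.findtext("ce:surname", default="", namespaces=ns)
        names.append(f"{given} {surname}".strip())
    return names


print(author_names(SAMPLE))
```

ElementTree needs the prefix-to-URI mapping passed explicitly, which is why `NS` is handed to every lookup.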
