python - How to access PubMed data for forms with multiple pages in a Python crawler

Question

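I am trying to crawl PubMed with Python and get the PubMed IDs of all the papers that cite a given article.

For example, this article (ID: 11825149), http://www.ncbi.nlm.nih.gov/pubmed/11825149, has a page linking to all of the articles that cite it: http://www.ncbi.nlm.nih.gov/pubmed?linkname=pubmed_pubmed_citedin&from_uid=11825149 The problem is that it has over 200 links but shows only 20 per page, and the "next page" link cannot be reached through a URL.

Is there a way to open the "Send to" option, or to view the content of the following pages, with Python?

How I currently open PubMed pages: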

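from urllib.request import urlopen

def start(seed, pageid):
    # Fetch and print the article page itself.
    webpage = urlopen(seed).read()
    print(webpage)

    # Fetch the first page of the "cited by" listing; pageid is the PubMed ID as a string.
    citedByPage = urlopen('http://www.ncbi.nlm.nih.gov/pubmed?linkname=pubmed_pubmed_citedin&from_uid=' + pageid).read()
    print(citedByPage)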

From this I can extract all of the cited-by links on the first page, but how can I extract them from every page? Thanks.

score 3 · Accepted Answer

I was able to get the citing IDs using the method from this page: http://www.bio-cloud.info/Biopython/en/ch8.html

Back in Section 8.7 we mentioned ELink can be used to search for citations of a given paper. Unfortunately this only covers journals indexed for PubMed Central (doing it for all the journals in PubMed would mean a lot more work for the NIH). Let’s try this for the Biopython PDB parser paper, PubMed ID 14630660:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"
>>> pmid = "14630660"
>>> results = Entrez.read(Entrez.elink(dbfrom="pubmed", db="pmc",
...                                    LinkName="pubmed_pmc_refs", from_uid=pmid))
>>> pmc_ids = [link["Id"] for link in results[0]["LinkSetDb"][0]["Link"]]
>>> pmc_ids
['2744707', '2705363', '2682512', ..., '1190160']

Great - eleven articles. But why hasn’t the Biopython application note been found (PubMed ID 19304878)? Well, as you might have guessed from the variable names, these are not actually PubMed IDs, but PubMed Central IDs. Our application note is the third citing paper in that list, PMCID 2682512.

So, what if (like me) you’d rather get back a list of PubMed IDs? Well we can call ELink again to translate them. This becomes a two step process, so by now you should expect to use the history feature to accomplish it (Section 8.15).

But first, taking the more straightforward approach of making a second (separate) call to ELink:

>>> results2 = Entrez.read(Entrez.elink(dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed",
...                                     from_uid=",".join(pmc_ids)))
>>> pubmed_ids = [link["Id"] for link in results2[0]["LinkSetDb"][0]["Link"]]
>>> pubmed_ids
['19698094', '19450287', '19304878', ..., '15985178']

This time you can immediately spot the Biopython application note as the third hit (PubMed ID 19304878).
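
The chained indexing in both calls (`results[0]["LinkSetDb"][0]["Link"]`) assumes the query returned at least one link set, so it would likely raise an `IndexError` for a paper with no recorded citations. A minimal sketch of a more defensive helper, assuming the parsed result has the same nested list/dict shape as in the examples; `extract_link_ids` is a hypothetical name, not part of Biopython, and the sample data is hand-written:

```python
def extract_link_ids(results):
    """Collect every linked ID from a parsed ELink result.

    Assumes the nested shape used in the examples above:
    results[i]["LinkSetDb"][j]["Link"][k]["Id"].
    """
    ids = []
    for link_set in results:
        # "LinkSetDb" may be missing or empty when no links were found.
        for link_db in link_set.get("LinkSetDb", []):
            for link in link_db.get("Link", []):
                ids.append(link["Id"])
    return ids


# Hand-written stand-in for a parsed result, for illustration only.
sample = [{"LinkSetDb": [{"Link": [{"Id": "2744707"}, {"Id": "2705363"}]}]}]
print(extract_link_ids(sample))
print(extract_link_ids([{"LinkSetDb": []}]))
```

With a helper like this, `pmc_ids = extract_link_ids(results)` would stand in for the list comprehension and simply give an empty list when nothing links to the paper.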

Now, let’s do that all again but with the history …TODO.

And finally, don’t forget to include your own email address in the Entrez calls.
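
For the article in the question (PubMed ID 11825149), the same ELink service can also be queried with the `pubmed_pubmed_citedin` link name that appears in the question's URL, which avoids scraping the 20-per-page web listing. The following is a sketch using only the standard library, assuming the public E-utilities `elink.fcgi` endpoint and an XML response in which linked IDs sit under `LinkSetDb` / `Link` / `Id`. The helper names and the sample XML (with made-up IDs) are illustrative only, and the network call is left commented out:

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Assumed base URL of the NCBI E-utilities ELink service.
EUTILS_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def build_citedin_url(pmid):
    """Build an ELink URL asking for PubMed records that cite `pmid`."""
    params = {
        "dbfrom": "pubmed",
        "linkname": "pubmed_pubmed_citedin",
        "id": pmid,
    }
    return EUTILS_ELINK + "?" + urlencode(params)


def parse_link_ids(xml_text):
    """Return the text of every <Id> found under LinkSetDb/Link."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall(".//LinkSetDb/Link/Id")]


# Hypothetical response fragment with made-up IDs, for illustration only.
sample_xml = (
    "<eLinkResult><LinkSet><DbFrom>pubmed</DbFrom>"
    "<IdList><Id>11825149</Id></IdList>"
    "<LinkSetDb><DbTo>pubmed</DbTo>"
    "<LinkName>pubmed_pubmed_citedin</LinkName>"
    "<Link><Id>111</Id></Link><Link><Id>222</Id></Link>"
    "</LinkSetDb></LinkSet></eLinkResult>"
)
print(build_citedin_url("11825149"))
print(parse_link_ids(sample_xml))

# Live request (needs network access), left commented out:
# with urlopen(build_citedin_url("11825149")) as response:
#     cited_by = parse_link_ids(response.read())
```

The ID inside `IdList` is the queried article itself, so the path deliberately matches only IDs nested under `LinkSetDb`.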
