python - Using Python to ask a web page to run a search

Question

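I have a list of protein names in "Uniprot" format, and I would like to convert them all to MGI format. If you go to www.uniprot.org and type a Uniprot protein name into the "Query" bar, it generates a page with a lot of information about that protein, including its MGI name (albeit far down the page).

For example, one Uniprot name is "Q9D880"; scrolling down the resulting page shows that its corresponding MGI name is "1913775".

I already know how to use Python's urllib to extract the MGI name from a page. What I don't know is how to write Python code that makes the main page run a query for "Q9D880". My list contains 270 protein names, so I would like to avoid copying and pasting each one into the query bar.

I saw the post "Google Search from a Python App", which gave me a better understanding of the concept, but I suspect that running a Google search is different from running the search function on another website such as uniprot.org.

I am running Python 2.7.2, but I am open to solutions that use other versions of Python. Thanks for your help!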

score 7 · Accepted Answer

A simpler way is to use the requests library. The solution below also uses BeautifulSoup4 to pull the information itself out of the page.

Given a list of protein names (my_protein_list), all you have to do is:

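import requests
from bs4 import BeautifulSoup as BS

for protein in my_protein_list:
    # Each entry page lives at /uniprot/<protein name>
    text = requests.get('http://www.uniprot.org/uniprot/' + protein).text
    soup = BS(text, 'html.parser')  # name the parser explicitly
    # The MGI cross-reference link is identified by its onclick handler
    MGI = soup.find(name='a', onclick="UniProt.analytics('DR-lines', 'click', 'DR-MGI');").text
    MGI = MGI[4:]  # drop the leading "MGI:" prefix
    print(protein + ' - ' + MGI)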
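
The MGI[4:] slice above works because the link text has the form "MGI:1913775". As a rough sketch of the same extraction that does not depend on the onclick attribute, the snippet below searches the raw page text for that pattern with a regular expression. The helper name extract_mgi_id and the sample HTML are hypothetical, and the approach assumes the entry page contains the identifier as literal text of the form "MGI:" followed by digits.

```python
import re

# Assumption: the MGI identifier appears in the page as "MGI:" followed by digits
MGI_PATTERN = re.compile(r"MGI:(\d+)")


def extract_mgi_id(page_text):
    """Return the digits of the first MGI identifier found, or None."""
    match = MGI_PATTERN.search(page_text)
    return match.group(1) if match else None


# Hypothetical fragment standing in for a downloaded entry page
sample = '<a href="#">MGI:1913775</a>'
print(extract_mgi_id(sample))
```

Returning None for a page without an MGI identifier lets the loop skip that protein, whereas soup.find(...).text raises an AttributeError when find returns None.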

score 4 · Accepted Answer

Running a search appears to perform a GET on

http://www.uniprot.org/?dataset=uniprot&query=Q9D880&sort=score&url=&lucky=no&random=no

which eventually redirects you to

http://www.uniprot.org/uniprot/Q9D880

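So you should be able to use urllib or an HTTP library (I use httplib2) to perform a GET on that address, parameterizing the protein name in the URL so that you can look up any protein name you need.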
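
The parameterization described above can be sketched with the standard library as follows. This is Python 3 code; the names build_search_url and fetch_search_page are made up for illustration, and the parameter list is copied from the search URL quoted in this answer, which the site may no longer accept in this form.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://www.uniprot.org/"


def build_search_url(protein_name):
    """Build the search URL quoted above for one protein name."""
    # A list of pairs keeps the parameters in the same order as the quoted URL
    params = [
        ("dataset", "uniprot"),
        ("query", protein_name),
        ("sort", "score"),
        ("url", ""),
        ("lucky", "no"),
        ("random", "no"),
    ]
    return BASE_URL + "?" + urlencode(params)


def fetch_search_page(protein_name):
    """Perform the GET; urlopen follows redirects by default."""
    with urlopen(build_search_url(protein_name)) as response:
        return response.read()


print(build_search_url("Q9D880"))
```

Using urlencode rather than string concatenation also takes care of escaping, should a name ever contain characters that are not URL-safe.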

score 3 · Accepted Answer

You can also do this with PyQuery:

>>> from pyquery import PyQuery as pq
>>> url = "http://www.uniprot.org/uniprot/{name}"
>>> name = "Q9D880"
>>> html = pq(url=url.format(name=name))
>>> print(html("a").filter(lambda e: pq(this).text().startswith("MGI:")).text())
MGI:1913775

score 1 · Accepted Answer

The query is part of the URL, so you can call:
http://www.uniprot.org/uniprot/?query=1913775&sort=score

I have not had time to test this script because I don't have 2.x installed, but the 2.x code should look like this:

import urllib
MGIName = "1913775"
print urllib.urlopen(
    "http://www.uniprot.org/uniprot/?query="+ MGIName +"&sort=score").read()

The code I ran in 3.2 looks like this, and it works fine:

>>> import urllib.request
>>> MGIName = "1913775"
>>> print(urllib.request.urlopen("http://www.uniprot.org/uniprot/?query="+ MGIName +"&sort=score").read())

Then just loop over your list of names, assigning each one to MGIName in turn.
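
The loop suggested above might look like the following in Python 3. It is only a sketch: build_query_url, fetch_pages, and the opener parameter are names invented here, and the one-second pause between requests is an arbitrary courtesy delay rather than anything stated in this thread.

```python
import time
from urllib.parse import quote
from urllib.request import urlopen


def build_query_url(name):
    """Insert one name into the query URL used in this answer."""
    return "http://www.uniprot.org/uniprot/?query=" + quote(name) + "&sort=score"


def fetch_pages(names, opener=urlopen, delay=1.0):
    """Return a dict mapping each name to the raw bytes of its result page."""
    pages = {}
    for index, name in enumerate(names):
        if index > 0:
            time.sleep(delay)  # pause between requests (arbitrary choice)
        response = opener(build_query_url(name))
        try:
            pages[name] = response.read()
        finally:
            response.close()
    return pages
```

Calling fetch_pages(my_protein_list) would then download one result page per name. The opener argument exists only so that the loop can be tried with a stand-in function instead of real network access.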
