python - Extracting FASTA moonlighting protein sequences with Python

Question

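I want to use Python to extract the FASTA files containing the amino acid sequences from the Moonlighting Protein Database (www.moonlightingproteins.org/results.php?search_text=). Since this is an iterative process, I would rather learn how to program it than do it by hand, b/c come on, it's 2016. The problem is that I don't know how to write the code, because I'm a novice programmer :( . The basic pseudocode would be: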

 for protein_name in site: www.moonlightingproteins.org/results.php?search_text=:

       go to the uniprot option 

       download the fasta file 

       store it in a .txt file inside a given folder

Thanks in advance!

score 0 · Accepted Answer

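I would strongly suggest asking the authors for the database. From the FAQ:

I would like to use the MoonProt database in a project to analyze the amino acid sequences or structures using bioinformatics.

If you are interested in using the MoonProt database to analyze the sequences and/or structures of moonlighting proteins, please contact us at bioinformatics@moonlightingproteins.org.

Suppose you find something interesting: how would you cite it in your paper or thesis? "These sequences were scraped from a public web page without the authors' consent"? It is much better to give credit to the original researchers.

Here is a good introduction to scraping.

But back to your original question.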

import requests
from lxml import html

# Download one protein at a time; change 3 to any other entry id.
page = requests.get('http://www.moonlightingproteins.org/detail.php?id=3', timeout=30)
# Stop early if the server returned an error page.
page.raise_for_status()
# Convert the HTML document into a tree we can query from Python.
tree = html.fromstring(page.content)
# Get all table cells.
cells = tree.xpath('//td')

for i, cell in enumerate(cells):
    # If the cell text looks like a FASTA record, print it.
    if cell.text and cell.text.startswith('>'):
        print(cell.text)
    # If a cell mentions UniProt, print the link from the next cell,
    # provided there is a next cell and it contains a link.
    if 'UniProt' in cell.text_content() and i + 1 < len(cells):
        link = cells[i + 1].find('a')
        if link is not None and 'href' in link.attrib:
            print(link.attrib['href'])
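
To cover the last two steps of your pseudocode (download the FASTA file, store it in a folder), something along these lines might work once you have collected the UniProt links printed by the loop above. Treat it as a rough sketch, not a tested pipeline: the helper names (`accession_from_href`, `fasta_url`, `save_fasta`, `download_all`) are made up for illustration, it assumes each link contains a standard UniProt accession such as P12345, and it assumes the UniProt REST address `https://rest.uniprot.org/uniprotkb/<accession>.fasta` is the right place to fetch from. It uses only the standard library so that the download function can be swapped out.

```python
import re
import time
import urllib.request
from pathlib import Path

# Pattern for UniProt accessions (6 or 10 characters, e.g. P12345).
ACCESSION = re.compile(
    r"\b([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)


def accession_from_href(href):
    """Return the first UniProt accession found in a link, or None."""
    match = ACCESSION.search(href)
    return match.group(1) if match else None


def fasta_url(accession):
    """Build the FASTA download address (assumed UniProt REST layout)."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fetch_text(url):
    """Download a URL and return its body as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def save_fasta(accession, out_dir, fetch=fetch_text):
    """Download one FASTA record and store it as <accession>.txt in out_dir."""
    text = fetch(fasta_url(accession))
    if not text.startswith(">"):
        raise ValueError(f"{accession}: response does not look like FASTA")
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{accession}.txt"
    path.write_text(text, encoding="utf-8")
    return path


def download_all(hrefs, out_dir, delay=1.0, fetch=fetch_text):
    """Save one file per link; links without an accession are skipped."""
    saved = []
    for href in hrefs:
        accession = accession_from_href(href)
        if accession is None:
            continue
        saved.append(save_fasta(accession, out_dir, fetch=fetch))
        # Pause between requests to be polite to the server.
        time.sleep(delay)
    return saved
```

You would call it as `download_all(list_of_links, 'fasta_files')`, where `list_of_links` holds the hrefs collected from the detail pages. The `fetch` parameter is only there so the download step can be replaced, for example when trying the code without network access.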

