
This is a single line of HTML, extracted from the .html file with the unique identifier "|Rv0153c|":

<TR><TD><small style=font-family:courier> >M. tuberculosis H37Rv|Rv0153c|ptbB<br />MAVRELPGAWNFRDVADTATALRPGRLFRSSELSRLDDAGRATLRRLGITDVADLRSSRE<br />VARRGPGRVPDGIDVHLLPFPDLADDDADDSAPHETAFKRLLTNDGSNGESGESSQSIND<br />AATRYMTDEYRQFPTRNGAQRALHRVVTLLAAGRPVLTHCFAGKDRTGFVVALVLEAVGL<br />DRDVIVADYLRSNDSVPQLRARISEMIQQRFDTELAPEVVTFTKARLSDGVLGVRAEYLA<br />AARQTIDETYGSLGGYLRDAGISQATVNRMRGVLLG<br /></small><TR><td><b><big>Blastp: <a href="http://tuberculist.epfl.ch/blast_output/Rv0153c.fasta.out"> Pre-computed results</a></big></b><TR><td><b><big>TransMembrane prediction using Hidden Markov Models: <a href="http://tuberculist.epfl.ch/tmhmm/Rv0153c.html"> TMHMM</a></big></b><base target="_blank"/><TR><td><b><big>Genomic sequence</big></b><br /><br /><form action="dnaseq.php" method="get">

I want to write code that extracts the following information from this line of HTML, in exactly the format shown below:

>M. tuberculosis H37Rv|Rv0153c|ptbB
MAVRELPGAWNFRDVADTATALRPGRLFRSSELSRLDDAGRATLRRLGITDVADLRSSRE
VARRGPGRVPDGIDVHLLPFPDLADDDADDSAPHETAFKRLLTNDGSNGESGESSQSIND
AATRYMTDEYRQFPTRNGAQRALHRVVTLLAAGRPVLTHCFAGKDRTGFVVALVLEAVGL
DRDVIVADYLRSNDSVPQLRARISEMIQQRFDTELAPEVVTFTKARLSDGVLGVRAEYLA
AARQTIDETYGSLGGYLRDAGISQATVNRMRGVLLG

2 Answers


You can use Python and its regular expression module (re).

import re
sentence = '<TR><TD><small style=font-family:courier> >M. tuberculosis H37Rv|Rv0153c|ptbB<br />MAVRELPGAWNFRDVADTATALRPGRLFRSSELSRLDDAGRATLRRLGITDVADLRSSRE<br />VARRGPGRVPDGIDVHLLPFPDLADDDADDSAPHETAFKRLLTNDGSNGESGESSQSIND<br />AATRYMTDEYRQFPTRNGAQRALHRVVTLLAAGRPVLTHCFAGKDRTGFVVALVLEAVGL<br />DRDVIVADYLRSNDSVPQLRARISEMIQQRFDTELAPEVVTFTKARLSDGVLGVRAEYLA<br />AARQTIDETYGSLGGYLRDAGISQATVNRMRGVLLG<br /></small><TR><td><b><big>Blastp: <a href="http://tuberculist.epfl.ch/blast_output/Rv0153c.fasta.out"> Pre-computed results</a></big></b><TR><td><b><big>TransMembrane prediction using Hidden Markov Models: <a href="http://tuberculist.epfl.ch/tmhmm/Rv0153c.html"> TMHMM</a></big></b><base target="_blank"/><TR><td><b><big>Genomic sequence</big></b><br /><br /><form action="dnaseq.php" method="get">'   
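# Keep only the <small>...</small> block so the trailing link text is not included
block = re.search(r'<small[^>]*>(.*?)</small>', sentence, re.S).group(1)
# Turn each <br /> into a newline, strip any remaining tags, and trim whitespace
print(re.sub(r'<[^>]*>', '', block.replace('<br />', '\n')).strip())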

HTH.

Answered 2013-10-23T17:06:57.380

I think what you are looking for is an HTML parser: Simple HTML and XHTML parser
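
For illustration, here is a minimal sketch of the parser approach using the html.parser module from Python 3's standard library. The class name FastaExtractor, the helper extract_fasta, and the shortened sample string are hypothetical and only for this example. It assumes the record sits in the first <small> element, with one <br /> per line, as in the question.

```python
from html.parser import HTMLParser


class FastaExtractor(HTMLParser):
    """Collect the text of the first <small> element, one line per <br>."""

    def __init__(self):
        super().__init__()
        self.in_small = False  # True while inside <small>...</small>
        self.done = False      # True once the first <small> has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; <br /> is also routed through here.
        if tag == "small" and not self.done:
            self.in_small = True
        elif tag == "br" and self.in_small:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "small" and self.in_small:
            self.in_small = False
            self.done = True

    def handle_data(self, data):
        if self.in_small:
            self.parts.append(data)


def extract_fasta(html):
    """Return the FASTA header and sequence lines as one string."""
    parser = FastaExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


# Shortened sample with the same structure as the line in the question.
sample = ('<TR><TD><small style=font-family:courier> '
          '>M. tuberculosis H37Rv|Rv0153c|ptbB<br />MAVRELPGAWNF<br />'
          'AARQTIDETYGS<br /></small><TR><td><b><big>Blastp: '
          '<a href="x"> Pre-computed results</a></big></b>')
print(extract_fasta(sample))
```

Unlike stripping tags from the whole line, this ignores everything outside the <small> element, so the Blastp and TMHMM link text does not end up in the output.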

Answered 2013-10-23T14:47:51.000