
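I'm rewriting a badly written website that was originally coded in PHP.

I'm trying to isolate the text inside a p tag and would like to know how to get just the text portion. Any ideas?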

<p>
<span lang="EN-IE" xml:lang="EN-IE">

<br>
TEXT SAMPLE 1
<br>
<br>
TEXT SAMPLE 2

<span lang="EN-IE" xml:lang="EN-IE">TEXT SAMPLE 3
</span>,

<span lang="EN-IE" xml:lang="EN-IE">&nbsp;TEXT SAMPLE 4
</span>&nbsp;TEXT SAMPLE 5

<span lang="EN-IE" xml:lang="EN-IE">.&nbsp;</span>

</span><span lang="EN-IE" xml:lang="EN-IE">

<br>
<br>

TEXT SAMPLE 6
</span>

<span lang="EN-IE" xml:lang="EN-IE">&nbsp;</span>

TEXT SAMPLE 7


1 Answer


BeautifulSoup is a good place to start, in particular its get_text function.

This will print all the text in the snippet above:

from bs4 import BeautifulSoup

CONTENT = """
<p>
<span lang="EN-IE" xml:lang="EN-IE">

<br>
TEXT SAMPLE 1
<br>
<br>
TEXT SAMPLE 2

<span lang="EN-IE" xml:lang="EN-IE">TEXT SAMPLE 3
</span>,

<span lang="EN-IE" xml:lang="EN-IE">&nbsp;TEXT SAMPLE 4
</span>&nbsp;TEXT SAMPLE 5

<span lang="EN-IE" xml:lang="EN-IE">.&nbsp;</span>

</span><span lang="EN-IE" xml:lang="EN-IE">

<br>
<br>

TEXT SAMPLE 6
</span>

<span lang="EN-IE" xml:lang="EN-IE">&nbsp;</span>

TEXT SAMPLE 7
"""

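if __name__ == '__main__':
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BeautifulSoup(CONTENT, 'html.parser')
    print(soup.get_text())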

The output may need some string cleanup, since it contains a lot of blank lines, but this strips out the HTML.
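For that cleanup, something along these lines might do. It is only a sketch using the standard library: `tidy_text` is a made-up helper name, and it assumes you want one non-empty line per chunk of text, with the `&nbsp;` entities (which come through as U+00A0) treated as ordinary spaces.

```python
import re


def tidy_text(raw):
    """Collapse the whitespace left behind after the HTML tags are stripped."""
    # &nbsp; entities arrive as U+00A0; treat them as plain spaces.
    raw = raw.replace("\u00a0", " ")
    lines = []
    for line in raw.splitlines():
        # Squeeze runs of spaces and tabs, then trim both ends of the line.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:  # drop lines that are now empty
            lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in for the string returned by soup.get_text().
    sample = "\n\n\nTEXT SAMPLE 1\n\n\nTEXT SAMPLE 2\n\n\u00a0TEXT SAMPLE 4\n"
    print(tidy_text(sample))
```

In the script above you would call it as `print(tidy_text(soup.get_text()))`. If you would rather have everything on one line, joining the lines with a space instead of a newline should work just as well.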

answered 2013-01-21T03:52:30.747