I have the following HTML:

<ol>
<li>If someone is <b>able</b> to do something, they <a href="/wiki/can" title="can">can</a> do it.
<dl>
<dd><i>I'm busy today, so I won't be <b>able</b> to see you.</i></dd>
</dl>
</li>
</ol>

How can I extract only the text between the <li> tag and the <dl> tag?

I tried this:

from bs4 import BeautifulSoup

s = """<ol>
    <li>If someone is <b>able</b> to do something, they <a href="/wiki/can" title="can">can</a> do it.
    <dl>
    <dd><i>I'm busy today, so I won't be <b>able</b> to see you.</i></dd>
    </dl>
    </li>
    </ol>
"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(s, 'html.parser')

for line in soup.find_all('ol'):
    print(line.li.get_text())

This prints:

If someone is able to do something, they can do it.

I'm busy today, so I won't be able to see you.

I only want the first line:

If someone is able to do something, they can do it.

1 Answer

Loop over the descendants of line.li, collecting every NavigableString text object, and stop as soon as you reach the <dl> tag:

from bs4 import NavigableString

for line in soup.find_all('ol'):
    result = []
    for descendant in line.li.descendants:
        if isinstance(descendant, NavigableString):
            # Text node: keep it, trimmed of surrounding whitespace.
            result.append(str(descendant).strip())
        elif descendant.name == 'dl':
            # Everything from the <dl> onwards is unwanted.
            break

    print(' '.join(result))

Demo:

>>> for line in soup.find_all('ol'):
...     result = []
...     for descendant in line.li.descendants:
...         if isinstance(descendant, NavigableString):
...             result.append(str(descendant).strip())
...         elif descendant.name == 'dl':
...             break
...     print(' '.join(result))
...
If someone is able to do something, they can do it.

If you want to do this for every <li> tag (not just the first one), loop over the <li> tags found with .find_all():

for line in soup.find_all('ol'):
    for item in line.find_all('li'):
        result = []
        for descendant in item.descendants:
            if isinstance(descendant, NavigableString):
                result.append(str(descendant).strip())
            elif descendant.name == 'dl':
                break

        print(' '.join(result))
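
For reuse, the same walk can be wrapped in a small helper. The following is only a sketch under a few assumptions: the names text_before_tag, HTML and lines are hypothetical, it targets Python 3 with the built-in html.parser, and it differs from the loops above in one detail. It joins the raw strings first and normalizes whitespace once at the end, which should avoid a stray space when an inline tag is immediately followed by punctuation.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup taken from the question.
HTML = """<ol>
<li>If someone is <b>able</b> to do something, they <a href="/wiki/can" title="can">can</a> do it.
<dl>
<dd><i>I'm busy today, so I won't be <b>able</b> to see you.</i></dd>
</dl>
</li>
</ol>
"""


def text_before_tag(element, stop_name):
    """Return the text inside `element` that precedes the first `stop_name` tag.

    Hypothetical helper, not part of Beautiful Soup.
    """
    parts = []
    for descendant in element.descendants:
        if isinstance(descendant, NavigableString):
            # Keep the raw text; whitespace is normalized after joining.
            parts.append(str(descendant))
        elif descendant.name == stop_name:
            # Stop at the first matching tag in document order.
            break
    # Concatenate, then collapse every run of whitespace to a single space.
    return " ".join("".join(parts).split())


soup = BeautifulSoup(HTML, "html.parser")
lines = [text_before_tag(li, "dl") for li in soup.find_all("li")]
print(lines)
```

If an <li> contains no <dl> at all, the loop simply runs to the end and the helper returns the full text of that item.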
answered 2013-09-09T11:52:15.153