<p align="JUSTIFY"><a href="#abcd"> Mr A </a></p>
<p align="JUSTIFY">I </p>
<p align="JUSTIFY"> have a question </p>
<p align="JUSTIFY">&nbsp;</p>
<p align="JUSTIFY"><a href="#mnop"> Mr B </a></p>
<p align="JUSTIFY">The </p>
<p align="JUSTIFY">answer is</p>
<p align="JUSTIFY">not there</p>
<p align="JUSTIFY">&nbsp;</p>
<p align="JUSTIFY"><a href="wxyz"> Mr C </a></p>
<p align="JUSTIFY">Please</p>
<p align="JUSTIFY">Help</p>

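I want to split the text at each &nbsp; paragraph.

  • The first iteration should print: I have a question
  • The second iteration should print: The answer is not there
  • The names should also be extracted into a separate list, e.g. ['Mr A','Mr B','Mr C']

If anyone knows how to do this, it would be very helpful: I am learning Python and ran into this problem. The code I tried is: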

for t in soup.findAll('p',text = re.compile('&nbsp;'), attrs = {'align' : 'JUSTIFY'}):
    print t
    for item in t.parent.next_siblings:
        if isinstance(item, Tag):
            if 'p' in item.attrs and 'align' in item.attrs['p']:
                break
            print item

It returns [], which is not what I want.


2 Answers


You can do this with BeautifulSoup:

from bs4 import BeautifulSoup

s = ""

html = '<p align="JUSTIFY">I </p>\
<p align="JUSTIFY"> have a question </p>\
<p align="JUSTIFY">&nbsp;</p>\
<p align="JUSTIFY">The </p>\
<p align="JUSTIFY">answer is</p>\
<p align="JUSTIFY">not there</p>\
<p align="JUSTIFY">&nbsp;</p>\
<p align="JUSTIFY">Please</p>\
<p align="JUSTIFY">Help</p>'

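soup = BeautifulSoup(html, "html.parser")
title = soup.find_all("p", {"align": "JUSTIFY"})

# Join paragraphs with a space so adjacent words do not run together.
for i in title:
    s += i.get_text() + " "

# bs4 decodes &nbsp; to the character "\xa0", so split on that, not on the entity.
for i in s.split("\xa0"):
    print(" ".join(i.split()))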
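
The question also asks for the names in a separate list, and its sample HTML has an <a> tag at the start of each group. A minimal sketch of how that might be handled with bs4 follows. The function name split_speakers is hypothetical, and the sketch assumes that every name sits inside an <a> tag and that every separator paragraph contains only &nbsp;. It also hints at one likely reason the search for re.compile('&nbsp;') in the question finds nothing: by the time you search, the parser has already decoded the entity to "\xa0".

```python
from bs4 import BeautifulSoup

html = """
<p align="JUSTIFY"><a href="#abcd"> Mr A </a></p>
<p align="JUSTIFY">I </p>
<p align="JUSTIFY"> have a question </p>
<p align="JUSTIFY">&nbsp;</p>
<p align="JUSTIFY"><a href="#mnop"> Mr B </a></p>
<p align="JUSTIFY">The </p>
<p align="JUSTIFY">answer is</p>
<p align="JUSTIFY">not there</p>
<p align="JUSTIFY">&nbsp;</p>
<p align="JUSTIFY"><a href="wxyz"> Mr C </a></p>
<p align="JUSTIFY">Please</p>
<p align="JUSTIFY">Help</p>
"""


def split_speakers(markup):
    """Return (names, messages); hypothetical helper, not part of bs4."""
    soup = BeautifulSoup(markup, "html.parser")
    names, messages, current = [], [], []
    for p in soup.find_all("p", align="JUSTIFY"):
        link = p.find("a")
        text = p.get_text()
        if link is not None:
            # Assumption: a paragraph holding a link is a speaker name.
            names.append(link.get_text(strip=True))
        elif text.strip() == "":
            # "\xa0" counts as whitespace, so an &nbsp; paragraph ends a group.
            if current:
                messages.append(" ".join(current))
                current = []
        else:
            # Collapse internal whitespace before storing the fragment.
            current.append(" ".join(text.split()))
    if current:
        messages.append(" ".join(current))
    return names, messages


names, messages = split_speakers(html)
for message in messages:
    print(message)
print(names)
```

Grouping while iterating keeps each name lined up with its message by index, which a plain split on the joined text does not give you.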
answered 2013-08-08T10:56:00.827

Another approach, using regular expressions:

from re import sub

html = '<p align="JUSTIFY">I </p>\
<p align="JUSTIFY"> have a question </p>\
<p align="JUSTIFY">&nbsp;</p>\
<p align="JUSTIFY">The </p>\
<p align="JUSTIFY">answer is</p>\
<p align="JUSTIFY">not there</p>\
<p align="JUSTIFY">&nbsp;</p>\
<p align="JUSTIFY">Please</p>\
<p align="JUSTIFY">Help</p>'

# Replace every tag with a space, split on the literal entity, then collapse whitespace.
print([sub(r"\s+", " ", x).strip() for x in sub(r"<.*?>", " ", html).split("&nbsp;")])

Output:

['I have a question', 'The answer is not there', 'Please Help']
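
The same idea can probably be stretched to the full sample from the question, which also contains the names inside <a> tags. The sketch below is one possible way, not a general HTML parser. The function name names_and_messages is hypothetical, and it assumes the markup is as regular as the sample: no nested links, and &nbsp; appearing only in separator paragraphs.

```python
import re

html = """
<p align="JUSTIFY"><a href="#abcd"> Mr A </a></p>
<p align="JUSTIFY">I </p>
<p align="JUSTIFY"> have a question </p>
<p align="JUSTIFY">&nbsp;</p>
<p align="JUSTIFY"><a href="#mnop"> Mr B </a></p>
<p align="JUSTIFY">The </p>
<p align="JUSTIFY">answer is</p>
<p align="JUSTIFY">not there</p>
<p align="JUSTIFY">&nbsp;</p>
<p align="JUSTIFY"><a href="wxyz"> Mr C </a></p>
<p align="JUSTIFY">Please</p>
<p align="JUSTIFY">Help</p>
"""

LINK = r"<a\b[^>]*>(.*?)</a>"


def names_and_messages(markup):
    """Return (names, messages); hypothetical helper for illustration."""
    # Capture the text of every link as a name.
    names = [name.strip() for name in re.findall(LINK, markup, flags=re.S)]
    # Drop the links so the names do not leak into the messages.
    without_links = re.sub(LINK, " ", markup, flags=re.S)
    # Replace the remaining tags with spaces, then split on the literal entity.
    text = re.sub(r"<.*?>", " ", without_links, flags=re.S)
    messages = [" ".join(part.split()) for part in text.split("&nbsp;")]
    return names, [message for message in messages if message]


names, messages = names_and_messages(html)
print(messages)
print(names)
```

Because the regular expressions work on the raw string, the entity is still the literal text &nbsp; here, unlike in the parsed tree that BeautifulSoup produces.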
answered 2013-08-08T14:07:34.803