What would be the best way to split an HTML document/string on each occurrence of the <br> tag? My current solution is below, but it seems quite cumbersome and is not easy to read. I also experimented with regexes, but I'm told I should not use regexes to parse HTML.

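# Assumes BeautifulSoup 3 on Python 2; with bs4 use "from bs4 import BeautifulSoup, Tag"
from BeautifulSoup import BeautifulSoup, Tag

# soup is assumed to be a BeautifulSoup object built from the HTML shown below
for i, b_tag in enumerate(soup.findAll('b')):
    line_value = ''
    line_values = []
    node = b_tag.next
    while node:
        if isinstance(node, Tag) and node.name == 'br':
            # A <br> ends the current line
            line_values.append(line_value)
            line_value = ''
        else:
            # Re-parse the node to drop nested tags and keep only its text
            stripped_text = ''.join(BeautifulSoup(str(node).strip()).findAll(text=True))
            if stripped_text:
                line_value += stripped_text
        node = node.nextSibling
    print line_values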

Here's a sample of the HTML I'm parsing:

<p><font size="1" color="#800000"><b>09:00
  <font> - </font>
  11:00
  <br>
  CE4817
  <font> - </font>LAB <font>- </font>
  2A
  <br>
   B2043 B2042
  <br>

  Wks:1-13
  </b></font>
  </p>

And the current results of my code:

[u'09:00 - 11:00', u'CE4817 - LAB- 2A', u'B2043 B2042']
[u'11:00 - 12:00', u'CE4607 - TUT- 3A', u'A1054']

2 Answers

Split with a regular expression:

import re
p = re.compile(r'<br>')
filter(None, p.split(yourString))

You can then strip the other HTML tags from each string in the returned list.

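You can use an existing function, such as the one from "Strip HTML from strings in Python", or see my answer to the question about stripping HTML tags without using HtmlAgilityPack.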
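The split-then-strip approach described above might look like the following. This is only a sketch in Python 3: the helper name `split_on_br` and the shortened `sample` string are made up for illustration, and the tag-stripping regex is deliberately crude (it would break on a `>` inside an attribute value).

```python
import re

def split_on_br(html):
    """Split html on <br> tags, then strip leftover tags from each piece."""
    # Accept <br>, <br/> and <br /> in any letter case.
    parts = re.split(r'<br\s*/?>', html, flags=re.IGNORECASE)
    lines = []
    for part in parts:
        text = re.sub(r'<[^>]+>', '', part)  # crude removal of remaining tags
        text = ' '.join(text.split())        # collapse runs of whitespace
        if text:                             # skip empty pieces
            lines.append(text)
    return lines

# Hypothetical one-line version of the HTML from the question
sample = ('<b>09:00 <font> - </font> 11:00 <br> CE4817 <font> - </font>'
          'LAB <font>- </font> 2A <br> B2043 B2042 <br> Wks:1-13 </b>')
print(split_on_br(sample))
```

Unlike the loop in the question, this also keeps the text after the last <br> (here `Wks:1-13`).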

Also check this answer: RegEx match open tags except XHTML self-contained tags

You really should use an HTML parser for this task.
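As a rough illustration of the parser route, here is a sketch that uses only the standard library's `html.parser` in Python 3 rather than BeautifulSoup. The class name `BrSplitter` and the function `split_lines_with_parser` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

class BrSplitter(HTMLParser):
    """Collects text and starts a new line at every <br> tag."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._current = []

    def _flush(self):
        # Join the buffered text and collapse runs of whitespace.
        text = ' '.join(''.join(self._current).split())
        if text:
            self.lines.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; <br/> is routed here as well.
        if tag == 'br':
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()  # keep whatever follows the last <br>

def split_lines_with_parser(html):
    parser = BrSplitter()
    parser.feed(html)
    parser.close()
    return parser.lines

sample = """<p><font size="1" color="#800000"><b>09:00
  <font> - </font>
  11:00
  <br>
  CE4817
  <font> - </font>LAB <font>- </font>
  2A
  <br>
   B2043 B2042
  <br>

  Wks:1-13
  </b></font>
  </p>"""
print(split_lines_with_parser(sample))
```

Because the parser handles the tags, no separate tag-stripping step is needed; only <br> is treated as a line boundary and every other tag is ignored.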

answered 2012-09-24T15:21:27.880

Try this:

Regex

<p><font size="1" color="#800000"><b>(\d{2}:\d{2}).*?(\d{2}:\d{2}).*?(\w{2}\d{4}).*?<font> - </font>(\w+)\s*<font>- </font>\s*(\d\w)\s*<br>\s*(\w\d{4}\s*\w\d{4})\s*<br>[\s\S]*?</p>

Mode

m - multiline

This will work as long as the structure of the HTML does not change.
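For anyone trying this pattern in Python, a small sketch follows. One caveat: in Python's `re` module the multiline flag only changes how `^` and `$` behave, so this example assumes `re.DOTALL` is what is needed for `.*?` to cross the line breaks in the sample. The variable names are made up for illustration.

```python
import re

# The pattern from this answer, split across lines only for readability.
pattern = re.compile(
    r'<p><font size="1" color="#800000"><b>(\d{2}:\d{2}).*?(\d{2}:\d{2})'
    r'.*?(\w{2}\d{4}).*?<font> - </font>(\w+)\s*<font>- </font>\s*(\d\w)'
    r'\s*<br>\s*(\w\d{4}\s*\w\d{4})\s*<br>[\s\S]*?</p>',
    re.DOTALL,  # assumption: lets "." match newlines in Python
)

# The sample HTML from the question
sample = """<p><font size="1" color="#800000"><b>09:00
  <font> - </font>
  11:00
  <br>
  CE4817
  <font> - </font>LAB <font>- </font>
  2A
  <br>
   B2043 B2042
  <br>

  Wks:1-13
  </b></font>
  </p>"""

match = pattern.search(sample)
fields = match.groups() if match else None
print(fields)
# ('09:00', '11:00', 'CE4817', 'LAB', '2A', 'B2043 B2042')
```

With `re.MULTILINE` alone the same pattern finds no match on this sample, because the start and end times sit on different lines.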

answered 2012-09-24T15:22:04.917