5

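I'm using PHP and libtidy to try to screen-scrape what may be the most hideous, malformed use of HTML tables in history. The site leaves several table, tr, td, font, and bold tags unclosed, and it consistently nests many layers of tables within tables.

Sample snippet: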

<center>
<table border="1" bordercolor="#000000" cellspacing="0" cellpadding="0">
<tr>
<td width="50%">
<center>
Home Team - <b>Wildcats<td>
<center>
Away Team - <b>Polar Bears<tr>
<td colspan="2">
<center>
<b><font size="+1">Rosters<tr>
<td valign="top">
<center>
<table border="0" cellspacing="0">
<tr>
<td>
<font size="2">1&nbsp;<td>
<font size="2">Baird, T<tr>
<td>
<font size="2">2&nbsp;<td>
<font size="2">Knight, P<tr>
<td>
<font size="2">8&nbsp;<td>
<font size="2">Miller, B<tr>
<td>
<font size="2">9&nbsp;<td>
<font size="2">Huebsch, B<tr>
<td>
<font size="2">11&nbsp;<td>
<font size="2">Buschmann, C<tr>
<td>
<font size="2">12&nbsp;<td>
<font size="2">Reding, J<tr>
<td>
<font size="2">14&nbsp;<td>
<font size="2">Simpson, S<tr>
<td>
<font size="2">27&nbsp;<td>
<font size="2">Kupferschmidt, M<tr>
<td>
<font size="2">28&nbsp;<td>
<font size="2">Anderson, D<tr>
<td>
<font size="2">31&nbsp;<td>
<font size="2">Gehrts, J<tr>
<td>
<font size="2">39&nbsp;<td>
<font size="2">McGinnis, G<tr>
<td>
<font size="2">42&nbsp;<td>
<font size="2">Temple, B<tr>
<td>
<font size="2">44&nbsp;<td>
<font size="2">Kemplin, A<tr>
<td>
<font size="2">77&nbsp;<td>
<font size="2">Weiner, B<tr>
<td>
<font size="2">95&nbsp;<td>
<font size="2">
Zytkoskie, D</table>
<td valign="top">
<center>
<table border="0" cellspacing="0">
<tr>
<td>
<font size="2">5&nbsp;<td>
<font size="2">Mack, A<tr>
<td>
<font size="2">8&nbsp;<td>
<font size="2">Foucault, R<tr>
<td>
<font size="2">11&nbsp;<td>
<font size="2">Oberpriller, D *<tr>
<td>
<font size="2">12&nbsp;<td>
<font size="2">Underwood, J<tr>
<td>
<font size="2">15&nbsp;<td>
<font size="2">Oberpriller, M<tr>
<td>
<font size="2">19&nbsp;<td>
<font size="2">Langfus, B<tr>
<td>
<font size="2">25&nbsp;<td>
<font size="2">Carroll, R<tr>
<td>
<font size="2">30&nbsp;<td>
<font size="2">Hirdler, T<tr>
<td>
<font size="2">33&nbsp;<td>
<font size="2">Gibson, S<tr>
<td>
<font size="2">35&nbsp;<td>
<font size="2">Marthaler, C<tr>
<td>
<font size="2">44&nbsp;<td>
<font size="2">Yurik, J<tr>
<td>
<font size="2">58&nbsp;<td>
<font size="2">
Gronemeyer, S</table>
<tr>
<td colspan="2">
<center>
<b><font size="+1">Goals<tr>
<td valign="top">
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Player<td>
<b><font size="2">Period<td>
<b><font size="2">Time<td>
<b><font size="2">Assist 1<td>
<b><font size="2">Assist 2<td>
<b><font size="2">SH<td>
<b><font size="2">PP<tr>
<td nowrap>
<font size="2">Kupferschmidt,&nbsp;M<td>
<font size="2">1<td>
<font size="2">12:51<td nowrap>
<font size="2">Kemplin,&nbsp;A<td nowrap>
<font size="2">None<td>
<font size="2">
<center>
<td>
<font size="2">
<center>
<tr>
<td nowrap>
<font size="2">McGinnis,&nbsp;G<td>
<font size="2">1<td>
<font size="2">12:33<td nowrap>
<font size="2">Huebsch,&nbsp;B<td nowrap>
<font size="2">None<td>
<font size="2">
<center>
<td>
<font size="2">
<center>
<tr>
<td nowrap>
<font size="2">Kupferschmidt,&nbsp;M<td>
<font size="2">2<td>
<font size="2">16:01<td nowrap>
<font size="2">None<td nowrap>
<font size="2">None<td>
<font size="2">
<center>
<td>
<font size="2">
<center>
<tr>
<td nowrap>
<font size="2">Buschmann,&nbsp;C<td>
<font size="2">3<td>
<font size="2">00:38<td nowrap>
<font size="2">None<td nowrap>
<font size="2">None<td>
<font size="2">
<center>
<td>
<font size="2">
<center>
</table>
<td valign="top">
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Player<td>
<b><font size="2">Period<td>
<b><font size="2">Time<td>
<b><font size="2">Assist 1<td>
<b><font size="2">Assist 2<td>
<b><font size="2">SH<td>
<b><font size="2">PP<tr>
<td nowrap>
<font size="2">Oberpriller,&nbsp;D *<td>
<font size="2">3<td>
<font size="2">12:31<td nowrap>
<font size="2">Gronemeyer,&nbsp;S<td nowrap>
<font size="2">None<td>
<font size="2">
<center>
<td>
<font size="2">
<center>
</table>
<tr>
<td colspan="2">
<center>
<b><font size="+1">Penalties<tr>
<td valign="top">
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Player<td>
<font size="2"><b>Period<td>
<font size="2"><b>Minutes<td>
<font size="2"><b>Offense<td>
<font size="2"><b>Start<td>
<font size="2"><b>Expired<tr>
<td nowrap>
<font size="2">Buschmann,&nbsp;C<td>
<font size="2">
<center>
3<td>
<font size="2">
<center>
2<td>
<font size="2">Interference<td>
<font size="2">11:11<td>
<font size="2">09:11<tr>
<td nowrap>
<font size="2">Buschmann,&nbsp;C<td>
<font size="2">
<center>
3<td>
<font size="2">
<center>
2<td>
<font size="2">Unsportmanlike Conduct<td>
<font size="2">03:26<td>
<font size="2">01:26<tr>
<td nowrap>
<font size="2">Bench<td>
<font size="2">
<center>
3<td>
<font size="2">
<center>
2<td>
<font size="2">Too Many Men<td>
<font size="2">01:46<td>
<font size="2">
00:00</table>
<td valign="top">
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Player<td>
<font size="2"><b>Period<td>
<font size="2"><b>Minutes<td>
<font size="2"><b>Offense<td>
<font size="2"><b>Start<td>
<font size="2"><b>Expired<tr>
<td nowrap>
<font size="2">Marthaler,&nbsp;C<td>
<font size="2">
<center>
1<td>
<font size="2">
<center>
2<td>
<font size="2">Interference<td>
<font size="2">01:19<td>
<font size="2">16:19<tr>
<td nowrap>
<font size="2">Underwood,&nbsp;J<td>
<font size="2">
<center>
2<td>
<font size="2">
<center>
2<td>
<font size="2">Interference<td>
<font size="2">12:32<td>
<font size="2">10:32<tr>
<td nowrap>
<font size="2">Marthaler,&nbsp;C<td>
<font size="2">
<center>
3<td>
<font size="2">
<center>
2<td>
<font size="2">Interference<td>
<font size="2">11:39<td>
<font size="2">
09:39</table>
<tr>
<td colspan="2">
<center>
<font size="+1"><b>Goalies<tr>
<td>
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Name<td>
<font size="2"><b>Shots<td>
<font size="2"><b>Goals<tr>
<td>
<font size="2">Baird,&nbsp;T<td>
<font size="2">20<td>
<font size="2">1<tr>
<td>
<font size="2"><b>Open Net<td>
<td>
<font size="2">
0</table>
<td>
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Name<td>
<font size="2"><b>Shots<td>
<font size="2"><b>Goals<tr>
<td>
<font size="2">Hirdler,&nbsp;T<td>
<font size="2">42<td>
<font size="2">

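Amazingly, every browser seems to render this just fine. PHP's tidy extension manages to make sense of it all reasonably well, but the tables are nested so deeply, and almost at random, that traversing the result with DOM XPath is very difficult.

Does anyone have suggestions for other approaches?

Post-mortem: after drinking too much Belgian wheat beer and getting my hands dirty in the code, I ran the page through strip_tags() to remove every tag except table, tr, and td, then ran the result through libtidy, and got great results. The output is now nicely formatted and easy to traverse. It seems the markup just needed a little massaging before being sent to the parser.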
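
For readers working outside PHP, the pre-cleaning step described in the post-mortem might look roughly like the sketch below. It is written in Python, not the PHP actually used, and `strip_tags_except` is a hypothetical helper name. It assumes that tag attributes never contain a literal `>`, and unlike PHP's strip_tags() it does nothing special with comments.

```python
import re

# Matches any opening or closing tag and captures its name.
# Assumption: attribute values never contain a literal ">".
TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def strip_tags_except(html, allowed=("table", "tr", "td")):
    """Remove every tag whose name is not in `allowed`; keep all text."""
    allowed = {name.lower() for name in allowed}

    def keep_or_drop(match):
        # Keep the whole tag (attributes included) if it is allowed.
        return match.group(0) if match.group(1).lower() in allowed else ""

    return TAG_RE.sub(keep_or_drop, html)


sample = '<center><table border="0"><tr><td><font size="2">1&nbsp;<td><font size="2">Baird, T</table>'
print(strip_tags_except(sample))
# prints: <table border="0"><tr><td>1&nbsp;<td>Baird, T</table>
```

The output still has unclosed cells and rows, which is the part left for the tidy pass.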

5 Answers

3

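There are a few tricks you can use to clean up highly predictable structures such as tables. Before running HTML Tidy, you can use a regex or something similar to find every <tr> and <td> that is followed by another <tr> or <td>, and insert the corresponding closing tag before the second one. Handling a table nested inside a <td> takes a few extra tricks, but nothing impossible. Just start by finding the innermost structure and work outward from there.

The real headaches are things like unclosed <div>s and <p>s, which can be much harder to match with their corresponding (or missing) closers.

answered 2009-04-09T01:59:20.893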
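
One way to sketch the closer-insertion idea is shown below, in Python. Instead of a single regex replacement it uses a small tag scanner with one state entry per open table, which is a swapped-in variation that copes with nesting. The function name `close_table_tags` is hypothetical, and the sketch assumes that only table, tr, and td tags need repairing and that attributes contain no literal `>`.

```python
import re

# Matches only the three table tags; everything else passes through as text.
TAG_RE = re.compile(r"<(/?)(table|tr|td)\b[^>]*>", re.IGNORECASE)


def close_table_tags(html):
    """Insert the </td> and </tr> closers that the markup left out."""
    out = []
    stack = []  # one entry per open <table>: is a row / a cell open?
    pos = 0

    def close_open(state, row=True):
        # Close the open cell and, optionally, the open row at this level.
        if state["cell"]:
            out.append("</td>")
            state["cell"] = False
        if row and state["row"]:
            out.append("</tr>")
            state["row"] = False

    for match in TAG_RE.finditer(html):
        out.append(html[pos:match.start()])
        pos = match.end()
        is_closer = match.group(1) == "/"
        name = match.group(2).lower()
        state = stack[-1] if stack else None

        if name == "table" and not is_closer:
            stack.append({"row": False, "cell": False})
            out.append(match.group(0))
        elif state is None:
            continue  # table-related tag outside any table: drop it
        elif name == "table":
            close_open(state)
            stack.pop()
            out.append("</table>")
        elif is_closer:
            # Explicit closer: emit only for what is actually open.
            close_open(state, row=(name == "tr"))
        elif name == "tr":
            close_open(state)
            state["row"] = True
            out.append(match.group(0))
        else:  # <td>
            close_open(state, row=False)
            if not state["row"]:
                out.append("<tr>")  # cell appeared without a row
                state["row"] = True
            state["cell"] = True
            out.append(match.group(0))

    out.append(html[pos:])
    # Close anything still open at end of input, innermost table first.
    while stack:
        close_open(stack.pop())
        out.append("</table>")
    return "".join(out)


print(close_table_tags("<table><tr><td>1<td>2<tr><td>3</table>"))
# prints: <table><tr><td>1</td><td>2</td></tr><tr><td>3</td></tr></table>
```

Because each nested table gets its own entry on the stack, a cell that contains an inner table is closed only after that inner table ends.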
2

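If you're open to other languages such as Python, Beautiful Soup is very good at restructuring poorly written HTML. I just tried running your HTML through the following snippet, and the result is quite readable.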

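#!/usr/bin/env python3

# Requires the beautifulsoup4 package, the successor to the old BeautifulSoup module.
from bs4 import BeautifulSoup

html = "long string of html"
# html.parser ships with Python; lxml or html5lib may repair broken tables differently.
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())
answered 2009-04-09T02:02:22.027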
2

If you're only after the data, I would strip out all the HTML and process what remains as raw, line-by-line input. You can use the strip_tags function.

$clean = strip_tags($input);

// example: <p>Test paragraph.</p> <a href="#fragment">Other text</a>
// returns: Test paragraph. Other text
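
As a rough illustration of the line-by-line idea, here is a Python sketch. It is not the PHP function above, and `text_lines` is a hypothetical name. It assumes the source keeps one value per line, as the sample does, and that attributes contain no literal `>`.

```python
import html
import re


def text_lines(markup):
    """Drop every tag, decode entities, and return the non-empty lines."""
    text = re.sub(r"<[^>]*>", "", markup)
    # &nbsp; decodes to U+00A0; turn it into a plain space.
    text = html.unescape(text).replace("\xa0", " ")
    return [line.strip() for line in text.splitlines() if line.strip()]


sample = '<td>\n<font size="2">1&nbsp;<td>\n<font size="2">Baird, T<tr>'
print(text_lines(sample))
# prints: ['1', 'Baird, T']
```

The caller still has to know which lines belong together, for example that roster lines alternate between number and name.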
answered 2009-04-09T04:19:54.177
0

Perhaps you'd have better luck scraping the results you need with regular expressions rather than parsing the page as XML.

answered 2009-04-09T01:58:14.890
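
For example, the roster rows in the sample follow a fixed pattern (a number, `&nbsp;`, a new cell, then a name), so a pattern like the one below could pull them out directly. This is only a sketch in Python: `ROSTER_RE` and `roster_entries` are hypothetical names, and the pattern assumes the exact `<font size="2">` markup shown in the question.

```python
import re

# A jersey number followed by &nbsp;, then the next cell holding the name.
ROSTER_RE = re.compile(
    r'<font size="2">(\d+)&nbsp;<td>\s*<font size="2">\s*([^<]+?)\s*<'
)


def roster_entries(markup):
    """Return (number, name) pairs for every roster row found."""
    return [(int(num), name) for num, name in ROSTER_RE.findall(markup)]


sample = (
    '<td>\n<font size="2">1&nbsp;<td>\n<font size="2">Baird, T<tr>\n'
    '<td>\n<font size="2">95&nbsp;<td>\n<font size="2">\nZytkoskie, D</table>'
)
print(roster_entries(sample))
# prints: [(1, 'Baird, T'), (95, 'Zytkoskie, D')]
```

A pattern this tightly coupled to the markup will silently miss rows if the site changes its formatting, so it is worth checking the number of matches against what the page shows.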
0

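I used XPath and Python's lxml library to parse the IMDB Top 250 page. Take a look at the source yourself to see how bad it is.

The following code parses a saved copy of the IMDB Top 250 page (top250.html) and stores the extracted information in a SQLite database (top250.db).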

import sqlite3
from lxml import html

tree = html.parse('top250.html')

class TopMovie(object):
    base_xpath = "/html/body/div/div[2]/layer/div[3]/table/tr/td[3]/div/table/tr/td/table/tr[%d]"

    def __init__(self, num):
        self.rank = num
        self.xpath = self.base_xpath % (self.rank + 1)

    def rating(self):
        return tree.xpath(self.xpath + '/td[2]/font')[0].text

    def link(self):
        return tree.xpath(self.xpath + '/td[3]/font/a')[0].values()[0]

    def title(self):
        return tree.xpath(self.xpath + '/td[3]/font')[0].text_content()

    def votes(self):
        return tree.xpath(self.xpath + '/td[4]/font')[0].text


def main():
    conn = sqlite3.connect('top250.db')
    conn.execute("""DROP TABLE IF EXISTS movies""")
    conn.execute("""
        CREATE TABLE movies (
            id INTEGER PRIMARY KEY,
            title TEXT,
            link TEXT,
            rating TEXT,
            votes INTEGER
        )""")

    for n in range(1, 251):
        m = TopMovie(n)
        # Use placeholders so quotes in a title cannot break the statement.
        conn.execute(
            'INSERT INTO movies VALUES (?, ?, ?, ?, ?)',
            (n, m.title(), m.link(), m.rating(), m.votes().replace(',', '')))

    conn.commit()
    conn.close()


if __name__ == "__main__":
    main()
answered 2009-04-09T04:37:34.773