html - How to fix non-conforming HTML so that Expat can parse it (htmltidy doesn't work)

Question

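I'm trying to scrape information from http://www.nfl.com/scores (in particular, to find out when a game is over, so my computer can stop recording). I can download the HTML easily enough, and it claims to be standards-compliant: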

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

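But

An attempt to parse it with Expat results in the error not well-formed (invalid token).
The W3C's online validation service reports 399 errors and 121 warnings.
I tried running HTML Tidy (just called tidy) on my Linux system with the -xml option, but tidy reports 56 warnings and 117 errors and cannot recover a good XML file. The errors look like this: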
```
line 409 column 122 - Warning: unescaped & or unknown entity "&role"
...
line 409 column 172 - Warning: unescaped & or unknown entity "&tabSeq"
...
line 1208 column 65 - Error: unexpected </td> in <br>
line 1209 column 57 - Error: unexpected </tr> in <br>
line 1210 column 49 - Error: unexpected </table> in <br>
```
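But when I check the input, the "unknown entities" appear to be part of a properly quoted URL, so I don't know whether a double quote is missing somewhere or what.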

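I know there's something out there that can parse this stuff, because both Firefox and w3m display something sensible. What tool can fix the non-compliant HTML so that I can parse it with Expat?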
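
The tidy warnings are consistent with bare & characters in query strings. A URL can be quoted correctly for HTML and still break an XML parser, because XML requires & to be written as &amp; even inside attribute values. The sketch below shows Expat rejecting and accepting two versions of one link; the link itself is hypothetical and is not taken from the nfl.com page, so this may or may not be the only thing Expat objects to there.

```python
import xml.parsers.expat


def check_well_formed(document):
    """Return (True, None) if Expat accepts the document, else (False, message)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)
    except xml.parsers.expat.ExpatError as err:
        return False, str(err)
    return True, None


# Hypothetical link shaped like the ones tidy warns about
RAW = '<a href="/scores?week=12&role=home">Jets</a>'
ESCAPED = '<a href="/scores?week=12&amp;role=home">Jets</a>'

print(check_well_formed(RAW))      # rejected: "&role" is not a valid entity reference
print(check_well_formed(ESCAPED))  # accepted
```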

score 4 · Accepted Answer

They're using some kind of Javascript on the score boxes, so you'll have to play smarter tricks (line breaks mine):

/* box of awesome */
// iscurrentweek ? true;
(new nfl.scores.Game('2009112905','54635',{state:'pre',container:'scorebox-2009112905',
wrapper:'sb-wrapper-2009112905',template:($('scorebox-2009112905').innerHTML),homeabbr:'NYJ',
awayabbr:'CAR'}));
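
One such trick, sketched here on the assumption that the page keeps embedding constructor calls shaped exactly like the one above, is to pull the game id and state straight out of the page source with a regular expression. The function name and the group names are illustrative, and the only state value seen here is 'pre'; which other values the site uses is not known from this snippet.

```python
import re

# The call quoted above, including the added line breaks
SNIPPET = """(new nfl.scores.Game('2009112905','54635',{state:'pre',container:'scorebox-2009112905',
wrapper:'sb-wrapper-2009112905',template:($('scorebox-2009112905').innerHTML),homeabbr:'NYJ',
awayabbr:'CAR'}));"""

# Matches the start of each call: first id, second id, then the state field
GAME_CALL = re.compile(
    r"nfl\.scores\.Game\('(?P<game_id>\d+)','(?P<other_id>\d+)',\{state:'(?P<state>\w+)'"
)


def game_states(page_source):
    """Map each game id to the state string found in its constructor call."""
    return {m.group("game_id"): m.group("state")
            for m in GAME_CALL.finditer(page_source)}


print(game_states(SNIPPET))
```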

But, to answer your question, BeautifulSoup parses it (seemingly) fine:

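# Python 2 with BeautifulSoup 3 imports (assumed from the print statement below)
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

fp = urlopen("http://www.nfl.com/scores")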
data = ""
while 1:
    r = fp.read()
    if not r:
        break
    data += r
fp.close()

soup = BeautifulSoup(data)
print soup.contents[2].contents[1].contents[1]

Output:

<title>NFL Scores: 2009 - Week 12</title>
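
If the goal really is to hand Expat something well-formed, the same idea (parse leniently, then write the tree back out strictly) can be sketched with only the Python 3 standard library. This is a rough illustration, not what BeautifulSoup does internally: the class name, the repair rules, and the sample markup are all assumptions, and the sketch drops comments and the doctype, does not deduplicate attributes, and expects the input to have a single root element.

```python
from html import escape
from html.parser import HTMLParser
import xml.parsers.expat

# Elements that never take a closing tag in HTML
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class XmlRepairer(HTMLParser):
    """Re-emit leniently parsed HTML as well-formed XML text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pieces = []     # output fragments, joined at the end
        self.open_tags = []  # stack of elements that are still open

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive unescaped, so escape &, <, > and quotes again
        attr_text = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
        )
        if tag in VOID_TAGS:
            self.pieces.append("<%s%s/>" % (tag, attr_text))
        else:
            self.pieces.append("<%s%s>" % (tag, attr_text))
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag (including </br>): drop it
        # Close everything opened since the matching start tag
        while True:
            open_tag = self.open_tags.pop()
            self.pieces.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.pieces.append(escape(data, quote=False))

    def result(self):
        self.close()  # flush any buffered text first
        closing = "".join("</%s>" % t for t in reversed(self.open_tags))
        return "".join(self.pieces) + closing


def repair(html_text):
    repairer = XmlRepairer()
    repairer.feed(html_text)
    return repairer.result()


# Hypothetical markup with a bare & and an unclosed <a>, like the tidy errors
BROKEN = '<table><tr><td><a href="/scores?week=12&role=home">Jets<br></td></tr></table>'
fixed = repair(BROKEN)
xml.parsers.expat.ParserCreate().Parse(fixed, True)  # raises ExpatError if still malformed
print(fixed)
```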

Seems to me it would probably be easier to scrape Yahoo's NFL scoreboard... in fact, I'm off to try it.

EDIT: I used your question as an excuse to learn BeautifulSoup. Alex Martelli is always singing its praises, so I figured it was worth a try, and man, am I impressed.

Anyway, I was able to cook up a rudimentary score scraper from the Yahoo! scoreboard, like so:

def main():
    # YAHOO_SCOREBOARD stands for the Yahoo! scoreboard page's HTML (its definition is not shown)
    soup = BeautifulSoup(YAHOO_SCOREBOARD)
    on_first_team = True
    scores = []
    hold = None

    # Iterate the tr that contains a team's box score
    for item in soup(name="tr", attrs={"align": "center", "class": "ysptblclbg5"}):
        # Easy
        team = item.b.a.string

        # Get the box scores since we're industrious
        boxscore = []
        for quarter in item(name="td", attrs={"class": "yspscores"}):
            boxscore.append(int(quarter.string))

        # Final score
        sub = item(name="span", attrs={"class": "yspscores"})[0]
        if sub.b:
            # Winning score
            final = int(sub.b.string)
        else:
            data = sub.string.replace("&nbsp;", "")
            if ":" in data:
                # Catch TV: XXX and 0:00pm ET
                final = None
            else:
                try:
                    final = int(data)
                except ValueError:
                    # Not a number, so there is no score to report yet
                    final = None

        if on_first_team:
            hold = { team : (boxscore, final) }
            on_first_team = False
        else:
            hold[team] = (boxscore, final)
            scores.append(hold)
            on_first_team = True

    for game in scores:
        print "--- Game ---"
        for team in game:
            print team, game[team]

I'll tinker with it on Sunday to see how it holds up, because it's really rough. Here's what it outputs right now:

--- Game ---
Green Bay ([0, 13, 14, 7], 34)
Detroit ([7, 0, 0, 5], 12)
--- Game ---
Oakland ([0, 0, 7, 0], 7)
Dallas ([3, 14, 0, 7], 24)

Look at that, I snagged the box scores too... for games that haven't happened yet, we get:

--- Game ---
Washington ([], None)
Philadelphia ([], None)
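
Given that shape, a small helper can at least tell which games have not started. This is a sketch in Python 3 that assumes each game is a dict of team name to (box score, final), exactly as printed above; has_started is a made-up name. Note that it cannot tell a game in progress from a finished one.

```python
def has_started(game):
    """True if any team in the game has a quarter score or a final score."""
    return any(boxscore or final is not None
               for boxscore, final in game.values())


# The two shapes shown in the output above
played = {"Green Bay": ([0, 13, 14, 7], 34), "Detroit": ([7, 0, 0, 5], 12)}
upcoming = {"Washington": ([], None), "Philadelphia": ([], None)}

print(has_started(played), has_started(upcoming))
```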

Anyhow, that should give you a peg to jump off from. Good luck.

score 3 · Accepted Answer

There is a Flash-based, auto-updating scoreboard at the top of nfl.com. Some monitoring of its network traffic turns up:

http://www.nfl.com/liveupdate/scorestrip/ss.xml

This is probably easier to parse than the HTML scoreboard.
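
A minimal Python 3 sketch of reading such a feed to find finished games follows. The feed's layout is not described here, so everything about the sample is an assumption: the g elements, the eid and q attributes, and the "F"/"FO" values meaning "final" are guesses that need checking against the real file, and the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample; element and attribute names are assumptions, not the confirmed schema
SAMPLE_FEED = """<ss>
  <gms w="12" y="2009">
    <g eid="2009112905" q="F" h="NYJ" v="CAR"/>
    <g eid="2009112906" q="2" h="PHI" v="WAS"/>
  </gms>
</ss>"""

# Assumed markers for a finished game (regulation and overtime)
FINAL_STATES = {"F", "FO"}


def finished_games(feed_xml):
    """Return the ids of games whose assumed status attribute marks them as final."""
    root = ET.fromstring(feed_xml)
    return [g.get("eid") for g in root.iter("g") if g.get("q") in FINAL_STATES]


print(finished_games(SAMPLE_FEED))
```

Because the feed is already XML, this path uses Expat (through ElementTree) with no repair step at all, provided the real file is well-formed.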

score 2 · Accepted Answer

Check out tagsoup. If you want to end up with a DOM tree or a SAX stream in Java, this is the ticket. If you just want to pull out specific information, Beautiful Soup is a Beautiful Thing.
