C# - Problems scraping a web page with malformed content

Question

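I have written C# code that uses the HtmlAgilityPack library to scrape the page located at: World's Largest Urban Areas (Page 2). Unfortunately, the page contains malformed content.

I am stuck on how to scrape this page. My current code (shown below) freezes while parsing the HTML: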

 HtmlNodeCollection cityRecords = _htmlDocument.DocumentNode.SelectNodes("//table[@class='boldtable']//tr[position() != 1]");
 CityNodes = (from node in cityRecords.Descendants()
              where node.Name == "td"
              select node).ToList();

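The goal is to parse every city listed on the page, along with each of its data points; that's all. I am looking for advice on how to modify the code above, or on another freely available library I could use instead.

Thanks!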

score 3 · Accepted Answer

Run the content through HTML Tidy before parsing it.

http://tidy.sourceforge.net/
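
One way to do this from C# is to shell out to the Tidy command-line tool and hand its cleaned output to HtmlAgilityPack. The following is only a minimal sketch under a few assumptions: a `tidy` executable is installed and reachable on the PATH, the page is UTF-8, and the names `HtmlTidyRunner`, `BuildArguments`, and `Clean` are hypothetical helpers, not part of either library. Temporary files are used instead of pipes so the process cannot stall on a full output buffer.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;

// Hypothetical helper (not part of HtmlAgilityPack or Tidy).
public static class HtmlTidyRunner
{
    // Builds the Tidy command line: quiet mode, UTF-8 in and out, XHTML output,
    // and output forced even when Tidy reports errors in the markup.
    public static string BuildArguments(string inputPath, string outputPath)
    {
        return "-q -utf8 -asxhtml --force-output yes -o \"" + outputPath + "\" \"" + inputPath + "\"";
    }

    // Writes the raw HTML to a temp file, runs Tidy on it, and returns the cleaned markup.
    // Assumption: tidyExe ("tidy" by default) can be resolved on the PATH.
    public static string Clean(string rawHtml, string tidyExe = "tidy")
    {
        string inputPath = Path.GetTempFileName();
        string outputPath = Path.GetTempFileName();
        try
        {
            // UTF-8 without a byte-order mark, matching the -utf8 flag.
            File.WriteAllText(inputPath, rawHtml, new UTF8Encoding(false));

            var startInfo = new ProcessStartInfo
            {
                FileName = tidyExe,
                Arguments = BuildArguments(inputPath, outputPath),
                UseShellExecute = false,
                CreateNoWindow = true
            };

            using (Process tidy = Process.Start(startInfo))
            {
                tidy.WaitForExit();
                // Tidy exits with 0 (clean), 1 (warnings) or 2 (errors);
                // anything else is treated here as a failed run.
                if (tidy.ExitCode < 0 || tidy.ExitCode > 2)
                {
                    throw new InvalidOperationException("tidy exited with code " + tidy.ExitCode);
                }
            }

            string cleaned = File.ReadAllText(outputPath, Encoding.UTF8);
            if (cleaned.Length == 0)
            {
                throw new InvalidOperationException("tidy produced no output");
            }
            return cleaned;
        }
        finally
        {
            File.Delete(inputPath);
            File.Delete(outputPath);
        }
    }
}
```

With this in place, the cleaned markup can be loaded with `_htmlDocument.LoadHtml(HtmlTidyRunner.Clean(rawHtml))` before running the XPath query from the question. Note that `SelectNodes` returns `null` when nothing matches, so it is worth checking `cityRecords` before calling `Descendants()` on it.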

Answered 2009-12-15T16:13:21.997
