r - Handling HTML web scraping errors in R with the XML package

Question

I am trying to scrape a web page such as http://www.weatheroffice.gc.ca/city/pages/on-135_metric_e.html. With the following code, I get an error saying the HTML is malformed:

library(RCurl)
library(XML)
weather <- getURL("http://www.weatheroffice.gc.ca/city/pages/on-135_metric_e.html")
doc <- htmlParse(weather)

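I have seen this post, which shows how to use Internet Explorer and the rcom package to repair malformed HTML before feeding it to the parser. However, the HTML in question passes validation at http://validator.w3.org.

What other ways are there to handle HTML parsing errors with the XML package?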

score 2 · Accepted Answer

Give this a whirl and see if it does what you're after:

library(RCurl)
library(XML)
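# Page to scrape
url   <- "http://www.weatheroffice.gc.ca/city/pages/on-135_metric_e.html"
# Let the XML package read the URL itself; useInternalNodes=TRUE returns a document usable with XPath
doc   <- htmlTreeParse(url, useInternalNodes=TRUE)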
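If you would rather keep the getURL() step from the question, one thing to try is telling the parser explicitly that it is being handed page content rather than a file name or URL. Here is a minimal sketch of that idea. I can't confirm it is the cause of the error in the question, and the sample HTML string is made up for illustration; it is not the weather page.

```r
library(XML)

# Hypothetical sample markup with an unclosed <p>; it stands in for
# page content that was already downloaded (e.g. with getURL()).
html <- "<html><body><p>Unclosed paragraph<div>Temp: 21</div></body></html>"

# asText = TRUE: treat the first argument as HTML content, not a file name or URL.
# tryCatch() turns a hard parse failure into NULL so a script can carry on.
doc <- tryCatch(
  htmlParse(html, asText = TRUE),
  error = function(e) {
    message("Parsing failed: ", conditionMessage(e))
    NULL
  }
)

# Query the parsed tree with XPath only if parsing succeeded.
if (!is.null(doc)) {
  temps <- xpathSApply(doc, "//div", xmlValue)
  print(temps)
}
```

The HTML parser is fairly forgiving about unclosed tags, so tryCatch() is mostly a safety net for failures the parser cannot recover from.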

I also suggest you check out these resources:
