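I am trying to scrape a UTF-8 encoded website with the httr package, but apparently the package's content function only lets you specify the encoding when parsing the site as text. Unfortunately, I can't parse it as text, because I want to run xpath queries on it afterwards. Here is an example: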
library(XML)
library(httr)
page <- GET("http://ec.europa.eu/archives/commission_2004-2009/index_en.htm")
test <- content(page, as = "parsed")
# Get a list of names, many of which contain non-standard characters
xpathSApply(test, "//img", xmlGetAttr, "alt")
# This gives the correct encoding, but outputs a character vector,
# on which I cannot use xpath queries
test <- content(page, as = "text", encoding = "utf-8")
Update:
# htmlParse returns a parsed document, but the non-standard characters are
# not properly encoded, i.e. the result is the same whether or not I specify the
# "encoding" argument
test <- htmlParse(page, encoding = "UTF-8")
# Non-standard characters in names still not properly encoded
xpathSApply(test, "//img", xmlGetAttr, "alt")
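One thing that might be worth trying, as a rough sketch rather than a confirmed fix: decode the response to text first (where the encoding can be specified), then hand that string to htmlParse with asText = TRUE and an explicit encoding, so the result is still a parsed document that xpath can query. The helper name parse_utf8_html and the sample_html string below are made up for illustration; the sample stands in for the downloaded page so the snippet runs without network access.

```r
library(XML)

# Hypothetical helper (not part of httr or XML): parse an HTML string
# while telling the parser explicitly that the text is UTF-8.
parse_utf8_html <- function(txt) {
  htmlParse(txt, asText = TRUE, encoding = "UTF-8")
}

# Stand-in for the downloaded page so the sketch runs offline; the names
# use \u escapes so that this snippet itself stays ASCII-only.
sample_html <- paste0(
  "<html><body>",
  "<img alt=\"Jos\u00e9 Manuel Barroso\"/>",
  "<img alt=\"J\u00e1n Fige\u013e\"/>",
  "</body></html>"
)

doc <- parse_utf8_html(sample_html)

# The result is a parsed document, so xpath queries work on it
alts <- xpathSApply(doc, "//img", xmlGetAttr, "alt")
print(alts)

# With the real response from the example above, the same idea would
# presumably look like this (untested assumption):
# test <- parse_utf8_html(content(page, as = "text", encoding = "UTF-8"))
```

The point of going through text is that the encoding is fixed before the parser sees the document, instead of relying on htmlParse to work it out from the response object.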