r - How to convert special characters from web scraping in R?

Question

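I'm learning how to scrape the web with the XML and RCurl packages. Everything works well except for one thing: special characters such as ö or č are read into R differently. For example, í is read in as ÃƒÂ. I assume the latter is some kind of HTML encoding of the former.

I have been looking for a way to convert these characters but haven't found one yet. I'm sure others have run into this problem, and I suspect there must be some function that converts these characters. Does anyone know a solution? Thanks in advance.

Here is a code example; sorry I didn't provide one earlier.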

library(XML)
url <- 'http://en.wikipedia.org/wiki/2000_Wimbledon_Championships_%E2%80%93_Men%27s_Singles'
tables <- readHTMLTable(url)
Sec <- tables[[6]]
pl1R1 <- unlist(strsplit(as.character(Sec[,2]), ' '))[seq(2,32, 4)]
enc2utf8(pl1R1) # does not seem to work

score 0 · Accepted Answer

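Try parsing the page first with the encoding specified, and then read the tables from the parsed document, as shown in: readHTMLTable and UTF-8 encoding.

An example might be: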

library(XML)
url <- "http://en.wikipedia.org/wiki/2000_Wimbledon_Championships_%E2%80%93_Men%27s_Singles"
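doc <- htmlParse(url, encoding = "UTF-8")  # declaring the encoding preserves the special characters
# readHTMLTable returns a list of data frames; keep it as a list so [[6]] picks the sixth table
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
Sec <- tables[[6]]
# not sure what you're trying to do here though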
pl1R1 <- unlist(strsplit(as.character(Sec[,2]), ' '))[seq(2,32, 4)]
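To illustrate why declaring the encoding matters, and why enc2utf8() alone does not help, here is a minimal offline sketch in base R. It assumes the garbling comes from UTF-8 bytes being decoded as Latin-1, which is a common cause but may not be exactly what happened on the page above. The variable names (original, misread, repaired) are made up for the illustration.

```r
# "í" is stored in UTF-8 as the two bytes C3 AD.
original <- "\u00ed"
utf8_bytes <- charToRaw(enc2utf8(original))

# Simulate the misreading (assumption: the bytes were treated as Latin-1),
# so each byte becomes its own character.
misread <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")

# enc2utf8() only converts from the declared encoding; the string is already
# marked UTF-8, so it comes back unchanged and still garbled.
unchanged <- enc2utf8(misread)

# Repair after the fact: turn the characters back into the original bytes,
# then declare those bytes to be UTF-8.
repaired <- iconv(misread, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"
```

Repairing strings afterwards like this is fragile, so passing the encoding to htmlParse() as in the answer is usually the safer choice.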
