r - Why do garbled characters appear?

Question

Why do I get garbled characters when I parse a web page?

I used encoding="big-5\\IGNORE" hoping to get normal characters, but it does not work.

require(XML)
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
options(encoding="big-5")
data=htmlParse(url,isURL=TRUE,encoding="big-5\\IGNORE")
tdata=xpathApply(data,"//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)

How should I modify my code so that the garbled characters become normal characters?

@MartinMorgan (below) suggested using

htmlParse(url,isURL=TRUE,encoding="big-5")

Here is an example of what happens:

require(XML)
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
options(encoding="big-5")
data=htmlParse(url,isURL=TRUE,encoding="big-5")
tdata=xpathApply(data,"//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)
stock

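The total number of records should be 1335. In the case above there are 309, so many records appear to have been lost.

This is a complicated problem. There are several issues:

Malformed HTML file

The page is not a standard web page: it is not a well-formed HTML file. Let me prove my point.
Please run: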

url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
txt=download.file(url,destfile="stockbig-5",quiet = TRUE)

What does the downloaded file stockbig-5 look like when you open it in Firefox?

The iconv function in R

If the HTML file is well-formed, you can use

data=readLines(file)
datachange=iconv(data,from="source encode",to="target encode//IGNORE")
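As a concrete instance of the call above, here is a minimal sketch that converts a short Latin-1 string to UTF-8. The variable names and the sample text are illustrative assumptions, not taken from the question; the string is built from raw bytes so the example does not depend on the encoding of the source file.

```r
# "café" as Latin-1 bytes: 0xe9 is "é" in Latin-1
latin1_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Convert from the source encoding to the target encoding
utf8_text <- iconv(latin1_text, from = "latin1", to = "UTF-8")

# In UTF-8 the same character occupies two bytes (0xc3 0xa9)
print(charToRaw(utf8_text))
```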

When the HTML file is not well-formed, you cannot do this. In this example, please run:

data=readLines("stockbig-5")

The following warning is raised:

1: In readLines("stockbig-5") :  
  invalid input found on input connection 'stockbig-5'
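To find which lines contain bytes that are invalid in the assumed source encoding, one option is to rely on iconv returning NA for any element it cannot convert. The sketch below uses a hypothetical helper name, find_unconvertible, and made-up sample lines; it assumes the lines were read as plain bytes without re-encoding.

```r
# Hypothetical helper: indices of the lines that cannot be converted from `from`
find_unconvertible <- function(lines, from) {
  converted <- iconv(lines, from = from, to = "UTF-8")
  which(is.na(converted))   # iconv() yields NA where conversion fails
}

# Sample input: the second element contains 0xff, which is never valid in UTF-8
sample_lines <- c("ok", rawToChar(as.raw(c(0x61, 0xff))), "fine")
bad <- find_unconvertible(sample_lines, from = "UTF-8")
print(bad)
```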

You cannot use the iconv function in R to change the encoding of a malformed HTML file.

However, you can do this in the shell.

score 2 · Accepted Answer

I worked this out myself over an evening; it was hard.
System: debian6 (locale utf-8) + R 2.15 (locale utf-8) + gnome terminal (locale utf-8).
Here is the code:

require(XML)
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
txt=download.file(url,destfile="stockbig-5",quiet = TRUE)
system('iconv -f big-5  -t  UTF-8//IGNORE    stockbig-5  > stockutf-8')
data=htmlParse("stockutf-8",isURL=FALSE,encoding="utf-8\\IGNORE")
tdata=xpathApply(data,"//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)
stock

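I would like my code to be more elegant; a shell command inside R code is ugly:

system('iconv -f big5 -t UTF-8//IGNORE stockgb2312 > stockutf-8')

I tried to replace it with pure R code and failed. How can I replace it with pure R code? You can use the code above to reproduce the result on your own computer. It is half done and half successful; I will keep trying.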
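One possible pure-R direction, offered as a sketch rather than a verified fix for this page: read the file as raw bytes with readBin, then let iconv drop the bytes it cannot convert by passing sub = "". The helper name convert_file_to_utf8 is hypothetical, and the demonstration uses a small temporary file instead of the HKEX page. Whether this recovers all 1335 records from stockbig-5 has not been checked here.

```r
# Hypothetical helper (not from the original answer): read a file as raw bytes
# and convert it to UTF-8, dropping any bytes that cannot be converted.
convert_file_to_utf8 <- function(path, from) {
  size <- file.info(path)$size
  bytes <- readBin(path, what = "raw", n = size)
  # iconv() accepts a list of raw vectors; sub = "" removes non-convertible bytes
  iconv(list(bytes), from = from, to = "UTF-8", sub = "")
}

# Self-contained demonstration: a temporary file with one invalid byte
demo_path <- tempfile()
writeBin(as.raw(c(0x61, 0xff, 0x62)), demo_path)   # "a", invalid byte, "b"
demo_text <- convert_file_to_utf8(demo_path, from = "UTF-8")
print(demo_text)
```

For the page in question, the call would presumably be convert_file_to_utf8("stockbig-5", from = "BIG5"), assuming the platform's iconv recognises that encoding name, and the resulting string could then be passed to htmlParse with asText=TRUE.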
