
I get a strange encoding problem when I try to parse a certain attribute of an XML/HTML document. Here is a reproducible example containing 2 items with 2 titles (note the French accents):

library(XML)
doc <- htmlParse('<note>
              <item title="é">1</item>
              <item title="ï">3</item>
          </note>',asText=TRUE,encoding='UTF-8')

Now, using xpathApply, I can read my items like this. Note that the accented characters are displayed correctly here.

xpathApply(doc,'//item')

[[1]]
<item title="é">1</item> 

[[2]]
<item title="ï">3</item> 

But when I try to read the title attribute, I get this:

xpathApply(doc,'//item',xmlGetAttr,'title')
[[1]]
[1] "é"

[[2]]
[1] "ï"

I tried other XPath variants, such as:

  xpathApply(doc,'//item/@title') 
  xmlAttrs(xpathApply(doc,'//item')[[1]])

But these don't work either. Any help, please?


1 Answer


It's not pretty, and I can't test it on this Linux machine, but try:

  xpathApply(doc,'//item',
         function(x) iconv(xmlAttrs(x,'title'), "UTF-8", "UTF-8"))
[[1]]
title 
  "é" 

[[2]]
title 
  "ï" 

xmlAttrs calls RS_XML_xmlNodeAttributes; examining that code, there appears to be no facility for handling encoding. xmlValue calls R_xmlNodeValue, which has had encoding handling added. Looking at ?xmlValue, we have "encoding: experimental functionality and parameter related to encoding." Maybe encoding support for attributes will be added at a later date.
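To illustrate what the iconv round trip is probably doing, here is a minimal base-R sketch that does not use the XML package at all. It assumes the attribute value reaches R as UTF-8 bytes without an encoding mark, which is a guess based on the code paths above and has not been verified. The names raw_title, fixed and fixed_iconv are made up for the illustration.

```r
# UTF-8 bytes for "é" (0xC3 0xA9), built without any encoding mark.
# Assumption: this mimics how the attribute value arrives from xmlAttrs.
raw_title <- rawToChar(as.raw(c(0xc3, 0xa9)))
Encoding(raw_title)   # "unknown"

# Option 1: declare the bytes to be UTF-8; no bytes are changed.
fixed <- raw_title
Encoding(fixed) <- "UTF-8"

# Option 2: the iconv round trip used in the answer; with the default
# mark = TRUE, a result converted to UTF-8 is marked as UTF-8.
fixed_iconv <- iconv(raw_title, "UTF-8", "UTF-8")

# Once marked, the string decodes to the single code point U+00E9.
utf8ToInt(fixed)
```

If that assumption holds, setting Encoding(x) <- "UTF-8" on the character vector returned by xmlAttrs might work as an alternative to iconv, but I have not tried it against the XML package.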

Answered 2013-05-15T10:54:38.990