xml - How do I modify the top-level XML node in R?

Question

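I want to add an attribute to the topmost node of an XML file and then save the file. I have tried every combination of XPath and subsetting I can think of, but I can't get it to work. Using a simple example: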

xml_string = c(
 '<?xml version="1.0" encoding="UTF-8"?>',
 '<retrieval-response status = "found">',
      '<coredata>',
           '<id type = "author" >12345</id>',
      '</coredata>',
      '<author>',
           '<first>John</first>',
           '<last>Doe</last>',
      '</author>',
 '</retrieval-response>')

# parse xml content (xmlParse comes from the XML package)
library(XML)
xml = xmlParse(xml_string)

When I try

xmlAttrs(xml["/retrieval-response"][[1]]) <- c(id = 12345)

I get an error:

object of type 'externalptr' is not subsettable

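However, the attribute is inserted, so I'm not sure what I'm doing wrong.

(More background: this is a simplified version of data from the Scopus API. I am combining thousands of similarly structured XML files, but the id sits in the "coredata" node, which is a sibling of the "author" node that holds all the data. As a result, when I use SAS to compile the combined XML document into a dataset, there is no link between the id and the data. I am hoping that adding the id to the top of the hierarchy will cause it to propagate down to all the other levels.)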

score 2 · Accepted Answer

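To migrate XML data into the two-dimensional rows and columns of datasets and data frames, all nesting must be removed so that only a parent level and one child level are iterated. XSLT, a special-purpose declarative language that can restructure XML documents to any nuanced need, is a convenient way to reshape XML data for its end use.

Given your example XML, below is a working XSLT whose output XML imports successfully into SAS. Have the SAS code loop to restructure all of the thousands of XML files.

XSLT (save as .xsl or .xslt)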

 <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
       xmlns:ait="http://www.elsevier.com/xml/ani/ait"
       xmlns:ce="http://www.elsevier.com/xml/ani/common"
       xmlns:cto="http://www.elsevier.com/xml/cto/dtd"
       xmlns:dc="http://purl.org/dc/elements/1.1/"
       xmlns:ns1="http://webservices.elsevier.com/schemas/search/fast/types/v4"
       xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/"
       xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd"
       xmlns:xoe="http://www.elsevier.com/xml/xoe/dtd"
       exclude-result-prefixes="ait ce cto dc ns1 prism xocs xoe">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />

 <xsl:template match="author-retrieval-response">
  <xsl:variable select="substring-after(coredata/dc:identifier, ':')" name="authorid"/>
  <root>
      <coredata>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="coredata/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="concat(.,@href)"/>
          </xsl:element>
        </xsl:for-each>
      </coredata>

      <subjectAreas>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="subject-areas/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </subjectAreas>

      <authorname>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/preferred-name/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </authorname>

      <classifications>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/classificationgroup/classifications/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </classifications>

      <journals>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/journal-history/journal/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </journals>

      <ipdoc>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/*[not(local-name()='address')]">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </ipdoc>

      <address>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/address/*">          
          <xsl:element name="{local-name()}">      
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </address>  
  </root>
 </xsl:template>

</xsl:transform>

SAS (using the script above)

proc xsl 
    in="C:\Path\To\Original.xml"
    out="C:\Path\To\Output.xml"
    xsl="C:\Path\To\XSLT.xsl";
run;

** STORING XML CONTENT;
libname temp xml 'C:\Path\To\Output.xml'; 

** APPEND CONTENT TO SAS DATASETS;
data Work.Coredata; 
    retain authorid;
    set temp.Coredata;  ** NAME OF PARENT NODE IN XML;
run;

data Work.SubjectAreas; 
    retain authorid;
    set temp.SubjectAreas;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Authorname;   
    retain authorid;
    set temp.Authorname;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Classifications;
    retain authorid;
    set temp.Classifications;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Journals; 
    retain authorid;
    set temp.Journals;  ** NAME OF PARENT NODE IN XML;
run;

data Work.Ipdoc;    
    retain authorid;
    set temp.Ipdoc;  ** NAME OF PARENT NODE IN XML;
run;

XML OUTPUT (imported as the Authorsdata dataset with one row and 40 variables)

<?xml version="1.0" encoding="UTF-8"?>
<root>
   <coredata>
      <authorid>1234567</authorid>
      <url>http://api.elsevier.com/content/author/author_id/1234567</url>
      <identifier>AUTHOR_ID:1234567</identifier>
      <eid>9-s2.0-1234567</eid>
      <document-count>3</document-count>
      <cited-by-count>95</cited-by-count>
      <citation-count>97</citation-count>
      <link>http://api.elsevier.com/content/search/scopus?query=refauid%1234567%29</link>
      <link>http://www.scopus.com/authid/detail.url?partnerID=HzOxMe3b&amp;authorId=1234567&amp;origin=inward</link>
      <link>http://api.elsevier.com/content/author/author_id/1234567</link>
      <link>http://api.elsevier.com/content/search/scopus?query=au-id%281234567%29</link>
   </coredata>
   <subjectAreas>
      <authorid>1234567</authorid>
      <subject-area>Human-Computer Interaction</subject-area>
      <subject-area>Control and Systems Engineering</subject-area>
      <subject-area>Software</subject-area>
      <subject-area>Computer Vision and Pattern Recognition</subject-area>
      <subject-area>Artificial Intelligence</subject-area>
   </subjectAreas>
   <authorname>
      <authorid>1234567</authorid>
      <initials>A.</initials>
      <indexed-name>John A.</indexed-name>
      <surname>John</surname>
      <given-name>Doe</given-name>
   </authorname>
   <classifications>
      <authorid>1234567</authorid>
      <classification>1709</classification>
      <classification>2207</classification>
      <classification>1712</classification>
      <classification>1707</classification>
      <classification>1702</classification>
   </classifications>
   <journals>
      <authorid>1234567</authorid>
      <sourcetitle>Very Prestigious Journal</sourcetitle>
      <sourcetitle-abbrev>V PRES JOU Autom</sourcetitle-abbrev>
      <issn>10504729</issn>
      <sourcetitle>2005 Another Prestigious Journal</sourcetitle>
      <sourcetitle-abbrev>An. Prest. Jou. </sourcetitle-abbrev>
   </journals>
   <ipdoc>
      <authorid>1234567</authorid>
      <afnameid>Prestigious University#1111111</afnameid>
      <afdispname>Prestigious University University</afdispname>
      <preferred-name>Prestigious University University</preferred-name>
      <sort-name>Prestigious University</sort-name>
      <org-domain>pu.edu</org-domain>
      <org-URL>http://www.pu.edu/index.shtml</org-URL>
   </ipdoc>
   <address>
      <authorid>1234567</authorid>
      <address-part>1234 Prestigious Lane</address-part>
      <city>City</city>
      <state>ST</state>
      <postal-code>12345</postal-code>
      <country>United States</country>
   </address>
</root>

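Alternatives

Since no comprehensive XSLT library exists for R, the parsing has to be done directly in R. However, R can call the XSLT processors of other executables (e.g. Python, Saxon, VBA) through the command line, the RDCOMClient package, and other interfaces.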
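
If an XSLT 1.0 processor inside R is acceptable, the CRAN package xslt (libxslt bindings built on xml2) may be an option. This goes beyond the original answer and is only a sketch: it assumes both packages are installed, and the stylesheet is a tiny stand-in for the one above, applied to the question's simplified XML.

```r
library(xml2)
library(xslt)

# Simplified record from the question (not real Scopus output).
doc <- read_xml('<retrieval-response status="found">
  <coredata><id type="author">12345</id></coredata>
  <author><first>John</first><last>Doe</last></author>
</retrieval-response>')

# Stand-in stylesheet: copy the author fields and prepend the id.
style <- read_xml('<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/retrieval-response">
    <root>
      <author>
        <authorid><xsl:value-of select="coredata/id"/></authorid>
        <xsl:copy-of select="author/*"/>
      </author>
    </root>
  </xsl:template>
</xsl:transform>')

# Apply the stylesheet; the result is another xml2 document.
result <- xml_xslt(doc, style)
cat(as.character(result))
```

Stylesheets that need XSLT 2.0 features would still require an external processor such as Saxon.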

R can also extract the XML data directly with xmlToDataFrame() and pick up the authorid with xpathSApply() (which evaluates an XPath expression):

library(XML)

# parse one author record (same path as in the SAS example above)
doc <- xmlParse("C:/Path/To/Original.xml")

# author id: the dc:identifier value with its "AUTHOR_ID:" prefix removed
authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                 xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

# one data frame per parent node, each tagged with the author id
coredata <- xmlToDataFrame(nodes = getNodeSet(doc, '//coredata'))
coredata$authorid <- authorid

subjectareas <- xmlToDataFrame(nodes = getNodeSet(doc, "//subject-areas"))
subjectareas$authorid <- authorid

authorname <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/preferred-name'))
authorname$authorid <- authorid

classifications <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/classificationgroup/classifications'))
classifications$authorid <- authorid

journal <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/journal-history/journal'))
journal$authorid <- authorid

ipdoc <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc'))
ipdoc$authorid <- authorid

address <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc/address'))
address$authorid <- authorid
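
Because the question involves thousands of files, the same extraction can be wrapped in a function and the per-file data frames stacked. The sketch below is an illustration rather than part of the original answer: extract_with_id() and make_record() are hypothetical helpers, and the two records are made-up stand-ins for Scopus files.

```r
library(XML)

# Hypothetical helper: flatten the nodes matched by `path` and tag
# every resulting row with the record's author id.
extract_with_id <- function(doc, path) {
  id_value <- xpathSApply(doc, "//coredata/dc:identifier", xmlValue)[[1]]
  df <- xmlToDataFrame(nodes = getNodeSet(doc, path), stringsAsFactors = FALSE)
  df$authorid <- sub("AUTHOR_ID:", "", id_value, fixed = TRUE)
  df
}

# Hypothetical generator for a minimal, made-up author record.
make_record <- function(id, surname) {
  sprintf('<author-retrieval-response xmlns:dc="http://purl.org/dc/elements/1.1/">
  <coredata><dc:identifier>AUTHOR_ID:%s</dc:identifier></coredata>
  <author-profile>
    <preferred-name><surname>%s</surname><initials>J.</initials></preferred-name>
  </author-profile>
</author-retrieval-response>', id, surname)
}
records <- c(make_record("1234567", "Doe"), make_record("7654321", "Roe"))

# One data frame per record, then stack them into a single table.
per_record <- lapply(records, function(txt) {
  extract_with_id(xmlParse(txt, asText = TRUE), "//author-profile/preferred-name")
})
authorname <- do.call(rbind, per_record)
print(authorname)
```

With real files, `records` would be replaced by file paths from list.files() and the asText argument dropped.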

score 1 · Accepted Answer

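Edit: After trying the approach of editing the top node (see the old answer below), I realized that editing the top node does not solve my problem, because the SAS XML Mapper did not retain all the ids.

I tried a new approach that adds the author id to each child node, and it worked well. I also learned that you can select multiple nodes with XPath by putting the expressions into a vector, like this: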

c("//coredata",
  "//affiliation-current",
  "//affiliation-history",
  "//subject-areas",
  "//author-profile")
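
The same selection can also be written as a single XPath union with the | operator, which is standard XPath 1.0 and does not rely on how the XML package treats a vector of paths. A minimal sketch on the question's simplified XML (the node names and the auth_id attribute are illustrative only):

```r
library(XML)

doc <- xmlParse('<retrieval-response status="found">
  <coredata><id type="author">12345</id></coredata>
  <author><first>John</first><last>Doe</last></author>
</retrieval-response>', asText = TRUE)

# Explicit XPath union: every coredata node and every author node.
targets <- "//coredata | //author"

# The id to attach, read from the coredata node.
id_value <- xpathSApply(doc, "//coredata/id", xmlValue)[[1]]

# Extra arguments to xpathApply() are passed on to addAttributes().
invisible(xpathApply(doc, targets, addAttributes, auth_id = id_value))

# Read the attribute back from each matched node.
tagged <- xpathSApply(doc, targets, xmlGetAttr, "auth_id")
print(tagged)
```

Note that a path without a leading // is evaluated relative to the document node, so it only matches an element of that name at the very top of the document.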

So the final program I used was:

library(XML)

files <- list.files()

for (i in seq_along(files)) {
     author_record <- xmlParse(files[i])

     # add the author id as an attribute on every selected node
     xpathApply(
          author_record, c(
               "//coredata",
               "//affiliation-current",
               "//affiliation-history",
               "//subject-areas",
               "//author-profile"
          ),
          addAttributes,
          auth_id = gsub("AUTHOR_ID:", "", xmlValue(author_record[["//dc:identifier"]]))
     )

     # overwrite the original file with the modified document
     saveXML(author_record, file = files[i])
}

Old answer: After much experimentation I found a fairly simple solution to my problem.

An attribute can be added to the top node simply by using

addAttributes(xmlRoot(xmlfile), attribute = "attributeValue")
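
A small self-contained check of this approach on the question's sample XML follows. As for the error in the question, my reading (an interpretation, not something confirmed in this thread) is that the nested replacement form first runs the attribute assignment on the node, which changes the internal document in place, and then tries to assign the result back into the document with `[<-`. The document is an external pointer, so that last step fails even though the attribute is already there. Binding the root node to a name avoids the nested assignment.

```r
library(XML)

xml_string <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<retrieval-response status="found">',
  '<coredata><id type="author">12345</id></coredata>',
  '<author><first>John</first><last>Doe</last></author>',
  '</retrieval-response>')
doc <- xmlParse(paste(xml_string, collapse = "\n"), asText = TRUE)

# Internal nodes are references, so changing `root` also changes `doc`.
root <- xmlRoot(doc)
invisible(addAttributes(root, id = "12345"))

# Both the original and the new attribute are now on the root node.
print(xmlAttrs(root))

# Write the modified document to a temporary file and read it back.
out_file <- tempfile(fileext = ".xml")
saveXML(doc, file = out_file)
reread <- xmlParse(out_file)
```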

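For my specific case, the most straightforward solution is a simple loop:

library(XML)

setwd("C:/directory/with/individual/xmlfiles")

files <- list.files()

for (i in seq_along(files)) {

 author_record <- xmlParse(files[i])

 # strip the "AUTHOR_ID:" prefix and store the id on the root node
 addAttributes(node = xmlRoot(author_record),
               id   = gsub(pattern = "AUTHOR_ID:",
                           replacement = "",
                           x = xmlValue(author_record[["//dc:identifier"]])
               )
 )

 # overwrite the original file with the modified document
 saveXML(author_record, file = files[i])
}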

I'm sure there are better ways. Clearly I need to learn XSLT; it is a very powerful approach!
