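To migrate XML data into the two-dimensional rows and columns of a dataset or data frame, all nesting must be removed so that only a parent and one level of children remain to iterate over. This is where XSLT helps: it is a special-purpose declarative programming language that can restructure XML documents to any nuanced need, including the flat layout required by the end use.
Given your example XML, below is a working XSLT whose output XML imports successfully into SAS. Have your SAS code loop over the files to restructure all of the thousands of XML files.
XSLT (save as a .xsl or .xslt file)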
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
               xmlns:ait="http://www.elsevier.com/xml/ani/ait"
               xmlns:ce="http://www.elsevier.com/xml/ani/common"
               xmlns:cto="http://www.elsevier.com/xml/cto/dtd"
               xmlns:dc="http://purl.org/dc/elements/1.1/"
               xmlns:ns1="http://webservices.elsevier.com/schemas/search/fast/types/v4"
               xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/"
               xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd"
               xmlns:xoe="http://www.elsevier.com/xml/xoe/dtd"
               exclude-result-prefixes="ait ce cto dc ns1 prism xocs xoe">
  <xsl:output version="1.0" encoding="UTF-8" indent="yes" />

  <xsl:template match="author-retrieval-response">
    <!-- Author id: the text after the colon in dc:identifier -->
    <xsl:variable select="substring-after(coredata/dc:identifier, ':')" name="authorid"/>
    <root>
      <coredata>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="coredata/*">
          <xsl:element name="{local-name()}">
            <xsl:value-of select="concat(.,@href)"/>
          </xsl:element>
        </xsl:for-each>
      </coredata>
      <subjectAreas>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="subject-areas/*">
          <xsl:element name="{local-name()}">
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </subjectAreas>
      <authorname>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/preferred-name/*">
          <xsl:element name="{local-name()}">
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </authorname>
      <classifications>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/classificationgroup/classifications/*">
          <xsl:element name="{local-name()}">
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </classifications>
      <journals>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/journal-history/journal/*">
          <xsl:element name="{local-name()}">
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </journals>
      <ipdoc>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <!-- address is skipped here and emitted as its own group below -->
        <xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/*[not(local-name()='address')]">
          <xsl:element name="{local-name()}">
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </ipdoc>
      <address>
        <authorid><xsl:value-of select="$authorid"/></authorid>
        <xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/address/*">
          <xsl:element name="{local-name()}">
            <xsl:value-of select="."/>
          </xsl:element>
        </xsl:for-each>
      </address>
    </root>
  </xsl:template>
</xsl:transform>
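To check the flattening logic without an XSLT processor, the same idea can be sketched in Python with only the standard library. This is an illustration, not a replacement for the stylesheet: the sample input is hypothetical (its element names are inferred from the XPath expressions above, not copied from the original XML), the helper names `flatten` and `local_name` are made up for this sketch, only three of the seven groups are covered, and `href` attributes are ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; structure inferred from the stylesheet's XPath
# expressions, not taken from the original file.
SAMPLE = """\
<author-retrieval-response xmlns:dc="http://purl.org/dc/elements/1.1/">
  <coredata>
    <dc:identifier>AUTHOR_ID:1234567</dc:identifier>
    <eid>9-s2.0-1234567</eid>
    <document-count>3</document-count>
  </coredata>
  <subject-areas>
    <subject-area>Software</subject-area>
    <subject-area>Artificial Intelligence</subject-area>
  </subject-areas>
  <author-profile>
    <preferred-name>
      <initials>A.</initials>
      <surname>John</surname>
    </preferred-name>
  </author-profile>
</author-retrieval-response>
"""

NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Output group name -> path of the parent whose children become columns.
GROUPS = {
    "coredata": "coredata",
    "subjectAreas": "subject-areas",
    "authorname": "author-profile/preferred-name",
}


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def flatten(xml_text):
    """Return {group: [(column, value), ...]} with authorid first in each group."""
    root = ET.fromstring(xml_text)
    identifier = root.findtext("coredata/dc:identifier", default="", namespaces=NS)
    authorid = identifier.partition(":")[2]  # same role as substring-after(., ':')
    tables = {}
    for group, path in GROUPS.items():
        parent = root.find(path)
        children = list(parent) if parent is not None else []
        rows = [("authorid", authorid)]
        # Only the direct text of each child is kept (no nested text, no @href).
        rows += [(local_name(c.tag), (c.text or "").strip()) for c in children]
        tables[group] = rows
    return tables


if __name__ == "__main__":
    for group, rows in flatten(SAMPLE).items():
        print(group, rows)
```

Each group ends up as one parent with a single level of children, with `authorid` repeated in every group so the resulting tables can be joined later.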
SAS (using the above script)
proc xsl
in="C:\Path\To\Original.xml"
out="C:\Path\To\Output.xml"
xsl="C:\Path\To\XSLT.xsl";
run;
** STORING XML CONTENT;
libname temp xml 'C:\Path\To\Output.xml';
** APPEND CONTENT TO SAS DATASETS;
data Work.Coredata;
retain authorid;
set temp.Coredata; ** NAME OF PARENT NODE IN XML;
run;
data Work.SubjectAreas;
retain authorid;
set temp.SubjectAreas; ** NAME OF PARENT NODE IN XML;
run;
data Work.Authorname;
retain authorid;
set temp.Authorname; ** NAME OF PARENT NODE IN XML;
run;
data Work.Classifications;
retain authorid;
set temp.Classifications; ** NAME OF PARENT NODE IN XML;
run;
data Work.Journals;
retain authorid;
set temp.Journals; ** NAME OF PARENT NODE IN XML;
run;
data Work.Ipdoc;
retain authorid;
set temp.Ipdoc; ** NAME OF PARENT NODE IN XML;
run;
data Work.Address;
retain authorid;
set temp.Address; ** NAME OF PARENT NODE IN XML;
run;
XML OUTPUT (imports as the Authorsdata dataset with one row and 40 variables)
<?xml version="1.0" encoding="UTF-8"?>
<root>
<coredata>
<authorid>1234567</authorid>
<url>http://api.elsevier.com/content/author/author_id/1234567</url>
<identifier>AUTHOR_ID:1234567</identifier>
<eid>9-s2.0-1234567</eid>
<document-count>3</document-count>
<cited-by-count>95</cited-by-count>
<citation-count>97</citation-count>
<link>http://api.elsevier.com/content/search/scopus?query=refauid%1234567%29</link>
<link>http://www.scopus.com/authid/detail.url?partnerID=HzOxMe3b&amp;authorId=1234567&amp;origin=inward</link>
<link>http://api.elsevier.com/content/author/author_id/1234567</link>
<link>http://api.elsevier.com/content/search/scopus?query=au-id%281234567%29</link>
</coredata>
<subjectAreas>
<authorid>1234567</authorid>
<subject-area>Human-Computer Interaction</subject-area>
<subject-area>Control and Systems Engineering</subject-area>
<subject-area>Software</subject-area>
<subject-area>Computer Vision and Pattern Recognition</subject-area>
<subject-area>Artificial Intelligence</subject-area>
</subjectAreas>
<authorname>
<authorid>1234567</authorid>
<initials>A.</initials>
<indexed-name>John A.</indexed-name>
<surname>John</surname>
<given-name>Doe</given-name>
</authorname>
<classifications>
<authorid>1234567</authorid>
<classification>1709</classification>
<classification>2207</classification>
<classification>1712</classification>
<classification>1707</classification>
<classification>1702</classification>
</classifications>
<journals>
<authorid>1234567</authorid>
<sourcetitle>Very Prestigious Journal</sourcetitle>
<sourcetitle-abbrev>V PRES JOU Autom</sourcetitle-abbrev>
<issn>10504729</issn>
<sourcetitle>2005 Another Prestigious Journal</sourcetitle>
<sourcetitle-abbrev>An. Prest. Jou. </sourcetitle-abbrev>
</journals>
<ipdoc>
<authorid>1234567</authorid>
<afnameid>Prestigious University#1111111</afnameid>
<afdispname>Prestigious University University</afdispname>
<preferred-name>Prestigious University University</preferred-name>
<sort-name>Prestigious University</sort-name>
<org-domain>pu.edu</org-domain>
<org-URL>http://www.pu.edu/index.shtml</org-URL>
</ipdoc>
<address>
<authorid>1234567</authorid>
<address-part>1234 Prestigious Lane</address-part>
<city>City</city>
<state>ST</state>
<postal-code>12345</postal-code>
<country>United States</country>
</address>
</root>
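To inspect the flattened output outside SAS, each child of `<root>` can be read as one record. The sketch below uses Python's standard library on an abbreviated copy of the output above. Repeated element names such as `subject-area` get a numeric suffix so that every value keeps its own column; that suffix scheme and the function name `to_records` are assumptions of this sketch, not something the stylesheet or SAS is claimed to do.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Abbreviated copy of the flattened output shown above.
FLAT = """\
<root>
  <subjectAreas>
    <authorid>1234567</authorid>
    <subject-area>Human-Computer Interaction</subject-area>
    <subject-area>Software</subject-area>
  </subjectAreas>
  <address>
    <authorid>1234567</authorid>
    <city>City</city>
    <country>United States</country>
  </address>
</root>
"""


def to_records(xml_text):
    """Map each child of <root> to one dict (one row per group)."""
    records = {}
    for group in ET.fromstring(xml_text):
        totals = Counter(child.tag for child in group)
        seen = Counter()
        row = {}
        for child in group:
            seen[child.tag] += 1
            # Suffix only names that occur more than once in this group.
            if totals[child.tag] > 1:
                key = f"{child.tag}_{seen[child.tag]}"
            else:
                key = child.tag
            row[key] = (child.text or "").strip()
        records[group.tag] = row
    return records


if __name__ == "__main__":
    for name, row in to_records(FLAT).items():
        print(name, row)
```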
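Alternative
Since there is no comprehensive XSLT library for R, the XML has to be parsed directly in R. R can, however, call the XSLT processors of other executables (e.g., Python, Saxon, VBA) through the command line, the RDCOMClient package, and other interfaces.
Nonetheless, R's XML package can extract the data with xmlToDataFrame() and xpathSApply() (the latter takes an XPath expression and is used here to pull the authorid):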
library(XML)

# Parse the original (untransformed) XML; path assumed to match the SAS step above
doc <- xmlParse("C:/Path/To/Original.xml")

# Author id: dc:identifier with the "AUTHOR_ID:" prefix removed (computed once, reused below)
authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                 xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

# One data frame per parent node, each tagged with the author id
coredata <- xmlToDataFrame(nodes = getNodeSet(doc, '//coredata'))
coredata$authorid <- authorid
subjectareas <- xmlToDataFrame(nodes = getNodeSet(doc, '//subject-areas'))
subjectareas$authorid <- authorid
authorname <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/preferred-name'))
authorname$authorid <- authorid
classifications <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/classificationgroup/classifications'))
classifications$authorid <- authorid
journal <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/journal-history/journal'))
journal$authorid <- authorid
ipdoc <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc'))
ipdoc$authorid <- authorid
address <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc/address'))
address$authorid <- authorid