r - Parse XML based on attributes and text values of related nodes

Question

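I have used the XML package to parse both HTML and XML before, and I have a rudimentary grasp of XPath. However, I have been asked to consider XML data where the important bits are determined by a combination of the text and attributes of the elements themselves and the attributes of related nodes. I have never done that. For example:

[updated example, slightly more extensive]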

<Catalogue>
<Bookstore id="ID910705541">
  <location>foo bar</location>
  <books>
    <book category="A" id="1">
        <title>Alpha</title>
        <author ref="1">Matthew</author>
        <author>Mark</author>
        <author>Luke</author>
        <author ref="2">John</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Beta</title>
        <author ref="1">Huey</author>
        <author>Duey</author>
        <author>Louie</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>Gamma</title>
        <author ref="1">Tweedle Dee</author>
        <author ref="2">Tweedle Dum</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
  </Bookstore> 
<Bookstore id="ID910700051">
  <location>foo</location>
  <books>
    <book category="A" id="1">
        <title>Happy</title>
        <author>Dopey</author>
        <author>Bashful</author>
        <author>Doc</author>
        <author ref="1">Grumpy</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Ni</title>
        <author ref="1">John</author>
        <author ref="2">Paul</author>
        <author ref="3">George</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>San</title>
        <author ref="1">Ringo</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
 </Bookstore> 
<Bookstore id="ID910715717">
    <location>bar</location>
  <books>
    <book category="A" id="1">
        <title>Un</title>
        <author ref="1">Winkin</author>
        <author>Blinkin</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Deux</title>
        <author>Nod</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>Trois</title>
        <author>Manny</author>
        <author>Moe</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
 </Bookstore> 
</Catalogue>

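I would like to extract all author names where: 1) the location element has a text value that contains "NY"; 2) the author element does NOT carry a "ref" attribute, that is, ref is absent from the author tag.

I will ultimately need to concatenate the extracted authors within a given bookstore, so that my resulting data frame has one row per bookstore. I would like to keep the bookstore id as an additional field in the data frame so that I can uniquely reference each store. Since only the first bookstore is in NY, the result for this simple example would look something like:

1 Jane Smith John Doe Karl Pearson William Gosset

If another bookstore contained "NY" in its location, it would make up a second row, and so forth.

Am I asking too much of R to parse under these convoluted conditions?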

score 3 · Accepted Answer

require(XML)

xdata <- xmlParse(apptext)
xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]')
#[[1]]
#<author>Jane Smith</author> 

#[[2]]
#<author>John Doe</author> 

#[[3]]
#<author>Karl Pearson</author> 

#[[4]]
#<author>William Gosset</author>

Breakdown:

Get all location nodes whose text contains "NY"

//*/location[text()[contains(.,"NY")]]

Get the books sibling of each of these nodes

/following-sibling::books

From these nodes, get all authors that have no ref attribute

/.//author[not(@ref)]
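
The three steps above can be run one at a time to watch the selection narrow. The following is a minimal sketch on a tiny made-up document (the `store` element, its ids and the author names are assumptions for illustration, not the question's data), and it uses the slightly shorter `contains(text(), "NY")` form of the predicate:

```r
library(XML)

# Tiny hypothetical document, only meant to make the three steps visible
doc <- xmlParse('<root>
  <store id="s1"><location>NY</location>
    <books><book><author ref="1">A</author><author>B</author></book></books>
  </store>
  <store id="s2"><location>LA</location>
    <books><book><author>C</author></book></books>
  </store>
</root>', asText = TRUE)

# Step 1: location nodes whose text contains "NY"
loc <- getNodeSet(doc, '//location[contains(text(), "NY")]')

# Step 2: the <books> sibling that follows each matching location
books <- getNodeSet(doc,
  '//location[contains(text(), "NY")]/following-sibling::books')

# Step 3: authors below those books that carry no ref attribute
authors <- xpathSApply(doc,
  '//location[contains(text(), "NY")]/following-sibling::books//author[not(@ref)]',
  xmlValue)

length(loc)    # number of matching locations
length(books)  # number of matching <books> elements
authors        # text of the remaining authors
```

Only the first store survives step 1, so author "C" never enters the result, and author "A" is dropped in step 3 because it has a ref attribute.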

If you need the text, use xmlValue:

> xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]',xmlValue)
[1] "Jane Smith"     "John Doe"       "Karl Pearson"   "William Gosset"

Update:

child.nodes <- xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]')

ans.func<-function(x){
    xpathSApply(x,'.//ancestor::bookstore[@id]/@id')
}

sapply(child.nodes,ans.func)
# id  id  id  id 
#"1" "1" "1" "1"

Update 2:

Using your changed data

xdata <- '<Catalogue>
<Bookstore id="ID910705541">
  <location>foo bar</location>
  <books>
    <book category="A" id="1">
        <title>Alpha</title>
        <author ref="1">Matthew</author>
        <author>Mark</author>
        <author>Luke</author>
        <author ref="2">John</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Beta</title>
        <author ref="1">Huey</author>
        <author>Duey</author>
        <author>Louie</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>Gamma</title>
        <author ref="1">Tweedle Dee</author>
        <author ref="2">Tweedle Dum</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
  </Bookstore> 
<Bookstore id="ID910700051">
  <location>foo</location>
  <books>
    <book category="A" id="1">
        <title>Happy</title>
        <author>Dopey</author>
        <author>Bashful</author>
        <author>Doc</author>
        <author ref="1">Grumpy</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Ni</title>
        <author ref="1">John</author>
        <author ref="2">Paul</author>
        <author ref="3">George</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>San</title>
        <author ref="1">Ringo</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
 </Bookstore> 
<Bookstore id="ID910715717">
    <location>bar</location>
  <books>
    <book category="A" id="1">
        <title>Un</title>
        <author ref="1">Winkin</author>
        <author>Blinkin</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Deux</title>
        <author>Nod</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>Trois</title>
        <author>Manny</author>
        <author>Moe</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
 </Bookstore> 
</Catalogue>'

Note that what was previously bookstore is now Bookstore. NY is gone, so I have used foo instead.
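
The rename matters because XPath element names are case-sensitive. A small sketch on a one-element, made-up document (not the catalogue itself) shows the difference:

```r
library(XML)

# Hypothetical one-store document, used only to show the case sensitivity
doc <- xmlParse('<Catalogue><Bookstore id="x"/></Catalogue>', asText = TRUE)

n_lower <- length(getNodeSet(doc, '//bookstore'))  # lower-case name
n_exact <- length(getNodeSet(doc, '//Bookstore'))  # name exactly as in the data

n_lower  # 0: the lower-case name matches nothing
n_exact  # 1: the exact-case name matches the element
```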

require(XML)
xdata <- xmlParse(xdata)
child.nodes <- getNodeSet(xdata,'//*/location[text()[contains(.,"foo")]]/following-sibling::books/.//author[not(@ref)]')

ans.func<-function(x){
  xpathSApply(x,'.//ancestor::Bookstore[@id]/@id')
}

sapply(child.nodes,ans.func)
#           id            id            id            id            id 
#"ID910705541" "ID910705541" "ID910705541" "ID910705541" "ID910700051" 
#           id            id 
#"ID910700051" "ID910700051"

xpathSApply(xdata,'//*/location[text()[contains(.,"foo")]]/following-sibling::books/.//author[not(@ref)]',xmlValue)
# [1] "Mark"    "Luke"    "Duey"    "Louie"   "Dopey"   "Bashful" "Doc"
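
The question also asks for one row per bookstore with the id kept alongside the concatenated authors. One possible way to get there, sketched here under assumptions rather than taken from the answer above, is to select the Bookstore nodes themselves and query each one relatively. The document below is an abbreviated, hypothetical copy of the catalogue, and `stores`, `rows` and `result` are illustrative names:

```r
library(XML)

# Abbreviated, hypothetical copy of the catalogue (two stores only)
txt <- '<Catalogue>
<Bookstore id="ID910705541">
  <location>foo bar</location>
  <books>
    <book><author ref="1">Matthew</author><author>Mark</author><author>Luke</author></book>
  </books>
</Bookstore>
<Bookstore id="ID910715717">
  <location>bar</location>
  <books>
    <book><author>Nod</author></book>
  </books>
</Bookstore>
</Catalogue>'
doc <- xmlParse(txt, asText = TRUE)

# Select the stores themselves, filtered on the text of their location child
stores <- getNodeSet(doc, '//Bookstore[contains(location, "foo")]')

# One row per store: its id plus the ref-less authors pasted together
rows <- lapply(stores, function(s) {
  authors <- xpathSApply(s, './/author[not(@ref)]', xmlValue)
  data.frame(id = xmlGetAttr(s, "id"),
             authors = paste(authors, collapse = " "),
             stringsAsFactors = FALSE)
})
result <- do.call(rbind, rows)
result
```

Filtering at the Bookstore level keeps the id and the authors together in a single pass, so there is no need to walk back up with ancestor:: for every author node.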
