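I want to extract information from an XML file and convert it to a data frame.

The information is stored in nested nodes, both as XML text and as XML attributes.

An example structure: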

<xmlnode node-id = "Text about xmlnode 1">
    <xmlsubnode subnode-id = "123">
        <xmlsubsubnode>
            I want to extract this text
        </xmlsubsubnode>
        <xmlsubsubnode>
            I want to extract this text
        </xmlsubsubnode>    
        <xmlsubsubnode>
            I want to extract this text
        </xmlsubsubnode>    
        <xmlsubsubnode>
            I want to extract this text
        </xmlsubsubnode>    
    </xmlsubnode>
    <xmlsubnode subnode-id = "456">
        <xmlsubsubnode>
            I want to extract this text
        </xmlsubsubnode>
        <xmlsubsubnode>
            I want to extract this text
        </xmlsubsubnode>    
        <xmlsubsubnode>
            I want to extract this text
        </xmlsubsubnode>    
        <xmlsubsubnode>
            I want to extract this text
        </xmlsubsubnode>    
    </xmlsubnode>
</xmlnode>
<xmlnode node-id = "Text about xmlnode 2">
    <xmlsubnode subnode-id = "123">
        <xmlsubsubnode>
            I want to extract this text
        </xmlsubsubnode>
        <xmlsubsubnode>
            I want to extract this text
        </xmlsubsubnode>    
        <xmlsubsubnode>
            I want to extract this text
        </xmlsubsubnode>    
        <xmlsubsubnode>
            I want to extract this text
        </xmlsubsubnode>    
    </xmlsubnode>
    <xmlsubnode subnode-id = "456">
        <xmlsubsubnode>
            I want to extract this text
        </xmlsubsubnode>
        <xmlsubsubnode>
            I want to extract this text
        </xmlsubsubnode>    
        <xmlsubsubnode>
            I want to extract this text
        </xmlsubsubnode>    
        <xmlsubsubnode>
            I want to extract this text
        </xmlsubsubnode>    
    </xmlsubnode>
</xmlnode>

I want to get this information:

* node-id (attribute)
* subnode-id (attribute)
* text in `xmlsubsubnode` (text)

I need a long-format data frame like this:

node-id               subnode-id  text
Text about xmlnode 1  123         I want to extract this text
Text about xmlnode 1  123         I want to extract this text
Text about xmlnode 1  123         I want to extract this text
Text about xmlnode 1  123         I want to extract this text
Text about xmlnode 1  456         I want to extract this text
Text about xmlnode 1  456         I want to extract this text
Text about xmlnode 1  456         I want to extract this text
Text about xmlnode 1  456         I want to extract this text
Text about xmlnode 2  123         I want to extract this text
Text about xmlnode 2  123         I want to extract this text
Text about xmlnode 2  123         I want to extract this text
Text about xmlnode 2  123         I want to extract this text
Text about xmlnode 2  456         I want to extract this text
Text about xmlnode 2  456         I want to extract this text
Text about xmlnode 2  456         I want to extract this text
Text about xmlnode 2  456         I want to extract this text

I tried to follow Jenny Bryan's approach, "How to tame XML with nested data frames and purrr", but it only works for the first level.

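library(dplyr)
library(purrr)
library(tidyr)
library(xml2)

xml <- read_xml("input/example.xml")
rows <- xml %>% xml_find_all("//xmlnode")
# One row per <xmlnode>; unclass() stores the nodes as a plain list column
# (recent tibble versions may reject an xml_nodeset column).
rows_df <- tibble(row = seq_along(rows), nodeset = unclass(rows))
rows_df %>%
  mutate(node_id = nodeset %>% map(~ xml_attr(., "node-id"))) %>%
  select(row, node_id) %>%
  unnest(cols = node_id)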

Do you have an idea how to get this information with purrr?

1 Answer

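One approach that does not require unnesting or adding rows to another data frame: create a data frame with one row per `xmlsubsubnode`, then use purrr with xml2 to select the `xmlsubnode` parent and the `xmlnode` ancestor of each row and extract their attribute values.

Working example: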

library(dplyr)
library(xml2)
library(purrr)
library(tidyr)

xml <- read_xml("input/example.xml")
rows <- xml %>% xml_find_all("//xmlsubsubnode")
# One row per <xmlsubsubnode>; map_chr() yields character columns rather than list columns.
rows_df <- tibble(node = unclass(rows)) %>%
  mutate(
    # attribute of the enclosing <xmlnode> ancestor
    node_id = map_chr(node, ~ xml_attr(xml_find_first(.x, "ancestor::xmlnode"), "node-id")),
    # attribute of the direct <xmlsubnode> parent
    subnode_id = map_chr(node, ~ xml_attr(xml_parent(.x), "subnode-id")),
    # trimmed text content of the node itself
    text = map_chr(node, ~ xml_text(.x, trim = TRUE))
  ) %>%
  select(-node)
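
The same idea (start from the innermost nodes and look upward) can probably also be written without purrr, because the xml2 accessors are vectorised over node sets. The following is only a sketch under some assumptions: the sample is inlined as a string, the `<root>` wrapper element is hypothetical (added so the string is well-formed XML with a single root), and the names `doc`, `leaves` and `result` are made up for illustration.

```r
library(xml2)

# Hypothetical inline sample; <root> is an assumed wrapper element so the
# document has a single root and parses as well-formed XML.
doc <- read_xml('<root>
  <xmlnode node-id="Text about xmlnode 1">
    <xmlsubnode subnode-id="123">
      <xmlsubsubnode> first text </xmlsubsubnode>
      <xmlsubsubnode> second text </xmlsubsubnode>
    </xmlsubnode>
    <xmlsubnode subnode-id="456">
      <xmlsubsubnode> third text </xmlsubsubnode>
    </xmlsubnode>
  </xmlnode>
  <xmlnode node-id="Text about xmlnode 2">
    <xmlsubnode subnode-id="123">
      <xmlsubsubnode> fourth text </xmlsubsubnode>
    </xmlsubnode>
  </xmlnode>
</root>')

# Start from the innermost nodes: one output row per <xmlsubsubnode>.
leaves <- xml_find_all(doc, "//xmlsubsubnode")

# xml_find_first() on a node set returns one result per input node, so every
# column below has the same length as `leaves`.
result <- data.frame(
  node_id    = xml_attr(xml_find_first(leaves, "ancestor::xmlnode"), "node-id"),
  subnode_id = xml_attr(xml_find_first(leaves, "parent::xmlsubnode"), "subnode-id"),
  text       = xml_text(leaves, trim = TRUE),
  stringsAsFactors = FALSE
)

print(result)
```

The parent lookup uses `xml_find_first(leaves, "parent::xmlsubnode")` rather than `xml_parent(leaves)` on purpose: the former keeps one entry per leaf, which is what a long-format table needs.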
answered 2018-03-16T21:08:47.987