r - R & xml2: Locate an element by a specific text value, store all child values in a data.frame

Question

I work with an XML report that is refreshed regularly, and I would like to automate the munging process using R and xml2.

<?xml version="1.0" ?>
<riDetailEnrolleeReport xmlns="http://vo.edge.fm.cms.hhs.gov">
    <includedFileHeader>
        <outboundFileIdentifier>f2e55625-e70e-4f9d-8278-fc5de7c04d47</outboundFileIdentifier>
        <cmsBatchIdentifier>RIP-2015-00096</cmsBatchIdentifier>
        <cmsJobIdentifier>16220</cmsJobIdentifier>
        <snapShotFileName>25032.BACKUP.D03152016T032051.dat</snapShotFileName>
        <snapShotFileHash>20d887c9a71fa920dbb91edc3d171eb64a784dd6</snapShotFileHash>
        <outboundFileGenerationDateTime>2016-03-15T15:20:54</outboundFileGenerationDateTime>
        <interfaceControlReleaseNumber>04.03.01</interfaceControlReleaseNumber>
        <edgeServerVersion>EDGEServer_14.09_01_b0186</edgeServerVersion>
        <edgeServerProcessIdentifier>8</edgeServerProcessIdentifier>
        <outboundFileTypeCode>RIDE</outboundFileTypeCode>
        <edgeServerIdentifier>2800273</edgeServerIdentifier>
        <issuerIdentifier>25032</issuerIdentifier>
    </includedFileHeader>
    <calendarYear>2015</calendarYear>
    <executionType>P</executionType>
    <includedInsuredMemberIdentifier>
        <insuredMemberIdentifier>ARS001</insuredMemberIdentifier>
        <memberMonths>12.13</memberMonths>
        <totalAllowedClaims>1000.00</totalAllowedClaims>
        <totalPaidClaims>100.00</totalPaidClaims>
        <moopAdjustedPaidClaims>100.00</moopAdjustedPaidClaims>
        <cSRMOOPAdjustment>0.00</cSRMOOPAdjustment>
        <estimatedRIPayment>0.00</estimatedRIPayment>
        <coinsurancePercentPayments>0.00</coinsurancePercentPayments>
        <includedPlanIdentifier>
            <planIdentifier>25032VA013000101</planIdentifier>
            <includedClaimIdentifier>
                <claimIdentifier>CADULT4SM00101</claimIdentifier>
                <claimPaidAmount>100.00</claimPaidAmount>
                <crossYearClaimIndicator>N</crossYearClaimIndicator>
            </includedClaimIdentifier>
        </includedPlanIdentifier>
    </includedInsuredMemberIdentifier>
    <includedInsuredMemberIdentifier>
        <insuredMemberIdentifier>ARS002</insuredMemberIdentifier>
        <memberMonths>9.17</memberMonths>
        <totalAllowedClaims>0.00</totalAllowedClaims>
        <totalPaidClaims>0.00</totalPaidClaims>
        <moopAdjustedPaidClaims>0.00</moopAdjustedPaidClaims>
        <cSRMOOPAdjustment>0.00</cSRMOOPAdjustment>
        <estimatedRIPayment>0.00</estimatedRIPayment>
        <coinsurancePercentPayments>0.00</coinsurancePercentPayments>
        <includedPlanIdentifier>
            <planIdentifier>25032VA013000101</planIdentifier>
            <includedClaimIdentifier>
                <claimIdentifier></claimIdentifier>
                <claimPaidAmount>0</claimPaidAmount>
                <crossYearClaimIndicator>N</crossYearClaimIndicator>
            </includedClaimIdentifier>
        </includedPlanIdentifier>
    </includedInsuredMemberIdentifier>
</riDetailEnrolleeReport>

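I would like to:

1. Read the XML into R
2. Locate a specific insuredMemberIdentifier
3. Extract the planIdentifier and all claimIdentifier data associated with the member ID from (2)
4. Store all text and values for insuredMemberIdentifier, planIdentifier, claimIdentifier, and claimPaidAmount in a data.frame, with one row for each unique claim ID (member ID to claim ID is one-to-many)

So far I have accomplished 1 and I am in the ballpark on 2: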

## Step 1 ##
library(xml2)
ride <- read_xml("/Users/temp/Desktop/RIDetailEnrolleeReport.xml")

## Step 2 -- assume the insuredMemberIdentifier of interest is 'ARS001' ##
memID <- xml_find_all(ride, "//d1:insuredMemberIdentifier[text()='ARS001']", xml_ns(ride))

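[I know I can use xml_text() to extract the text of an element.]

After the code in step 2 above, I tried using xml_parent() to locate the parent node of the insuredMemberIdentifier, saving it as a variable, and then repeating step 2 on that saved node to get the claim information.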

node <- xml_parent(memID)
xml_find_all(node, "//d1:claimIdentifier", xml_ns(ride))

But this just pulls every claimIdentifier in the whole file.
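The behaviour follows from how XPath resolves paths: an expression starting with `//` is evaluated from the document root regardless of which node is passed in, while `.//` searches only below the current node. A minimal sketch of the difference, using a cut-down stand-in for the report (the claim ID `CLAIM-B` is made up for illustration, as is the shortened structure):

```r
library(xml2)

# Cut-down stand-in for the report: same namespace, far fewer fields.
# "CLAIM-B" is a hypothetical value, not taken from the real file.
doc <- read_xml('<riDetailEnrolleeReport xmlns="http://vo.edge.fm.cms.hhs.gov">
  <includedInsuredMemberIdentifier>
    <insuredMemberIdentifier>ARS001</insuredMemberIdentifier>
    <includedPlanIdentifier>
      <planIdentifier>25032VA013000101</planIdentifier>
      <includedClaimIdentifier>
        <claimIdentifier>CADULT4SM00101</claimIdentifier>
      </includedClaimIdentifier>
    </includedPlanIdentifier>
  </includedInsuredMemberIdentifier>
  <includedInsuredMemberIdentifier>
    <insuredMemberIdentifier>ARS002</insuredMemberIdentifier>
    <includedPlanIdentifier>
      <planIdentifier>25032VA013000101</planIdentifier>
      <includedClaimIdentifier>
        <claimIdentifier>CLAIM-B</claimIdentifier>
      </includedClaimIdentifier>
    </includedPlanIdentifier>
  </includedInsuredMemberIdentifier>
</riDetailEnrolleeReport>')

ns <- xml_ns(doc)  # the default namespace is exposed under the prefix d1

# Locate the member of interest and step up to its enclosing node
member <- xml_find_first(doc, "//d1:insuredMemberIdentifier[text()='ARS001']", ns)
node <- xml_parent(member)

# "//" restarts at the document root, so both members' claims come back
all_claims <- xml_text(xml_find_all(node, "//d1:claimIdentifier", ns))

# ".//" stays below `node`, so only ARS001's claim comes back
own_claims <- xml_text(xml_find_all(node, ".//d1:claimIdentifier", ns))

print(all_claims)
print(own_claims)
```

Only the leading dot differs between the two queries.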

Any help or information on how to get to step 4 above would be greatly appreciated. Thank you in advance.

score 0 · Accepted Answer

Apologies for the late response, but for posterity: import the data with xml2 as described above, then parse the XML file by ID, following har07's hint.

# output object to collect all claims
res <- data.frame(
    insuredMemberIdentifier = rep(NA, 1), 
    planIdentifier = NA, 
    claimIdentifier = NA, 
    claimPaidAmount = NA)
# vector of ids of interest
ids <- c('ARS001')
# indexing counter
starti <- 1
# loop through all ids
for (ii in seq_along(ids)) {
    # find ii-th id
    ## Step 2 -- assume the insuredMemberIdentifier of interest is 'ARS001' ##
    memID <- xml_find_all(x = ride, 
        xpath = paste0("//d1:insuredMemberIdentifier[text()='", ids[ii], "']"))
    # find the parent node (the enclosing includedInsuredMemberIdentifier)
    node <- xml_parent(memID)
    # as har07's comment find claim id within this node
    cid <- xml_find_all(node, ".//d1:claimIdentifier", xml_ns(ride))
    pid <- xml_find_all(node, ".//d1:planIdentifier", xml_ns(ride))
    cpa <- xml_find_all(node, ".//d1:claimPaidAmount", xml_ns(ride))
    # add invalid data handling if necessary
    if (length(cid) != length(cpa)) {
        warning(paste("cid and cpa do not match for", ids[ii]))
        next
    }
    # collect outputs 
    res[seq_along(cid) + starti - 1, ] <- list(
        ids[ii], 
        xml_text(pid),
        xml_text(cid),
        xml_text(cpa))
    # adjust counter to add next id into correct row
    starti <- starti + length(cid)
}
res
#   insuredMemberIdentifier   planIdentifier claimIdentifier claimPaidAmount
# 1                  ARS001 25032VA013000101  CADULT4SM00101          100.00
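The loop above pairs `pid`, `cid` and `cpa` by position, which works for the sample file, where each member has one plan and one claim. If a member can have several plans with different numbers of claims, one possible alternative is to start from each claim node and look upward for its plan. The sketch below is an assumption-laden illustration rather than part of the original answer: `claims_for_member()` is a hypothetical helper, the test document and its IDs (`M1`, `P1`, `C1`, ...) are invented, and member IDs are assumed not to contain a single quote.

```r
library(xml2)

# Hypothetical helper: one row per claim for a single member ID.
claims_for_member <- function(doc, id) {
  ns <- xml_ns(doc)
  # Every claim node that sits below the member whose ID matches
  xpath <- sprintf(
    "//d1:includedInsuredMemberIdentifier[d1:insuredMemberIdentifier='%s']//d1:includedClaimIdentifier",
    id
  )
  claims <- xml_find_all(doc, xpath, ns)
  # For each claim, walk up to its enclosing plan and read the plan ID
  plan <- xml_find_first(claims, "ancestor::d1:includedPlanIdentifier/d1:planIdentifier", ns)
  data.frame(
    insuredMemberIdentifier = rep(id, length(claims)),
    planIdentifier = xml_text(plan),
    claimIdentifier = xml_text(xml_find_first(claims, "./d1:claimIdentifier", ns)),
    claimPaidAmount = as.numeric(xml_text(xml_find_first(claims, "./d1:claimPaidAmount", ns))),
    stringsAsFactors = FALSE
  )
}

# Invented test document: member M1 has two plans with 2 and 1 claims
doc <- read_xml('<riDetailEnrolleeReport xmlns="http://vo.edge.fm.cms.hhs.gov">
  <includedInsuredMemberIdentifier>
    <insuredMemberIdentifier>M1</insuredMemberIdentifier>
    <includedPlanIdentifier>
      <planIdentifier>P1</planIdentifier>
      <includedClaimIdentifier>
        <claimIdentifier>C1</claimIdentifier>
        <claimPaidAmount>10.50</claimPaidAmount>
      </includedClaimIdentifier>
      <includedClaimIdentifier>
        <claimIdentifier>C2</claimIdentifier>
        <claimPaidAmount>4.50</claimPaidAmount>
      </includedClaimIdentifier>
    </includedPlanIdentifier>
    <includedPlanIdentifier>
      <planIdentifier>P2</planIdentifier>
      <includedClaimIdentifier>
        <claimIdentifier>C3</claimIdentifier>
        <claimPaidAmount>0</claimPaidAmount>
      </includedClaimIdentifier>
    </includedPlanIdentifier>
  </includedInsuredMemberIdentifier>
  <includedInsuredMemberIdentifier>
    <insuredMemberIdentifier>M2</insuredMemberIdentifier>
    <includedPlanIdentifier>
      <planIdentifier>P1</planIdentifier>
      <includedClaimIdentifier>
        <claimIdentifier>C4</claimIdentifier>
        <claimPaidAmount>7</claimPaidAmount>
      </includedClaimIdentifier>
    </includedPlanIdentifier>
  </includedInsuredMemberIdentifier>
</riDetailEnrolleeReport>')

res_m1 <- claims_for_member(doc, "M1")
print(res_m1)

# Several IDs can be stacked into one data.frame
res_all <- do.call(rbind, lapply(c("M1", "M2"), claims_for_member, doc = doc))
print(res_all)
```

Because each row is built from a single claim node, the plan ID always belongs to that claim, and no row counter is needed.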
