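I would like to know whether anyone has managed to read an SDMX-XML file into a data frame. The file I want to read is https://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml (1 MB). I saved the file to my working directory as "pensions_funds.xml" and tried to read it with the XML package: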

fileName <- system.file("pensions", "pensions_funds.xml", package="XML")
parsed<-xmlTreeParse("pension_funds.xml",getDTD=F)
r<-xmlRoot(parsed)
tmp = xmlSApply(r, function(x) xmlSApply(x, xmlValue))

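The lines above basically follow the example at http://www.omegahat.org/RSXML/gettingStarted.html, but I think I first need to skip the header somehow (I have pasted the first part of the file I am trying to read below). So the approach above might work, but for my purposes it starts at the wrong node. I want to get the obs_values indexed by time_period and ref_area.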

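The first step is to find the right node and start from there, but I suspect I may be on a fool's errand, because my knowledge of the data format is limited and I am not sure whether the XML package can be used for SDMX-XML files. Smarter people seem to have attempted this already, see http://opensdmxdevelopers.wikispaces.com/RSDMX. I cannot find this package for download on its home page at https://r-forge.r-project.org/projects/rsdmx/ (I see no link or download section, but maybe I am blind), and it seems to be at an early stage. The existence of rsdmx suggests that reading SDMX with the XML package may not be easy, so I am ready to give up at this point unless someone has had success with it. Actually, I am mainly interested in reading this file: http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml, but it is a 10 MB file, so I started small.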

Edit 3: An attempt at sgibb's answer on the large file, using the changes from Mischa's comment:

library("XML")

url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"

sdmxHandler <- function() {
  ## data.frame which stores results
  data <- data.frame(stringsAsFactors=FALSE)
  ## counter to store current row
  i <- 1
  ## temp values to store current REF_AREA, BS_ITEM and BS_COUNT_SECTOR
  refArea <- NA
  bsItem <- NA
  bsCountSector <- NA

  ## handler subroutine for Obs tag
  Obs <- function(name, attr) {
    ## found an Obs tag and now fill data.frame
    data[i, "refArea"] <<- refArea
    data[i, "timePeriod"] <<- as.numeric(attr["TIME_PERIOD"])
    data[i, "obsValue"] <<- as.numeric(attr["OBS_VALUE"])
    data[i, "bsItem"] <<- bsItem
    data[i, "bsCountSector"] <<- bsCountSector
    i <<- i + 1
  }

  ## handler subroutine for Series tag
  Series <- function(name, attr) {
    refArea <<- attr["REF_AREA"]
    bsItem <<- as.character(attr["BS_ITEM"])
    bsCountSector <<- as.numeric(attr["BS_COUNT_SECTOR"])
  }
  return(list(getData=function() {return(data)},
              Obs=Obs, Series=Series))
}

## run parser
df <- xmlEventParse(file(url), handlers=sdmxHandler())$getData()
Specification mandate value for attribute OBS_VALUE
attributes construct error
Couldn't find end of Start Tag Obs line 15108
Premature end of data in tag Series line 15041
Premature end of data in tag DataSet line 91
Premature end of data in tag CompactData line 2
Error: 1: Specification mandate value for attribute OBS_VALUE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 15108
4: Premature end of data in tag Series line 15041
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2
In addition: There were 50 or more warnings (use warnings() to see the first 50)

Edit 2: sgibb's answer looks ideal and works perfectly on the smaller file. I tried to run it on

url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"

(the 10 MB file; the original link has been corrected), with the only modification being two added lines:

data[i, "bsItem"] <<- as.character(attr["BS_ITEM"])

data[i, "bsCountSector"] <<- as.numeric(attr["BS_COUNT_SECTOR"])

(these are additional id variables needed to identify a row in this larger dataset). It ran for a few minutes and then stopped with this error:

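Error: 1: Specification mandate value for attribute TIME_PE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 20743
4: Premature end of data in tag Series line 20689
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2

In addition: There were 50 or more warnings (use warnings() to see the first 50)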

The basic format of the data looks very similar, so I thought this might work. The basic format of the 10 MB file is as follows:

    <Series FREQ="M" REF_AREA="AT" ADJUSTMENT="N" BS_REP_SECTOR="A" BS_ITEM="A20" MATURITY_ORIG="A" DATA_TYPE="1" COUNT_AREA="U2" BS_COUNT_SECTOR="0000" CURRENCY_TRANS="Z01" BS_SUFFIX="E" TIME_FORMAT="P1M" COLLECTION="E">
        <Obs TIME_PERIOD="1997-09" OBS_VALUE="275.3" OBS_STATUS="A" OBS_CONF="F"/>
        <Obs TIME_PERIOD="1997-10" OBS_VALUE="275.9" OBS_STATUS="A" OBS_CONF="F"/>
        <Obs TIME_PERIOD="1997-11" OBS_VALUE="276.6" OBS_STATUS="A" OBS_CONF="F"/>

Edit 1:

Desired data format:

Ref_area    time_period    obs_value
At          2006           118
At          2007           119
…
Be          2006           101
…

This is the first bit of the data.

    </Header>
    <DataSet xsi:schemaLocation="https://www.ecb.europa.eu/vocabulary/stats/icpf/1 https://www.ecb.europa.eu/stats/sdmx/icpf/1/structure/2011-08-11/sdmx-compact.xsd" xmlns="https://www.ecb.europa.eu/vocabulary/stats/icpf/1">
<Group DECIMALS="0" TITLE_COMPL="Austria, reporting institutional sector Insurance corporations and pension funds - Closing balance sheet - All financial assets and liabilities - counterpart area World (all entities), counterpart institutional sector Total economy including Rest of the World (all sectors) - Credit (resources/liabilities) - Non-consolidated, Current prices - Euro, Neither seasonally nor working day adjusted - ESA95 TP table Not applicable" UNIT_MULT="9" UNIT="EUR" ESA95TP_SUFFIX="Z" ESA95TP_DENOM="E" ESA95TP_CONS="N" ESA95TP_DC_AL="2" ESA95TP_CPSECTOR="S" ESA95TP_CPAREA="A1" ESA95TP_SECTOR="S125" ESA95TP_ASSET="F" ESA95TP_TRANS="LE" ESA95TP_PRICE="V" ADJUSTMENT="N" REF_AREA="AT"/><Series ESA95TP_SUFFIX="Z" ESA95TP_DENOM="E" ESA95TP_CONS="N" ESA95TP_DC_AL="2" ESA95TP_CPSECTOR="S" ESA95TP_CPAREA="A1" ESA95TP_SECTOR="S125" ESA95TP_ASSET="F" ESA95TP_TRANS="LE" ESA95TP_PRICE="V" ADJUSTMENT="N" REF_AREA="AT" COLLECTION="E" TIME_FORMAT="P1Y" FREQ="A"><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="112" TIME_PERIOD="2008"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="119" TIME_PERIOD="2009"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="125" TIME_PERIOD="2010"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="127" TIME_PERIOD="2011"/></Series><Group D
3 Answers

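RSDMX seems to be in an early state of development. IMHO there is no usable package yet. But you can easily implement this yourself with the XML package. I suggest using xmlEventParse (see ?xmlEventParse for details):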

EDIT: adapted the example to the changed requirements for outstanding_amounts.xml
EDIT2: added download.file

library("XML")

#url <- "http://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml"
url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"

## download xml file to avoid download errors disturbing xmlEventParse
tmp <- tempfile()
download.file(url, tmp) 

sdmxHandler <- function() {
  ## data.frame which stores results
  data <- data.frame(stringsAsFactors=FALSE)
  ## counter to store current row
  i <- 1
  ## temp value to store current REF_AREA, BS_ITEM and BS_COUNT_SECTOR
  refArea <- NA
  bsItem <- NA
  bsCountSector <- NA

  ## handler subroutine for Obs tag
  Obs <- function(name, attr) {
    ## found an Obs tag and now fill data.frame
    data[i, "refArea"] <<- refArea
    data[i, "bsItem"] <<- bsItem
    data[i, "bsCountSector"] <<- bsCountSector
    data[i, "timePeriod"] <<- as.Date(paste(attr["TIME_PERIOD"], "-01", sep=""), format="%Y-%m-%d")
    data[i, "obsValue"] <<- as.double(attr["OBS_VALUE"])
    ## update current row
    i <<- i + 1
  }

  ## handler subroutine for Series tag
  Series <- function(name, attr) {
    refArea <<- attr["REF_AREA"]
    bsItem <<- attr["BS_ITEM"]
    bsCountSector <<- as.numeric(attr["BS_COUNT_SECTOR"])
  }

  return(list(getData=function() {return(data)},
              Obs=Obs, Series=Series))
}

## run parser
df <- xmlEventParse(tmp, handlers=sdmxHandler())$getData()

head(df)
#  refArea bsItem bsCountSector timePeriod obsValue
#1      DE    A20          2210      12053     39.6
#2      DE    A20          2210      12084     46.1
#3      DE    A20          2210      12112     50.2
#4      DE    A20          2210      12143     52.0
#5      DE    A20          2210      12173     52.3
#6      DE    A20          2210      12204     47.3
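
A note on the timePeriod column above: the values 12053, 12084, ... look like day counts since 1970-01-01, presumably because the Date class is lost when the value is assigned into a single data.frame cell. A minimal sketch of turning them back into dates, assuming that interpretation holds (the vector below simply copies the six values printed above, and the variable names are illustrative):

```r
## day counts copied from the head(df) output above
day_counts <- c(12053, 12084, 12112, 12143, 12173, 12204)

## interpret them as days since the Unix epoch
time_period <- as.Date(day_counts, origin = "1970-01-01")

## monthly labels in the same style as the TIME_PERIOD attribute
period_label <- format(time_period, "%Y-%m")
print(period_label)
# [1] "2003-01" "2003-02" "2003-03" "2003-04" "2003-05" "2003-06"
```

On the full result the same conversion would be df$timePeriod <- as.Date(df$timePeriod, origin = "1970-01-01").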
Answered 2012-08-13T13:23:14.857

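rsdmx allows you to read SDMX-ML files and coerce them to a data.frame. It is now hosted on GitHub and is currently available on CRAN, but you can also install it easily from GitHub with the following: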

library("devtools")
install_github("opensdmx/rsdmx")

应用到您的数据,您可以执行以下操作:

library("rsdmx")
sdmx <- readSDMX("http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml")
df <- as.data.frame(sdmx)

More examples are available in the rsdmx wiki.

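Note that its functions currently load the whole XML object into R, as a slot of the SDMX R objects that rsdmx instantiates. In the future, we would like to investigate how rsdmx could use xmlEventParse (as suggested above by @sgibb) to read very large datasets.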

Answered 2014-09-02T10:38:45.813

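library(XML)

## `url` is assumed to point at the SDMX file, e.g. the pension funds file from the question
url <- "http://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml"

## download first, then parse the local copy
tmp <- tempfile()
download.file(url, tmp)
xmlparsed <- xmlParse(tmp)

## obtain the Series nodes
series_data <- getNodeSet(xmlparsed, "//Series")

## fall back to walking the tree if the XPath query finds nothing
if (length(series_data) == 0) {
  datasetnode <- xmlChildren(xmlChildren(xmlparsed)[[1]])[[2]]
  series_data <- xmlChildren(datasetnode)[names(xmlChildren(datasetnode)) == "Series"]
}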

## prepare dataset

dataset.frame <- data.frame(matrix(ncol=3))
colnames(dataset.frame) <- c('REF_AREA', 'TIME_PERIOD', 'OBS_VALUE')
## loop over data

counter=1
for (i in 1: length(series_data)){
  if('Obs'%in%names(xmlChildren(series_data[[i]])) ){ ## To ignore empty //Series nodes
    for (j in 1: length(xmlChildren(series_data[[i]]))){
      dataset.frame[counter,1] <- xmlAttrs(series_data[[i]])['REF_AREA']
      dataset.frame[counter,2] <- xmlAttrs(series_data[[i]][[j]])['TIME_PERIOD']
      dataset.frame[counter,3] <- xmlAttrs(series_data[[i]][[j]])['OBS_VALUE']
      counter=counter+1
    }
  }
}


head(dataset.frame,5)
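
The "//Series" query above probably comes back empty because the ECB file declares a default namespace on DataSet, so un-prefixed XPath names do not match. A hedged alternative sketch is to match on local-name() instead, which ignores namespaces. The inline document and its example.org namespace URIs below are made up for illustration, with attribute values borrowed from the question:

```r
library(XML)

## tiny stand-in for the ECB file; the namespace URIs are hypothetical
xml_text <- '<CompactData xmlns="http://example.org/message">
  <DataSet xmlns="http://example.org/data">
    <Series REF_AREA="AT" FREQ="A">
      <Obs TIME_PERIOD="2008" OBS_VALUE="112"/>
      <Obs TIME_PERIOD="2009" OBS_VALUE="119"/>
    </Series>
    <Series REF_AREA="BE" FREQ="A"/>
  </DataSet>
</CompactData>'
doc <- xmlParse(xml_text, asText = TRUE)

## un-prefixed names do not match elements in a default namespace
plain_hits <- getNodeSet(doc, "//Series")

## local-name() matches on the element name regardless of namespace
obs_nodes <- getNodeSet(doc, "//*[local-name()='Series']/*[local-name()='Obs']")

## one row per Obs; REF_AREA is read from the parent Series node
result <- data.frame(
  REF_AREA    = sapply(obs_nodes, function(n) xmlGetAttr(xmlParent(n), "REF_AREA")),
  TIME_PERIOD = sapply(obs_nodes, xmlGetAttr, "TIME_PERIOD"),
  OBS_VALUE   = as.numeric(sapply(obs_nodes, xmlGetAttr, "OBS_VALUE")),
  stringsAsFactors = FALSE
)
print(result)
```

Empty Series nodes (such as the BE one here) contribute no rows, so the explicit 'Obs' %in% names(...) check is not needed with this approach.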
Answered 2012-08-13T14:17:46.797