I have a large XML file containing many reports. The example below approximates its size and structure:
x <- "<Report><Agreements><AgreementList /></Agreements><CIP><RecordList><Record><Date>2017-05-26T00:00:00</Date><Grade>2</Grade><ReasonsList><Reason><Code>R</Code><Description>local</Description></Reason></ReasonsList><Score>xxx</Score></Record><Record><Date>2017-04-30T00:00:00</Date><Grade>2</Grade><ReasonsList><Reason><Code>R</Code><Description/></Reason></ReasonsList><Score>xyx</Score></Record></RecordList></CIP><Individual><Contact><Email/></Contact><General><FirstName>MM</FirstName></General></Individual><Inquiries><InquiryList><Inquiry><DateOfInquiry>2017-03-19</DateOfInquiry><Reason>cc</Reason></Inquiry><Inquiry><DateOfInquiry>2016-10-14</DateOfInquiry><Reason>er</Reason></Inquiry></InquiryList><Summary><NumberOfInquiries>2</NumberOfInquiries></Summary></Inquiries></Report>"
# Repeat the report 150,000 times and wrap everything in a single root element
x <- paste(rep(x, 1.5e+5), collapse = "")
x <- paste0("<R>", x, "</R>")
library(XML)
p <- xmlParse(x)
p <- xmlRoot(p)
p[[1]]  # first report
I would like to transform this data into a data.frame, but the XML structure isn't straightforward. When I worked with XML before, I wrote a loop that, for every report, converted its sub-nodes to a data.frame. Here, however, each report has more than 30 sub-nodes (I didn't include all of them in the example), and the structure varies: list nodes can occur even two levels deep in the XML.
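For context, the loop-based approach looks roughly like the sketch below. It is simplified to a single list node (`InquiryList`) on a tiny stand-in document; the `report` index column and the variable names are only illustrative, not part of my real data.

```r
library(XML)

# Tiny stand-in document using the same element names as the example above
doc <- xmlParse(paste0(
  "<R><Report><Inquiries><InquiryList>",
  "<Inquiry><DateOfInquiry>2017-03-19</DateOfInquiry><Reason>cc</Reason></Inquiry>",
  "<Inquiry><DateOfInquiry>2016-10-14</DateOfInquiry><Reason>er</Reason></Inquiry>",
  "</InquiryList></Inquiries></Report></R>"
), asText = TRUE)

reports <- getNodeSet(doc, "/R/Report")

# One data.frame per report, built from the Inquiry nodes of that report
inquiries <- lapply(seq_along(reports), function(i) {
  nodes <- getNodeSet(reports[[i]], "./Inquiries/InquiryList/Inquiry")
  if (length(nodes) == 0) return(NULL)
  data.frame(
    report = i,  # hypothetical key linking rows back to their report
    DateOfInquiry = sapply(nodes, function(n) xmlValue(n[["DateOfInquiry"]])),
    Reason = sapply(nodes, function(n) xmlValue(n[["Reason"]])),
    stringsAsFactors = FALSE
  )
})
inquiries <- do.call(rbind, inquiries)
```

Writing one such block by hand for each of the 30+ sub-nodes, and running it over 150,000 reports, is what I would like to avoid.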
So I have a few questions:
1) I am sure that looping over the reports isn't the best way to handle this. How should I approach this problem?
2) Can I somehow extract all the data of one report into one row of a data.frame (recursively, maybe)?
3) Or can I automatically create a separate data.frame for each list object in the XML?
Any help would be much appreciated.
Update:
For a single report, the result could look like this:
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 17 variables:
$ Record.1.Date : chr "2017-05-26T00:00:00"
$ Record.1.Grade : num 2
$ Record.1.Reason.1.Code : chr "R"
$ Record.1.Reason.1.Description: chr "local"
$ Record.1.Score : chr "xxx"
$ Record.2.Date : chr "2017-04-30T00:00:00"
$ Record.2.Grade : num 2
$ Record.2.Reason.1.Code : chr "R"
$ Record.2.Reason.1.Description: chr "NA"
$ Record.2.Score : chr "xyx"
$ Email.1 : chr "NA"
$ FirstName : chr "MM"
$ Inquiry.1.DateOfInquiry : POSIXct, format: "2017-03-19"
$ Inquiry.1.Reason : chr "cc"
$ Inquiry.2.DateOfInquiry : POSIXct, format: "2016-10-14"
$ Inquiry.2.Reason : chr "er"
$ NumberOfInquiries : num 2
Alternatively, as I mentioned above, the sub-lists could go into separate tables.
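As a possible starting point for the one-row layout (a rough sketch, not a full solution), a single report can be flattened with `xmlToList()` and `unlist()`. The document here is a small stand-in, and the resulting column names follow the full element path rather than the shorter names shown above. As far as I can tell, empty elements such as `<Email/>` are dropped, so the set of columns may differ between reports.

```r
library(XML)

# Small stand-in for one report, using element names from the example
doc <- xmlParse(paste0(
  "<R><Report>",
  "<Individual><Contact><Email/></Contact>",
  "<General><FirstName>MM</FirstName></General></Individual>",
  "<Inquiries><InquiryList>",
  "<Inquiry><DateOfInquiry>2017-03-19</DateOfInquiry><Reason>cc</Reason></Inquiry>",
  "<Inquiry><DateOfInquiry>2016-10-14</DateOfInquiry><Reason>er</Reason></Inquiry>",
  "</InquiryList></Inquiries>",
  "</Report></R>"
), asText = TRUE)

report <- xmlRoot(doc)[[1]]

# Recursively convert the node to a nested list, then flatten it
values <- unlist(xmlToList(report))

# Repeated elements yield duplicate names, so make them unique
names(values) <- make.unique(names(values))

# One row, one column per leaf value (all columns are character)
flat <- as.data.frame(as.list(values), stringsAsFactors = FALSE)
str(flat)
```

This still has to be applied to each report in turn, and it does not convert column types, so it only partly addresses questions 1) and 2).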