Background:
I'm somewhat familiar with parsing XML with Java via the DOM.
What I'm trying to do:
I'm trying to parse an HL7 / XML Structured Product Label from NLM Daily Med website. An example url of what I am trying to parse is : Atenolol SPL
What I've tried so far:
I've tried DOM, ElementTree, lxml, and minidom. The closest I've been able to come has been using this code:
#!/usr/bin/python3
import xml.sax
from xml.dom.minidom import parse
import xml.dom.minidom
# ------Using SAX Parser---------------
class MovieHandler(xml.sax.ContentHandler):
def __init__(self):
self.CurrentData = ""
self.type = ""
self.title = ""
self.text = ""
self.description = ""
self.displayName = ""
# Call when an element starts
def startElement(self, tag, attributes):
self.CurrentData = tag
if tag == "code":
print ("*****Section*****")
code = attributes["code"]
#displayName = attributes["displayName"]
print ("Code:", code)
#print("Display Name:", displayName)
# Call when an elements ends
def endElement(self, tag):
if self.CurrentData == "type":
print ("Type:", self.type)
elif self.CurrentData == "displayName":
print("Display Name:", self.displayName)
elif self.CurrentData == "title":
print ("Title:", self.CurrentData.title())
elif self.CurrentData == "text":
print ("Text:", self.text)
elif self.CurrentData == "description":
print ("Description:", self.description)
self.CurrentData = ""
# Call when a character is read
def characters(self, content):
if self.CurrentData == "type":
self.type = content
elif self.CurrentData == "format":
self.format = content
elif self.CurrentData == "year":
self.year = content
elif self.CurrentData == "rating":
self.rating = content
elif self.CurrentData == "stars":
self.stars = content
elif self.CurrentData == "description":
self.description = content
if (__name__ == "__main__"):
# create an XMLReader
parser = xml.sax.make_parser()
# turn off namepsaces
parser.setFeature(xml.sax.handler.feature_namespaces, 0)
# override the default ContextHandler
Handler = MovieHandler()
parser.setContentHandler(Handler)
parser.parse(saved_file_path)
The results in console are:
*****Section***** Code: 34391-3 Title: Title
*****Section***** Code: 57664-264
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 368GB5141J
*****Section***** Code: 70097M6I30
*****Section***** Code: 57664-264-88
*****Section***** Code: 57664-264-13
*****Section***** Code: 57664-264-18
*****Section***** Code: SPLCOLOR
*****Section***** Code: SPLSHAPE
*****Section***** Code: SPLSCORE
*****Section***** Code: SPLSIZE
*****Section***** Code: SPLIMPRINT
*****Section***** Code: SPLCOATING
*****Section***** Code: SPLSYMBOL
*****Section***** Code: 57664-265
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 368GB5141J
*****Section***** Code: 70097M6I30
*****Section***** Code: 57664-265-88
*****Section***** Code: 57664-265-13
*****Section***** Code: 57664-265-18
*****Section***** Code: SPLCOLOR
*****Section***** Code: SPLSHAPE
*****Section***** Code: SPLSCORE
*****Section***** Code: SPLSIZE
*****Section***** Code: SPLIMPRINT
*****Section***** Code: SPLCOATING
*****Section***** Code: SPLSYMBOL
*****Section***** Code: 57664-266
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 368GB5141J
*****Section***** Code: 70097M6I30
*****Section***** Code: 57664-266-88
*****Section***** Code: 57664-266-13
*****Section***** Code: 57664-266-18
*****Section***** Code: SPLCOLOR
*****Section***** Code: SPLSHAPE
*****Section***** Code: SPLSCORE
*****Section***** Code: SPLSIZE
*****Section***** Code: SPLIMPRINT
*****Section***** Code: SPLCOATING
*****Section***** Code: SPLSYMBOL
*****Section***** Code: 34066-1 Title: Title Title: Title
*****Section***** Code: 34089-3 Title: Title
*****Section***** Code: 34090-1 Title: Title Title: Title Title: Title Title: Title
*****Section***** Code: 34067-9 Title: Title Title: Title Title: Title Title: Title
*****Section***** Code: 34070-3 Title: Title
*****Section***** Code: 34071-1 Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title
*****Section***** Code: 42232-9 Title: Title
*****Section***** Code: 34072-9 Title: Title
*****Section***** Code: 34073-7 Title: Title
*****Section***** Code: 34083-6 Title: Title
*****Section***** Code: 34091-9 Title: Title
*****Section***** Code: 42228-7 Title: Title
*****Section***** Code: 34080-2 Title: Title
*****Section***** Code: 34081-0 Title: Title
*****Section***** Code: 34082-8 Title: Title Title: Title Title: Title
*****Section***** Code: 34084-4 Title: Title Title: Title Title: Title Text:
*****Section***** Code: 34088-5 Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Text:
*****Section***** Code: 34068-7 Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title
*****Section***** Code: 34069-5 Title: Title
Process finished with exit code 0
The issues / whats not working:
I dont really need the sections prior to the sections containing "Code: XXXXX-X"
For each of those sections I want to get the values for the <title>
, <text>
, and <paragraph>
tags for that section and all sub-sections of that section.
While I've been able to use the tutorials for DOM, ElementTree, lxml, and minidom, the target XML is non-standard and contains multiple attributes in a single tag, for example:
<code code="34090-1" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Clinical Pharmacology section" />
And some nodes/elements will contain a shortcut end tag (as seen above) while others will have a full traditional end tag.
No wonder healthcare is so complicated!
So how do I get the contents of the tag and iterate over the subsections to do the same?