I receive a large XML file, and often it does not validate against the schema file. Instead of dropping the whole XML file, I would like to remove the "invalid" content and save the rest of the XML file.
I'm using xmllint to validate the XML with this command:
xmllint -schema testSchedule.xsd testXML.xml
The XSD file (in this example named testSchedule.xsd):
<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://www.testing.dk" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="MasterData">
<xs:complexType>
<xs:sequence>
<xs:element name="Items">
<xs:complexType>
<xs:sequence>
<xs:element name="Item" maxOccurs="unbounded" minOccurs="0">
<xs:complexType>
<xs:sequence>
<xs:element type="xs:integer" name="Id" minOccurs="1"/>
<xs:element type="xs:integer" name="Width" minOccurs="1"/>
<xs:element type="xs:integer" name="Height" minOccurs="0"/>
<xs:element type="xs:string" name="Remark"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
And the XML file (in this example named testXML.xml):
<?xml version="1.0" encoding="ISO-8859-1" ?>
<MasterData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://www.testing.dk">
<Items>
<Item>
<Id>1</Id>
<Width>10</Width>
<Height>100</Height>
<Remark>This is OK</Remark>
</Item>
<Item>
<Id>2</Id>
<Width>20</Width>
<Height>200</Height>
<Remark>This is OK - But is missing Height a non mandatory field</Remark>
</Item>
<Item>
<Id>3</Id>
<Height>300</Height>
<Remark>This is NOT OK - Missing the mandatory Width</Remark>
</Item>
<Item>
<Id>4</Id>
<Width>TheIsAString</Width>
<Height>200</Height>
<Remark>This is NOT OK - Width is not an integer but a string</Remark>
</Item>
<Item>
<Id>5</Id>
<Width>50</Width>
<Height>500</Height>
<Remark>This is OK and the last</Remark>
</Item>
</Items>
</MasterData>
Then I get this result from the xmllint command:
testXML.xml:18: element Height: Schemas validity error : Element '{http://www.testing.dk}Height': This element is not expected. Expected is ( {http://www.testing.dk}Width ).
testXML.xml:23: element Width: Schemas validity error : Element '{http://www.testing.dk}Width': 'TheIsAString' is not a valid value of the atomic type 'xs:integer'.
testXML.xml fails to validate
And that is all correct: there are two errors in the XML file.
Now I would like a tool of some kind that removes entries 3 and 4, so that I end up with this result:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<MasterData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://www.testing.dk">
<Items>
<Item>
<Id>1</Id>
<Width>10</Width>
<Height>100</Height>
<Remark>This is OK</Remark>
</Item>
<Item>
<Id>2</Id>
<Width>20</Width>
<Height>200</Height>
<Remark>This is OK - But is missing Height a non mandatory field</Remark>
</Item>
<Item>
<Id>5</Id>
<Width>50</Width>
<Height>500</Height>
<Remark>This is OK and the last</Remark>
</Item>
</Items>
</MasterData>
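To make the goal concrete, here is a minimal Python sketch of one possible approach, not a finished tool. It reads the line numbers from the xmllint error output shown above and drops every Item block that contains one of those lines. The function name drop_invalid_items and the ERROR_LINE pattern are made up for this illustration. The sketch assumes that the opening and closing Item tags each sit on their own line without attributes (as in the sample), and that the file name contains no colon.

```python
import re

# Picks the line number out of xmllint messages such as
# "testXML.xml:18: element Height: Schemas validity error : ..."
ERROR_LINE = re.compile(r"^[^:\n]+:(\d+): .*Schemas validity error", re.MULTILINE)


def drop_invalid_items(xml_text, xmllint_output, tag="Item"):
    """Return xml_text without the <tag> blocks that contain a reported error line.

    Assumption: <tag> and </tag> each appear on their own line, without attributes.
    """
    bad_lines = {int(n) for n in ERROR_LINE.findall(xmllint_output)}
    kept = []          # lines that go into the result
    block = None       # lines of the <tag> block currently being read, if any
    block_bad = False  # True once an error line was seen inside the block

    for number, line in enumerate(xml_text.splitlines(keepends=True), start=1):
        stripped = line.strip()
        if block is None and stripped.startswith("<" + tag + ">"):
            block, block_bad = [], False   # entering a new block
        if block is None:
            kept.append(line)              # outside any block: always keep
            continue
        block.append(line)
        if number in bad_lines:
            block_bad = True
        if stripped.endswith("</" + tag + ">"):
            if not block_bad:
                kept.extend(block)         # keep only error-free blocks
            block = None

    if block is not None:
        kept.extend(block)                 # unterminated block: leave untouched
    return "".join(kept)
```

The error text could be captured with something like xmllint --noout --schema testSchedule.xsd testXML.xml 2> errors.txt, since xmllint writes the validation messages to stderr. The filtered file should be validated again afterwards, because this only reacts to the lines xmllint actually reported.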
Does anybody here have a tool that can do this? I'm currently using bash scripting and xmllint. I really hope somebody can help.