I want to remove all invalid text from an XML document. I consider any text that is not wrapped in XML tags (<>) to be invalid, and want to strip it prior to translation.
The post "Regular expression to remove text outside the tags in a string" explains how to match XML brackets together. However, when applied to my example it doesn't remove the text outside of the XML elements, as can be seen here: https://regex101.com/r/6iUyia/1
From my initial research, I don't think this specific case has been asked on Stack Overflow before.
Currently in my code, I hold this XML as a string and compose an XDocument from it later on, so I potentially have string, Regex and XDocument methods available to help remove the text. There could be more than one piece of invalid text in these documents. I do not wish to use XSLT to remove these values.
One very rudimentary idea I tried, and failed to get working, was to iterate over the string as a char array and remove any character that fell outside of '>' and '<'. I decided there must be a better way to achieve this (hence the question).
This is an example of the input, with the invalid text appearing between nested-A and nested-B:
<ASchema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xdt="http://www.w3.org/2005/xpath-datatypes" xmlns:fn="http://www.w3.org/2005/xpath-functions">
<A>
<nested-A>valid text</nested-A>
Remove text not inside valid xml braces
<nested-B>more valid text here</nested-B>
</A>
</ASchema>
I expect the output to look like this:
<ASchema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xdt="http://www.w3.org/2005/xpath-datatypes" xmlns:fn="http://www.w3.org/2005/xpath-functions">
<A>
<nested-A>valid text</nested-A>
<nested-B>more valid text here</nested-B>
</A>
</ASchema>
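One possible direction, sketched under stated assumptions rather than offered as a definitive fix: the sample input is still well-formed XML (the stray text is just mixed content inside <A>), so it could be loaded with XDocument.Parse, and every text node whose parent also contains child elements could then be removed. The class name XmlCleaner and the method name StripStrayText are hypothetical. The sketch also assumes that no element legitimately mixes text with child elements; if that assumption does not hold, valid text would be removed as well.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XmlCleaner
{
    // Hypothetical helper: removes text nodes that sit alongside child elements.
    // Assumes the input is well-formed XML; XDocument.Parse throws otherwise.
    public static string StripStrayText(string xml)
    {
        // Default load options drop whitespace-only text nodes between elements.
        XDocument doc = XDocument.Parse(xml);

        // Snapshot the matches with ToList() before mutating the tree.
        var strayText = doc.DescendantNodes()
                           .OfType<XText>()
                           .Where(t => t.Parent != null && t.Parent.HasElements)
                           .ToList();

        foreach (XText node in strayText)
        {
            node.Remove();
        }

        return doc.ToString();
    }
}
```

Called on the input above, this should keep "valid text" and "more valid text here" (their parent elements have no child elements) and drop the line between nested-A and nested-B. Text outside the root element is not handled, because such a document would not parse in the first place.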