0

I want to remove all invalid text from an XML document. I consider any text not wrapped in <> XML brackets to be invalid, and want to strip these prior to translation.

From this post Regular expression to remove text outside the tags in a string - it explains how to match XML brackets together. However on my example it doesn't clean up the text outside of the XML as can be seen in this example. https://regex101.com/r/6iUyia/1

I dont think this specific example has been asked on S/O before from my initial research.

Currently in my code, I have this XML as a string, before I compose an XDocument from it later on. So I potentially have string, Regex and XDocument methods available to assist in removing this, there could additionally be more than one bit of invalid XML present in these documents. Additionally, I do not wish to use XSLT to remove these values.

One of the very rudimentary idea's I tried and failed to compose, was to iterate over the string as a char array, and attempting to remove it if it was outside of '>' and '<' but decided there must be a better way to achieve this (hence the question)

This is an example of the input, with invalid text being displayed between nested-A and nested-B

 <ASchema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xdt="http://www.w3.org/2005/xpath-datatypes" xmlns:fn="http://www.w3.org/2005/xpath-functions">
   <A>
         <nested-A>valid text</nested-A>
         Remove text not inside valid xml braces
         <nested-B>more valid text here</nested-B>
   </A>
</ASchema>

I expect the output to be in a format like the below.

 <ASchema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xdt="http://www.w3.org/2005/xpath-datatypes" xmlns:fn="http://www.w3.org/2005/xpath-functions">
   <A>
         <nested-A>valid text</nested-A>
         <nested-B>more valid text here</nested-B>
   </A>
</ASchema>
4

1 回答 1

1

You could do the following . Please note I have done very limited testing, kindly let me know if it fails in some scenarios .

XmlDocument doc = new XmlDocument();
doc.LoadXml(str);
var json = JsonConvert.SerializeXmlNode(doc);

string result = JToken.Parse(json).RemoveFields().ToString(Newtonsoft.Json.Formatting.None);
var xml = (XmlDocument)JsonConvert.DeserializeXmlNode(result);

Where RemoveFields are defined as

public static class Extensions
{
public static JToken RemoveFields(this JToken token)
{
    JContainer container = token as JContainer;
    if (container == null) return token;

    List<JToken> removeList = new List<JToken>();
    foreach (JToken el in container.Children())
    {
        JProperty p = el as JProperty;
        if (p != null && p.Name.StartsWith("#"))
        {
            removeList.Add(el);
        }
        el.RemoveFields();
    }

    foreach (JToken el in removeList)
        el.Remove();

    return token;
}
}

Output

<ASchema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xdt="http://www.w3.org/2005/xpath-datatypes" xmlns:fn="http://www.w3.org/2005/xpath-functions">
   <A>
      <nested-A>valid text</nested-A>
      <nested-B>more valid text here</nested-B>
   </A>
</ASchema>

Please note am using Json.net in above code

于 2019-01-03T05:37:45.563 回答