c# - Poorly defined XML: get the nodes and content of all child nodes as a concatenated string with spaces?

Question

Here is some awesome sample XML:

<root>
    <section>Here is some text<mightbe>a tag</mightbe>might <not attribute="be" />. Things are just<label>a mess</label>but I have to parse it because that's what needs to be done and I can't <font stupid="true">control</font> the source. <p>Why are there p tags here?</p>Who knows, but there may or may not be spaces around them so that's awesome. The point here is, there's node soup inside the section node and no definition for the document.</section>
</root>

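I just want to get the text from the section node and all of its child nodes as a single string. Note, however, that there may or may not be spaces around the child nodes, so I want to pad each child node with a space.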

Here is a more precise example of what the input might look like and what I would like the output to be:

<root>
    <sample>A good story is the<book>Hitchhikers Guide to the Galaxy</book>. It was published<date>a long time ago</date>. I usually read at<time>9pm</time>.</sample>
</root>

I would like the output to be:

A good story is the Hitchhikers Guide to the Galaxy. It was published a long time ago. I usually read at 9pm.

Note that there are no spaces around the child nodes, so I need to pad them; otherwise the words run together.

I tried using this sample code:

XDocument doc = XDocument.Parse(xml);
foreach(var node in doc.Root.Elements("section"))
{
    output += String.Join(" ", node.Nodes().Select(x => x.ToString()).ToArray()) + " ";
}

But the output includes the child tags, so it does not work.

Any suggestions here?

TL;DR: I was given node-soup XML and want to stringify it with padding around the child nodes.

score 1 · Accepted Answer

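If your tags can be nested to an unknown depth (e.g. <date>a <i>long</i> time ago</date>), you will probably also want recursion so that the formatting is applied consistently at every level. For example..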

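// Requires: using System; using System.Linq; using System.Xml.Linq;
private static string Parse(XElement root)
{
    return root
        .Nodes()
        // Text and CDATA nodes contribute their value, elements are flattened recursively, anything else is skipped.
        .Select(a => a is XText t ? t.Value : a is XElement e ? Parse(e) : String.Empty)
        // Aggregate without a seed throws on an empty sequence, e.g. for <not attribute="be" />.
        .DefaultIfEmpty(String.Empty)
        // Join fragments with one space, except before a fragment that starts with a period.
        .Aggregate((a, b) => String.Concat(a.Trim(), b.StartsWith(".") ? String.Empty : " ", b.Trim()));
}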

score 0 · Accepted Answer

You could try using XPath to extract what you need:

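// Requires: using System.IO; using System.Xml.XPath;
// XPathDocument(string) expects a URI, so wrap the XML text in a reader.
var docNav = new XPathDocument(new StringReader(xml));

// Create a navigator to query with XPath.
var nav = docNav.CreateNavigator();

// Find the text of every element under the root node
var expression = "/root//*/text()";

// Execute the XPath expression; a node-set query yields an XPathNodeIterator.
var textNodes = nav.Select(expression);

// Do some stuff with textNodes (call MoveNext() and read Current.Value)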
....

References: Querying XML, XPath syntax
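To show how the XPath results might be turned into one string, here is a minimal sketch. The class and method names (`XPathText`, `Join`) are hypothetical, and it assumes the same `/root//*/text()` expression as above. Each text fragment is trimmed and the fragments are joined with a single space, so a period that follows an element would still end up with a space in front of it.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

static class XPathText
{
    // Hypothetical helper: collects every text node under /root and joins them with spaces.
    public static string Join(string xml)
    {
        // Load the XML text through a reader (the string constructor would treat it as a URI).
        XPathNavigator nav = new XPathDocument(new StringReader(xml)).CreateNavigator();

        var parts = new List<string>();
        XPathNodeIterator it = nav.Select("/root//*/text()");
        while (it.MoveNext())
        {
            // Trim each fragment and skip the ones that are only whitespace.
            string value = it.Current.Value.Trim();
            if (value.Length > 0)
            {
                parts.Add(value);
            }
        }

        return string.Join(" ", parts);
    }
}
```

Calling `XPathText.Join(xml)` with the question's XML text should give the padded string, which can then be post-processed if the spacing around punctuation matters.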

score 0 · Accepted Answer

Here is a possible solution that follows your initial code:

// Requires: using System.Xml.Linq;
private string extractSectionContents(XElement section)
{
    string output = "";
    foreach(var node in section.Nodes())
    {
        if(node is XText text)
        {
            // Value returns the unescaped text (ToString() would return XML markup, e.g. &amp;).
            output += text.Value;
        }
        else if(node is XElement element)
        {
            // Pad the element's concatenated text with a space on each side.
            output += string.Format(" {0} ", element.Value);
        }
    }

    return output;
}

One problem with your logic is that a period ends up with a space in front of it whenever it comes right after an element.
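The stray space before a period can be tidied up after padding. The following is a minimal sketch, not part of the original answer: the names `SectionText` and `Flatten` are hypothetical, and it assumes that collapsing whitespace and removing a space before common punctuation is acceptable for your data.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class SectionText
{
    // Hypothetical helper: pads child elements with spaces, then normalizes the spacing.
    public static string Flatten(XElement section)
    {
        string padded = "";
        foreach (XNode node in section.Nodes())
        {
            if (node is XText text)
            {
                padded += text.Value;
            }
            else if (node is XElement element)
            {
                // Recurse so nested tags are padded the same way.
                padded += " " + Flatten(element) + " ";
            }
        }

        // Collapse any run of whitespace into a single space.
        string collapsed = Regex.Replace(padded, @"\s+", " ");

        // Remove the space that padding leaves in front of punctuation.
        string tidy = Regex.Replace(collapsed, @" ([.,;:!?])", "$1");

        return tidy.Trim();
    }
}
```

For the question's sample, `SectionText.Flatten(XDocument.Parse(xml).Root.Element("sample"))` yields `A good story is the Hitchhikers Guide to the Galaxy. It was published a long time ago. I usually read at 9pm.`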

score 0 · Accepted Answer

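You are looking at "mixed content" nodes. There is nothing particularly special about them: just get all the child nodes (text nodes are nodes too) and join their values with a space.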

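Something like

// Join with " " so adjacent fragments do not run together.
// The cast assumes every non-text child is an element (no comments or processing instructions).
var result = String.Join(" ",
  root.Nodes().Select(x => x is XText ? ((XText)x).Value : ((XElement)x).Value));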
