I want to split an XML-like string into tokens in C# or SQL. For example, the input string looks like this:

<entry><AUTHOR>C. Qiao</AUTHOR> and <AUTHOR>R.Melhem</AUTHOR>, "<TITLE>Reducing Communication </TITLE>",<DATE>1995</DATE>. </entry>

I want this output:

C       AUTHOR
.       AUTHOR
Qiao    AUTHOR
and 
R       AUTHOR
.       AUTHOR
Melhem  AUTHOR
,   
"
Reducing        TITLE
Communication   TITLE
"
,
1995    DATE
.
2 Answers

1

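Here is a first attempt at solving this, under the following assumptions:
1. The XML string is valid (i.e. there are no invalid characters between the tags),
like this: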

string xml = @"<ENTRY><AUTHOR>C. Qiao</AUTHOR>
                                  <AUTHOR>R.Melhem</AUTHOR>
                                  <TITLE>Reducing Communication </TITLE>
                                  <DATE>1995</DATE>
                           </ENTRY>";

2. Splitting is done on the space character ' '.

string xml = @"<ENTRY><AUTHOR>C. Qiao</AUTHOR>
                      <AUTHOR>R.Melhem</AUTHOR>
                      <TITLE>Reducing Communication </TITLE>
                      <DATE>1995</DATE>
               </ENTRY>";
XElement doc = XElement.Parse(xml);
foreach (XElement element in doc.Elements())
{
    // Split the text of each child element on single spaces
    var values = element.Value.Split(' ');
    foreach (string value in values)
    {
        Console.WriteLine(element.Name + " " + value);
    }
}

This prints:

AUTHOR C.
AUTHOR Qiao
AUTHOR R.Melhem
TITLE Reducing
TITLE Communication
TITLE
DATE 1995

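Edit:

Now, to split on both "." and a space, the best approach is a regular expression. Because the pattern is wrapped in a capture group, Regex.Split also returns the delimiters themselves. Like this: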

var values = Regex.Split(element.Value, @"(\.| )");
foreach (string value in values.Where(x => !String.IsNullOrWhiteSpace(x)))
{
    Console.WriteLine(element.Name + " " + value);
}

You can add more delimiters if you like. The code above gives you the following:

AUTHOR C
AUTHOR .
AUTHOR Qiao
AUTHOR R
AUTHOR .
AUTHOR Melhem
TITLE Reducing
TITLE Communication
DATE 1995
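
To make the role of the parentheses in the pattern concrete, here is a minimal sketch. The class name SplitDemo and the helper Tokens are made up for illustration and are not part of the answer's code. It runs the same input through Regex.Split once with a capture group and once without:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class SplitDemo
{
    // Hypothetical helper: split on the given pattern and drop empty or
    // whitespace-only pieces, as the Where(...) filter above does.
    public static string[] Tokens(string text, string pattern)
    {
        return Regex.Split(text, pattern)
                    .Where(t => !string.IsNullOrWhiteSpace(t))
                    .ToArray();
    }

    static void Main()
    {
        // With a capture group, the matched "." is returned as its own token.
        Console.WriteLine(string.Join("|", Tokens("C. Qiao", @"(\.| )")));  // C|.|Qiao

        // Without the group, the delimiters are discarded.
        Console.WriteLine(string.Join("|", Tokens("C. Qiao", @"\.| ")));    // C|Qiao
    }
}
```

The space is also captured by the group, but it is filtered out afterwards as whitespace, which is why only the "." survives as a token.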

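Edit 2:
Here is an example that works with your original string. It is most likely not the best approach, because it does not keep the tokens in their original order, but it should be very close: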

 string xml = @" <entry>
                            <AUTHOR>C. Qiao</AUTHOR> 
                            and 
                            <AUTHOR>R.Melhem</AUTHOR>, 
                            ""<TITLE>Reducing Communication </TITLE>""
                           ,<DATE>1995</DATE>. 
                           </entry>";
// Parse xml to XDocument
XDocument doc = XDocument.Parse(xml);

// Get first element (we only have one)
XElement element = doc.Descendants().FirstOrDefault();

// Create a copy of the element for use by child elements.
XElement copyElement = new XElement(element);
// Remove all child nodes from root leaving only text
element.Elements().Remove();

// Splitting based on the tokens specified
var values = Regex.Split(element.Value, @"(\.| |\,|\"")");
foreach (string value in values.Where(x => !String.IsNullOrWhiteSpace(x)))
{
    Console.WriteLine(value);
}
// Getting children nodes and splitting the same way
foreach (XElement elem in copyElement.Elements())
{
    var val = Regex.Split(elem.Value, @"(\.| |\,|\"")");
    foreach (string value in val.Where(x => !String.IsNullOrWhiteSpace(x)))
    {
        Console.WriteLine(value + " " + elem.Name);
    }
}
// You can try to play with DescendantsAndSelf
// to see if you can do it in a single pass and with order preserved.
//foreach (XElement elem in element.DescendantsAndSelf())
//{
//    //....
//}

This prints the following:

and
,
"
"
,
.
C AUTHOR
. AUTHOR
Qiao AUTHOR
R AUTHOR
. AUTHOR
Melhem AUTHOR
Reducing TITLE
Communication TITLE
1995 DATE
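
One possible way to keep the tokens in document order is sketched below. It assumes that walking the text nodes with DescendantNodes() is acceptable in place of the two separate passes above; the class name OrderedTokenizer and the method Tokenize are hypothetical names chosen for this illustration. The sketch reuses the same split pattern and tags each token with the name of its parent element, leaving the tag empty for text that sits directly under the root:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class OrderedTokenizer
{
    // Hypothetical helper: returns (token, tag) pairs in document order.
    public static List<KeyValuePair<string, string>> Tokenize(string xml)
    {
        XElement root = XElement.Parse(xml);
        var result = new List<KeyValuePair<string, string>>();

        // DescendantNodes() visits nodes in document order, so text directly
        // under the root is interleaved with the text of the child elements.
        foreach (XText text in root.DescendantNodes().OfType<XText>())
        {
            // Text that belongs to the root itself gets an empty tag name.
            string tag = text.Parent == root ? "" : text.Parent.Name.LocalName;

            // Same delimiters as above: ".", space, "," and the double quote.
            foreach (string token in Regex.Split(text.Value, @"(\.| |\,|\"")"))
            {
                if (!string.IsNullOrWhiteSpace(token))
                    result.Add(new KeyValuePair<string, string>(token, tag));
            }
        }
        return result;
    }

    static void Main()
    {
        string xml = @"<entry><AUTHOR>C. Qiao</AUTHOR> and <AUTHOR>R.Melhem</AUTHOR>, ""<TITLE>Reducing Communication </TITLE>"",<DATE>1995</DATE>. </entry>";
        foreach (var pair in Tokenize(xml))
            Console.WriteLine((pair.Key + " " + pair.Value).TrimEnd());
    }
}
```

For the single-line string from the question this should yield the tokens in the order the question asks for, with untagged tokens such as "and" printed on their own.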
answered 2012-12-03T06:09:12.183
0

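Edit: I just noticed that I misread the question. I copied the formatted XML from the first answer rather than from the question, and I did not notice the mixed-content nodes in the string. That makes it easier. A solution could look like this: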

using System;
using System.Linq;
using System.Text;

using System.Xml;
using System.Xml.Linq;

class Program
{
    static void Main(string[] args)
    {
        var xml = @"<entry><AUTHOR>C. Qiao</AUTHOR> and <AUTHOR>R.Melhem</AUTHOR>, ""<TITLE>Reducing Communication </TITLE>"",<DATE>1995</DATE>. </entry>";

        var elem = XElement.Parse(xml);

        var tokFunc = new Func<XNode, string>(node =>
            {
                // Read the text node's Value; ToString() would XML-escape characters such as & and <
                var s = ((XText)node).Value.Replace(".", " . ").Replace(",", " , ");
                var nodeName = node.Parent != null && 
                    node.Parent.NodeType == XmlNodeType.Element && 
                    node.Parent.Name.LocalName.ToUpper() != "ENTRY"
                                   ? node.Parent.Name.LocalName
                                   : "";
                var sb = new StringBuilder();

                s.Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries).ToList().ForEach(e => sb.AppendFormat("{0}\t{1}\n", e, nodeName));
                return sb.ToString();
            });


        elem.DescendantNodes().Where(e => e.NodeType == XmlNodeType.Text).ToList()
            .ForEach(c => Console.Write(tokFunc(c)));


    }
}

This produces the desired output:

C       AUTHOR
.       AUTHOR
Qiao    AUTHOR
and
R       AUTHOR
.       AUTHOR
Melhem  AUTHOR
,
"
Reducing        TITLE
Communication   TITLE
"
,
1995    DATE
.
answered 2012-12-03T07:11:36.857