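I'm processing XML files from a third party. These files occasionally contain invalid characters, which causes XmlTextReader.Read() to throw an exception.

I'm currently handling this with the following function: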

XmlTextReader GetCharSafeXMLTextReader(string fileName)
{
    try
    {
        MemoryStream ms = new MemoryStream();
        StreamReader sr = new StreamReader(fileName);
        StreamWriter sw = new StreamWriter(ms);
        string temp;
        while ((temp = sr.ReadLine()) != null)
            sw.WriteLine(temp.Replace(((char)4).ToString(), "").Replace(((char)0x14).ToString(), ""));

        sw.Flush();
        sr.Close();
        ms.Seek(0, SeekOrigin.Begin);
        return new XmlTextReader(ms);
    }
    catch (Exception exp)
    {
        // pass exp itself as the inner exception so the original stack trace is kept
        throw new Exception("Error parsing file: " + fileName + " " + exp.Message, exp);
    }
}

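My gut feeling is that there should be a better/faster way to do this. (Yes, getting the third party to fix their XML would be great, but that hasn't happened so far.)

Edit: here is the final solution, based on cfeduke's answer: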


    public class SanitizedStreamReader : StreamReader
    {
        public SanitizedStreamReader(string filename) : base(filename) { }
        /* other ctors as needed */
        // this is the only one that XmlTextReader appears to use but
        // it is unclear from the documentation which methods call each other
        // so best bet is to override all of the Read* methods and Peek
        public override string ReadLine()
        {
            return Sanitize(base.ReadLine());
        }

        public override int Read()
        {
            int temp = base.Read();
            while (temp == 0x4 || temp == 0x14)
                temp = base.Read();
            return temp;
        }

        public override int Peek()
        {
            int temp = base.Peek();
            while (temp == 0x4 || temp == 0x14)
            {
                temp = base.Read();
                temp = base.Peek();
            }
            return temp;
        }

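        public override int Read(char[] buffer, int index, int count)
        {
            int kept = 0;
            // keep reading until something survives the filter: returning 0
            // would be taken by the caller as end of stream
            while (kept == 0)
            {
                int read = base.Read(buffer, index, count);
                if (read <= 0)
                    return read;
                // compact only the characters just read, dropping the unwanted ones
                for (int x = index; x < index + read; x++)
                {
                    if (buffer[x] != 0x4 && buffer[x] != 0x14)
                        buffer[index + kept++] = buffer[x];
                }
            }
            return kept; // number of characters left after filtering
        }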

        private static string Sanitize(string unclean)
        {
            if (unclean == null)
                return null;
            if (String.IsNullOrEmpty(unclean))
                return "";
            return unclean.Replace(((char)4).ToString(), "").Replace(((char)0x14).ToString(), "");
        }
    }

2 Answers

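Sanitizing the data is important. Sometimes edge cases, such as invalid characters in "XML", do happen. Your solution is correct. If you want a solution that fits the .NET framework's approach to stream processing, restructure your code into its own stream: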

public class SanitizedStreamReader : StreamReader {
  public SanitizedStreamReader(string filename) : base(filename) { }
  /* other ctors as needed */

  // it is unclear from the documentation which methods call each other
  // so best bet is to override all of the Read* methods and Peek
  public override string ReadLine() {
    return Sanitize(base.ReadLine());
  }

  // TODO override Read*, Peek with a similar logic as this.ReadLine()
  // remember Read(Char[], Int32, Int32) to modify the return value by
  // the number of removed characters

  private static string Sanitize(string unclean) {
    // pass null through: ReadLine() returns null at end of stream
    if (unclean == null)
      return null;
    return unclean.Replace(((char)4).ToString(), "").Replace(((char)0x14).ToString(), "");
  }
}
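
The TODO about Read(Char[], Int32, Int32) could be approached roughly as follows. This is only a sketch: BufferSanitizer and Compact are hypothetical names, not framework APIs, and it assumes the same two characters (0x04 and 0x14) are the only ones to drop.

```csharp
public static class BufferSanitizer
{
    // Hypothetical helper, not part of the .NET framework.
    // Removes 0x04 and 0x14 from buffer[index .. index + count) by moving the
    // surviving characters to the front of that range, and returns how many
    // characters remain.
    public static int Compact(char[] buffer, int index, int count)
    {
        int kept = 0;
        for (int i = index; i < index + count; i++)
        {
            char c = buffer[i];
            if (c != (char)0x4 && c != (char)0x14)
                buffer[index + kept++] = c;
        }
        return kept;
    }
}
```

An override would pass the value returned by base.Read(buffer, index, count) as count and return the result of Compact. One thing to watch: if base.Read returned a positive number but every character was removed, the override should read again rather than return 0, because callers treat 0 as end of stream.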

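With this new SanitizedStreamReader you can chain it into your stream processing wherever needed, instead of relying on a magic method that cleans things up and hands you an XmlTextReader: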

return new XmlTextReader(new SanitizedStreamReader("filename.xml"));

Admittedly this may be more work than necessary, but you gain flexibility from this approach.
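
If other invalid characters turn up besides 0x04 and 0x14, the filter could be generalized instead of listing each one. The sketch below assumes .NET 4.0 or later, where XmlConvert.IsXmlChar is available; XmlCharFilter and StripInvalid are hypothetical names used only for illustration.

```csharp
using System.Text;
using System.Xml;

public static class XmlCharFilter
{
    // Hypothetical helper: keeps only characters allowed by XML 1.0.
    public static string StripInvalid(string text)
    {
        if (text == null)
            return null;

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            else if (i + 1 < text.Length && char.IsSurrogatePair(c, text[i + 1]))
            {
                // IsXmlChar rejects lone surrogates, so keep valid pairs explicitly
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
        }
        return sb.ToString();
    }
}
```

The same check could be used inside the Read and Peek overrides in place of the comparisons against 0x4 and 0x14.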

answered 2013-01-09T17:10:09.423

XML issues aside, if the files aren't large enough to warrant sequential processing, I'd simplify the code to the following lines:

var xml = File.ReadAllText(pathName);
var fixedXml = xml.Replace(((char)4).ToString(), "").Replace(((char)0x14).ToString(), "");
File.WriteAllText(pathName, fixedXml);
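
If overwriting the third party's file is not desirable, the cleaned text can stay in memory and be handed straight to the reader. A minimal sketch under that assumption; InMemoryCleaner, Strip and Open are hypothetical names:

```csharp
using System.IO;
using System.Xml;

public static class InMemoryCleaner
{
    // Hypothetical helper: strips the same two characters as the lines above.
    public static string Strip(string xml)
    {
        return xml.Replace(((char)4).ToString(), "").Replace(((char)0x14).ToString(), "");
    }

    // Reads the whole file, cleans it, and parses from the string
    // without writing anything back to disk.
    public static XmlTextReader Open(string pathName)
    {
        return new XmlTextReader(new StringReader(Strip(File.ReadAllText(pathName))));
    }
}
```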
answered 2013-01-09T17:12:22.077