
Is there some way I can combine two XmlDocuments without holding the first in memory?

I have to cycle through a list of up to a hundred large (~300MB) XML files, appending up to 1000 nodes to each, and repeat the whole process several times (the new node list is cleared between passes to save memory). Currently I load each whole XmlDocument into memory before appending the new nodes, which is not tenable.

What would you say is the best way to go about this? I have a few ideas but I'm not sure which is best:

  1. Never load the whole XmlDocument, instead using XmlReader and XmlWriter simultaneously to write to a temp file which is subsequently renamed.
  2. Make an XmlDocument for the new nodes only, and then manually write it to the existing file (i.e. file.WriteLine( "<node>\n" ))
  3. Something else?

Any help will be much appreciated.

Edit: Some more details in answer to some of the comments:

The program parses several large logs into XML, grouping the output into different files by source. Once the XML is written, a lightweight proprietary reader program produces reports on the data. The program only needs to run once a day, so it can be slow, but it runs on a server that performs other tasks, mainly file compression and transfer, which must not be affected too much.

A database would probably be easier, but the company isn't going to do this any time soon!

As is, the program runs on the dev machine using a few GB of memory at the most, but throws out-of-memory exceptions when run on the server.

Final Edit: The task is quite low-priority, which is why getting a database would only be an extra cost (though I will look into Mongo).

The file will only be appended to, and won't grow indefinitely - each final file is only for a day's worth of the log, and then new files are generated the following day.

I'll probably use the XmlReader/Writer method since it will be easiest to ensure XML validity, but I have taken all your comments/answers into consideration. I know that having XML files this large is not a particularly good solution, but it's what I'm limited to, so thanks for all the help given.


1 Answer


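If you want to be completely certain of the XML structure, using XmlWriter and XmlReader is the best way to go.

However, for the highest possible performance, you may be able to do the same job quickly with plain string functions. This works, but you lose the ability to verify the XML structure - if one file contains an error, you won't be able to correct it: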
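The XmlReader/XmlWriter route could look roughly like the sketch below. It is only an illustration under several assumptions: the names `XmlAppendSketch`, `AppendToRoot`, and the `.tmp` suffix are hypothetical, the new nodes arrive as an XML fragment string, and they are meant to become the last children of the root element. Anything outside the root element (comments, processing instructions, a DOCTYPE) is not preserved here.

```csharp
using System;
using System.IO;
using System.Xml;

static class XmlAppendSketch
{
    // Streams the file at 'path' into a temp file, writes 'newNodesXml'
    // (a fragment of sibling elements) just before the root's closing tag,
    // then replaces the original file with the temp file.
    public static void AppendToRoot(string path, string newNodesXml)
    {
        string tempPath = path + ".tmp"; // hypothetical naming choice
        var fragment = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };

        using (XmlReader reader = XmlReader.Create(path))
        using (XmlWriter writer = XmlWriter.Create(tempPath))
        {
            reader.MoveToContent();               // skip the prolog, land on the root element
            bool rootIsEmpty = reader.IsEmptyElement;

            writer.WriteStartDocument();
            writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
            writer.WriteAttributes(reader, false);

            if (!rootIsEmpty)
            {
                reader.Read();                    // first child, or the root's end tag
                // WriteNode copies one child subtree and advances to the next sibling,
                // so only one subtree is being processed at any time.
                while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
                {
                    writer.WriteNode(reader, false);
                }
            }

            // A reader in its initial state is copied in full by WriteNode.
            using (XmlReader extra = XmlReader.Create(new StringReader(newNodesXml), fragment))
            {
                writer.WriteNode(extra, false);
            }

            writer.WriteEndElement();
            writer.WriteEndDocument();
        }

        File.Replace(tempPath, path, null);       // swap in the new file, no backup copy
    }
}
```

Because the writer re-serializes every node, a malformed input file fails with an XmlException instead of silently producing broken output, which is the validity guarantee this approach buys at the cost of speed.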

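using System;
using System.IO;

// 'files' (the input paths) and 'max_performance' are assumed to be defined by the caller.
using (StreamWriter sw = new StreamWriter("out.xml")) {
    foreach (string filename in files) {
        // Note: filename is written unescaped, so it must not contain ", < or &
        sw.Write(String.Format(@"<inputfile name=""{0}"">", filename));
        using (StreamReader sr = new StreamReader(filename)) {
            if (max_performance) {
                // .NET 4's CopyTo() exists on Stream, not StreamReader, so copy
                // in fixed-size chunks; alternatively try http://bit.ly/RiovFX
                char[] buffer = new char[81920];
                int read;
                while ((read = sr.Read(buffer, 0, buffer.Length)) > 0) {
                    sw.Write(buffer, 0, read);
                }
            } else {
                string line;
                while ((line = sr.ReadLine()) != null) {
                    // parse the line and make any modifications you want
                    sw.Write(line);
                    sw.Write("\n");
                }
            }
        }
        sw.Write("</inputfile>");
    }
}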

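Depending on how your input XML files are structured, you might choose to remove the XML headers, possibly the document element, or a few other unnecessary structures. You can do that by parsing the file line by line.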
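As a rough illustration of that line-by-line filtering, the sketch below drops the XML declaration while copying. The names `HeaderStripSketch` and `CopyWithoutXmlDeclaration` are hypothetical, and it assumes the declaration sits on a line of its own, which may not hold for every input file.

```csharp
using System;
using System.IO;

static class HeaderStripSketch
{
    // Copies 'reader' to 'writer' line by line, skipping any line that
    // consists solely of an XML declaration such as <?xml version="1.0"?>.
    // Assumption: the declaration occupies a whole line by itself.
    public static void CopyWithoutXmlDeclaration(TextReader reader, TextWriter writer)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string trimmed = line.Trim();
            bool isDeclaration =
                trimmed.StartsWith("<?xml ", StringComparison.Ordinal) &&
                trimmed.EndsWith("?>", StringComparison.Ordinal);
            if (isDeclaration)
            {
                continue; // drop the header, keep everything else
            }
            writer.Write(line);
            writer.Write("\n");
        }
    }
}
```

The same loop shape would fit the non-`max_performance` branch of the code above, with `sr` as the reader and `sw` as the writer. Removing the document element as well would need a similar check for its opening and closing tags.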

answered 2012-08-03T18:23:46.117