c# - string.split() "Out of memory exception" when reading a tab-separated file

Question

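I am using string.split() in my C# code to read a tab-separated file. I am getting an "OutOfMemory exception" at the line noted in the code sample.

What I would like to know is why this problem occurs for a file that is only 16 MB in size.

Is this the right approach or not?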

using (StreamReader reader = new StreamReader(_path))
{
  //...........Load the first line of the file................
  string headerLine = reader.ReadLine();

  MeterDataIPValueList objMeterDataList = new MeterDataIPValueList();
  string[] seperator = new string[1];   // used to separate the lines of the file

  seperator[0] = "\r\n";
  //.............Load Records of file into string array and remove all empty lines of file.................
  string[] line = reader.ReadToEnd().Split(seperator, StringSplitOptions.RemoveEmptyEntries);
  int noOfLines = line.Count();
  if (noOfLines == 0)
  {
    mFileValidationErrors.Append(ConstMsgStrings.headerOnly + Environment.NewLine);
  }
  //...............If file contains records also with header line..............
  else
  {
    string[] headers = headerLine.Split('\t');
    int noOfColumns = headers.Count();

    //.........Create table structure.............
    objValidateRecordsTable.Columns.Add("SerialNo");
    objValidateRecordsTable.Columns.Add("SurveyDate");
    objValidateRecordsTable.Columns.Add("Interval");
    objValidateRecordsTable.Columns.Add("Status");
    objValidateRecordsTable.Columns.Add("Consumption");

    //........Fill objValidateRecordsTable table by string array contents ............

    int recordNumber;  // used for log
    #region ..............Fill objValidateRecordsTable.....................
    seperator[0] = "\t";
    for (int lineNo = 0; lineNo < noOfLines; lineNo++)
    {
      recordNumber = lineNo + 1;
      string[] recordFields = line[lineNo].Split(seperator, StringSplitOptions.RemoveEmptyEntries); // The exception is thrown here, when splitting the columns
      if (recordFields.Count() == noOfColumns)
      {
        //Do processing
      }
      // ... (the rest of the loop body is not shown in the original post)
    }
    #endregion
  }
}

score 13 · Accepted Answer

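Split is implemented poorly and has serious performance problems when applied to large strings. See this article for details on the memory requirements of the Split function:

What happens when you split a string containing 1355049 comma-separated strings of 16 characters each, with a total character length of 25745930?

An array of pointers to the string objects: contiguous virtual address space of 4 (address pointer) * 1355049 = 5420196 (array size) + 16 (for bookkeeping) = 5420212 bytes.

Non-contiguous virtual address space for the 1355049 strings, 54 bytes each. This does not mean that all 1.3 million strings will be scattered across the whole heap, but they will not be allocated on the LOH. The GC will allocate them in bunches on the Gen0 heap.

The Split function will create an internal System.Int32[] array of size 25745930, consuming 102983736 bytes (~98 MB) of the LOH, which is very expensive.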
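The arithmetic in the quoted passage can be reproduced with a small sketch. The class and method names below are hypothetical, and the constants (4-byte references, 16 bytes of array bookkeeping) are simply the figures the quoted text assumes for a 32-bit process; they are not measured values.

```csharp
using System;

// Hypothetical helper that reproduces the estimates from the quoted article.
public static class SplitMemoryEstimate
{
    // Assumptions taken from the quoted text: 4-byte references (32-bit process)
    // and 16 bytes of bookkeeping per array.
    private const long PointerSize = 4;
    private const long ArrayOverhead = 16;

    // Size of the string[] returned by Split: one reference per substring.
    public static long ResultArrayBytes(long substringCount)
    {
        return PointerSize * substringCount + ArrayOverhead;
    }

    // Size of the internal int[] the quoted text describes: one int per input character.
    public static long SeparatorIndexArrayBytes(long totalChars)
    {
        return sizeof(int) * totalChars + ArrayOverhead;
    }

    // Total size of the substrings themselves, given a per-string size in bytes.
    public static long SubstringBytes(long substringCount, long bytesPerString)
    {
        return substringCount * bytesPerString;
    }
}
```

With the numbers from the passage, `ResultArrayBytes(1355049)` gives 5420212 and `SeparatorIndexArrayBytes(25745930)` gives 102983736, matching the quoted figures, while `SubstringBytes(1355049, 54)` comes to 73172646 bytes (about 70 MB) for the substrings.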

score 11 · Accepted Answer

Try not to read the entire file into an array first with "reader.ReadToEnd()". Read the file directly, line by line:

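    using (StreamReader sr = new StreamReader(this._path))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            // Split only the current line, so one record at a time is held in memory.
            string[] cells = line.Split(new string[] { "\t" }, StringSplitOptions.None);
            if (cells.Length > 0)
            {
                // Process the cells of this record here.
            }
        }
    }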
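Applied to the file layout in the question (a header line followed by records), the line-by-line approach might look like the sketch below. `TabFileReader` and `ReadRecords` are hypothetical names, and the sketch assumes that a record is valid only when it has as many tab-separated fields as the header. Unlike the question's code, it keeps empty fields instead of removing them.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: streams records one line at a time instead of calling ReadToEnd().
public static class TabFileReader
{
    // Lazily yields the fields of each record after the header line.
    // Empty lines and records whose field count differs from the header are skipped.
    public static IEnumerable<string[]> ReadRecords(TextReader reader)
    {
        string headerLine = reader.ReadLine();
        if (headerLine == null)
            yield break; // empty input: no header, no records

        int columnCount = headerLine.Split('\t').Length;

        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0)
                continue; // skip empty lines

            string[] fields = line.Split('\t');
            if (fields.Length == columnCount)
                yield return fields;
        }
    }
}
```

A caller would wrap a `StreamReader` in a `using` block and iterate with `foreach (string[] fields in TabFileReader.ReadRecords(reader))`. Because the method is an iterator, only the current line and its fields are alive at any time.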

score 5 · Accepted Answer

I use my own implementation. It is covered by 10 unit tests.

public static class StringExtensions
{

    // The string.Split() method from .NET tends to run out of memory on 80 MB strings.
    // This has been reported in several places online.
    // This version is fast and memory efficient, and returns no empty or whitespace-only entries.
    public static List<string> LowMemSplit(this string s, string seperator)
    {
        List<string> list = new List<string>();
        int lastPos = 0;
        int pos = s.IndexOf(seperator);
        while (pos > -1)
        {
            while(pos == lastPos)
            {
                lastPos += seperator.Length;
                pos = s.IndexOf(seperator, lastPos);
                if (pos == -1)
                    return list;
            }

            string tmp = s.Substring(lastPos, pos - lastPos);
            if(tmp.Trim().Length > 0)
                list.Add(tmp);
            lastPos = pos + seperator.Length;
            pos = s.IndexOf(seperator, lastPos);
        }

        if (lastPos < s.Length)
        {
            string tmp = s.Substring(lastPos, s.Length - lastPos);
            if (tmp.Trim().Length > 0)
                list.Add(tmp);
        }

        return list;
    }
}

score 4 · Accepted Answer

I would recommend reading line-by-line if you can, but sometimes splitting by new lines is not the requirement.

So you can always write your own memory efficient split. This solved the problem for me.

    private static IEnumerable<string> CustomSplit(string newtext, char splitChar)
    {
        var result = new List<string>();
        var sb = new StringBuilder();
        foreach (var c in newtext)
        {
            if (c == splitChar)
            {
                if (sb.Length > 0)
                {
                    result.Add(sb.ToString());
                    sb.Clear();
                }
                continue;
            }
            sb.Append(c);
        }
        if (sb.Length > 0)
        {
            result.Add(sb.ToString());
        }
        return result;
    }
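A variation on the same idea, not taken from the answer above, is to avoid building the result list at all and yield each piece as it is found. The sketch below swaps the `StringBuilder` loop for `IndexOf` plus `Substring` and uses an iterator; `LazySplitExtensions` and `LazySplit` are hypothetical names. Like `CustomSplit`, it skips empty entries.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical lazy variant: yields pieces one by one instead of collecting them in a list.
public static class LazySplitExtensions
{
    public static IEnumerable<string> LazySplit(this string text, char separator)
    {
        // Validate eagerly; the iterator body below only runs once enumeration starts.
        if (text == null) throw new ArgumentNullException(nameof(text));
        return Iterate(text, separator);
    }

    private static IEnumerable<string> Iterate(string text, char separator)
    {
        int start = 0;
        while (start <= text.Length)
        {
            // Find the next separator, or treat the end of the string as the final boundary.
            int end = text.IndexOf(separator, start);
            if (end < 0) end = text.Length;

            // Skip empty entries (adjacent, leading, or trailing separators).
            if (end > start)
                yield return text.Substring(start, end - start);

            start = end + 1;
        }
    }
}
```

For example, `foreach (string field in line.LazySplit('\t'))` visits the fields without allocating an intermediate array or list. The input string itself still has to fit in memory, so this complements reading line by line rather than replacing it.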

score 1 · Accepted Answer


Try reading the file line by line instead of splitting the whole content.

Answered 2009-09-10T10:16:11.143
