Edit: See my solution below.

I had the following problem to solve: we receive files (mostly address information) from different sources. These can use either the Windows standard CR/LF ("\r\n") or the UNIX LF ("\n") as the line break.

When reading text with the StreamReader.ReadLine() method this is no problem, because it handles both cases equally.

The problem occurs when there is a CR or an LF somewhere in the file that is not supposed to be there. This happens, for example, if you export an Excel file whose cells contain line breaks to .csv or another flat-file format.

Now you have a file with, for example, the following structure:

FirstName;LastName;Street;HouseNumber;PostalCode;City;Country'\r''\n'
Jane;Doe;co James Doe'\n'TestStreet;5;TestCity;TestCountry'\r''\n'
John;Hancock;Teststreet;1;4586;TestCity;TestCounty'\r''\n'

The StreamReader.ReadLine() method reads the first line as:

FirstName;LastName;Street;HouseNumber;PostalCode;City;Country

Which is fine, but the second line will be:

Jane;Doe;co James Doe

This will either break your code or give you false results, because the following line will be:

TestStreet;5;TestCity;TestCountry
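
The split can be reproduced without a file. The following is only a minimal sketch: it wraps a string in a MemoryStream and reads it with StreamReader.ReadLine(). The class name ReadLineDemo and the method ReadAllLines are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ReadLineDemo
{
    // Collects whatever StreamReader.ReadLine() returns for the given text.
    // ReadLine() ends a line at "\r\n", at a lone "\r" and at a lone "\n".
    public static List<string> ReadAllLines(string text)
    {
        var lines = new List<string>();
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(text)))
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }
}
```

Calling ReadLineDemo.ReadAllLines("Jane;Doe;co James Doe\nTestStreet;5;TestCity;TestCountry\r\n") returns two entries instead of one, because the loose '\n' is treated as a line break.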

So we usually ran the file through a tool that checks for loose '\n' or '\r' characters and deletes them.

But this step is easy to forget, so I tried to implement a ReadLine() method of my own. The requirement was that it can use one or two line-break characters, and that those characters can be defined freely by the consuming logic.

This is the class that I came up with:
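
Such a cleanup step could look roughly like the sketch below. It is not the actual tool. It assumes that CR/LF is the only legitimate line break and that every lone CR or LF is simply dropped; the names LineBreakCleaner and RemoveLooseLineBreaks are hypothetical.

```csharp
using System;
using System.Text;

public static class LineBreakCleaner
{
    // Keeps every "\r\n" pair and removes any '\r' or '\n' that stands alone.
    public static string RemoveLooseLineBreaks(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c == '\r' && i + 1 < text.Length && text[i + 1] == '\n')
            {
                sb.Append("\r\n");
                i++; // the '\n' of the pair has been handled
            }
            else if (c != '\r' && c != '\n')
            {
                sb.Append(c);
            }
            // a lone '\r' or '\n' falls through and is dropped
        }
        return sb.ToString();
    }
}
```

Note that dropping the character joins the two halves of the cell without a separator; replacing it with a space instead would be an equally simple variation.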

using System.IO;
using System.Text;

public class ReadFile
{
    private FileStream file;
    private StreamReader reader;

    private string fileLocation;
    private Encoding fileEncoding;
    private char lineBreak1;
    private char lineBreak2;
    private bool useSeccondLineBreak;

    private bool streamCreated = false;

    private bool endOfStream;

    public bool EndOfStream
    {
        get { return endOfStream; }
        set { endOfStream = value; }
    }

    public ReadFile(string FileLocation, Encoding FileEncoding, char LineBreak1, char LineBreak2, bool UseSeccondLineBreak)
    {
        fileLocation = FileLocation;
        fileEncoding = FileEncoding;
        lineBreak1 = LineBreak1;
        lineBreak2 = LineBreak2;
        useSeccondLineBreak = UseSeccondLineBreak;
    }

    public string ReadLine()
    {
        if (streamCreated == false)
        {
            file = new FileStream(fileLocation, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
            reader = new StreamReader(file, fileEncoding);

            streamCreated = true;
        }

        StringBuilder builder = new StringBuilder();
        char[] buffer = new char[1];
        char lastChar = new char();
        char currentChar = new char();

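        bool first = true;
        bool lineBreakFound = false;
        while (reader.EndOfStream != true)
        {
            if (useSeccondLineBreak == true)
            {
                reader.Read(buffer, 0, 1);
                lastChar = currentChar;

                if (currentChar == lineBreak1 && buffer[0] == lineBreak2)
                {
                    lineBreakFound = true;
                    break;
                }
                else
                {
                    currentChar = buffer[0];
                }

                // Appending lags one character behind, so that lineBreak1
                // can still be discarded when lineBreak2 follows it.
                if (first == false)
                {
                    builder.Append(lastChar);
                }
                else
                {
                    first = false;
                }
            }
            else
            {
                reader.Read(buffer, 0, 1);

                if (buffer[0] == lineBreak1)
                {
                    break;
                }
                else
                {
                    currentChar = buffer[0];
                }

                builder.Append(currentChar);
            }
        }

        // If the stream ended without a final line break, the last character
        // read in two-character mode has not been appended yet.
        if (useSeccondLineBreak == true && lineBreakFound == false && first == false)
        {
            builder.Append(currentChar);
        }

        if (reader.EndOfStream == true)
        {
            EndOfStream = true;
        }

        return builder.ToString();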
    }

    public void Close()
    {
        if (streamCreated == true)
        {
            reader.Close();
            file.Close();
        }
    }
}

This code works fine and does what it is supposed to do, but compared to the original StreamReader.ReadLine() method it is ~3 times slower. As we work with large row counts, the difference is not only measurable but also shows in real-world performance: for 700'000 rows it takes ~5 seconds to read all lines, extract a chunk and write it to a new file using StreamReader.ReadLine(), while with my method it takes ~15 seconds on my system.

I tried different approaches with bigger buffers, but so far I wasn't able to increase performance.

What I would be interested in: any suggestions on how I could improve the performance of this code to get closer to the original performance of StreamReader.ReadLine()?

Solution:

This now takes ~6 seconds (compared to ~5 seconds using the default StreamReader.ReadLine()) for 700'000 rows to do the same things as the code above.

Thanks to Jim Mischel for pointing me in the right direction!

public class ReadFile
    {
        private FileStream file;
        private StreamReader reader;

        private string fileLocation;
        private Encoding fileEncoding;
        private char lineBreak1;
        private char lineBreak2;
        private bool useSeccondLineBreak;

        const int BufferSize = 8192;
        int bufferedCount;
        char[] rest = new char[BufferSize];
        int position = 0;

        char lastChar;
        bool useLastChar;

        private bool streamCreated = false;

        private bool endOfStream;

        public bool EndOfStream
        {
            get { return endOfStream; }
            set { endOfStream = value; }
        }

        public ReadFile(string FileLocation, Encoding FileEncoding, char LineBreak1, char LineBreak2, bool UseSeccondLineBreak)
        {
            fileLocation = FileLocation;
            fileEncoding = FileEncoding;
            lineBreak1 = LineBreak1;
            lineBreak2 = LineBreak2;
            useSeccondLineBreak = UseSeccondLineBreak;
        }
 
        private int readInBuffer()
        {
            return reader.Read(rest, 0, BufferSize);
        }

        public string ReadLine()
        {
            StringBuilder builder = new StringBuilder();
            bool lineFound = false;

            if (streamCreated == false)
            {
                file = new FileStream(fileLocation, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, BufferSize);

                reader = new StreamReader(file, fileEncoding);

                streamCreated = true;

                bufferedCount = readInBuffer();
            }

            while (lineFound == false && EndOfStream != true)
            {
                if (position < bufferedCount)
                {
                    // Scan only the characters that were actually read,
                    // not the whole array (the last block is usually shorter).
                    for (int i = position; i < bufferedCount; i++)
                    {
                        if (useLastChar == true)
                        {
                            // The previous block ended with lineBreak1.
                            useLastChar = false;

                            if (rest[i] == lineBreak2)
                            {
                                position = i + 1;
                                lineFound = true;
                                break;
                            }
                            else
                            {
                                builder.Append(lastChar);
                            }
                        }

                        if (rest[i] == lineBreak1)
                        {
                            if (useSeccondLineBreak == true)
                            {
                                if (i + 1 < bufferedCount)
                                {
                                    if (rest[i + 1] == lineBreak2)
                                    {
                                        position = i + 2;
                                        lineFound = true;
                                        break;
                                    }
                                    else
                                    {
                                        builder.Append(rest[i]);
                                    }
                                }
                                else
                                {
                                    // lineBreak1 is the last character of this block:
                                    // decide once the next block has been read.
                                    useLastChar = true;
                                    lastChar = rest[i];
                                }
                            }
                            else
                            {
                                position = i + 1;
                                lineFound = true;
                                break;
                            }
                        }
                        else
                        {
                            builder.Append(rest[i]);
                        }

                        position = i + 1;
                    }
                }
                else
                {
                    bufferedCount = readInBuffer();
                    position = 0;

                    if (bufferedCount == 0)
                    {
                        // Nothing left to read. Keep a trailing lineBreak1
                        // that was never followed by lineBreak2.
                        if (useLastChar == true)
                        {
                            builder.Append(lastChar);
                            useLastChar = false;
                        }

                        EndOfStream = true;
                    }
                }
            }

            if (reader.EndOfStream == true && position == bufferedCount)
            {
                EndOfStream = true;
            }

            return builder.ToString();
        }


        public void Close()
        {
            if (streamCreated == true)
            {
                reader.Close();
                file.Close();
            }
        }
    }
1 Answer

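The way to speed this up is to have it read more than one character at a time. For example, create a 4 KB buffer, read data into that buffer, and then go through it character by character. If you copy character by character into a StringBuilder, it's pretty easy.

The code below shows how to parse out lines in a loop. You'll have to split it up so that it can maintain state between calls, but it should give you the idea.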

const int BufferSize = 4096;
const string newline = "\r\n";

using (var strm = new StreamReader(....))
{
    int newlineIndex = 0;
    var buffer = new char[BufferSize];
    StringBuilder sb = new StringBuilder();
    int charsInBuffer = 0;
    int bufferIndex = 0;

    while (!(strm.EndOfStream && bufferIndex >= charsInBuffer))
    {
        if (bufferIndex >= charsInBuffer)
        {
            // buffer is empty or used up: read the next block
            charsInBuffer = strm.Read(buffer, 0, buffer.Length);
            if (charsInBuffer == 0)
            {
                // nothing read. Must be at end of stream.
                break;
            }
            bufferIndex = 0;
        }
        if (newlineIndex > 0 && buffer[bufferIndex] != newline[newlineIndex])
        {
            // partial match failed: copy matched newline characters
            sb.Append(newline.Substring(0, newlineIndex));
            newlineIndex = 0;
        }
        if (buffer[bufferIndex] == newline[newlineIndex])
        {
            ++newlineIndex;
            if (newlineIndex == newline.Length)
            {
                // found a line
                Console.WriteLine(sb.ToString());
                newlineIndex = 0;
                sb = new StringBuilder();
            }
        }
        else
        {
            sb.Append(buffer[bufferIndex]);
        }
        ++bufferIndex;
    }
    // Might be a line left, without a newline
    if (newlineIndex > 0)
    {
        sb.Append(newline.Substring(0, newlineIndex));
    }
    if (sb.Length > 0)
    {
        Console.WriteLine(sb.ToString());
    }
}

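You can optimize this by keeping track of the starting position so that, when you find a line, you create a string from buffer[start] to buffer[current] without creating a StringBuilder. Instead, you call the String(char[], int32, int32) constructor. That's a little trickier to handle when you cross a buffer boundary. You'd probably want to handle crossing the buffer boundary as a special case and use a StringBuilder for temporary storage in that case.

I wouldn't bother with that optimization, though, until I had the first version working.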
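
For illustration, the start-position idea might be sketched as follows. This is only one possible shape under stated assumptions, not a tuned implementation: the line break is fixed to "\r\n", the class name BufferedLineSplitter and its ReadLines method are invented for this example, and all lines are collected into a list instead of being returned one per call.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class BufferedLineSplitter
{
    // Splits on "\r\n" only; a lone '\r' or '\n' stays inside the line.
    public static List<string> ReadLines(TextReader reader, int bufferSize)
    {
        var lines = new List<string>();
        var buffer = new char[bufferSize];
        var carry = new StringBuilder(); // text of a line that spans blocks
        bool pendingCr = false;          // previous block ended with '\r'
        int count;

        while ((count = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            int start = 0; // index where the current line starts in this block
            int i = 0;

            if (pendingCr)
            {
                pendingCr = false;
                if (buffer[0] == '\n')
                {
                    // "\r\n" was split across two blocks
                    lines.Add(carry.ToString());
                    carry.Clear();
                    start = 1;
                    i = 1;
                }
                else
                {
                    carry.Append('\r'); // the '\r' was ordinary data
                }
            }

            for (; i < count; i++)
            {
                if (buffer[i] != '\r') continue;

                if (i + 1 == count)
                {
                    pendingCr = true; // decide when the next block arrives
                    break;
                }

                if (buffer[i + 1] == '\n')
                {
                    if (carry.Length == 0)
                    {
                        // fast path: the line lies completely inside this block
                        lines.Add(new string(buffer, start, i - start));
                    }
                    else
                    {
                        // special case: the line started in an earlier block
                        carry.Append(buffer, start, i - start);
                        lines.Add(carry.ToString());
                        carry.Clear();
                    }
                    i++;           // skip the '\n'
                    start = i + 1; // next line starts after the line break
                }
            }

            // Keep the unfinished tail (without a pending '\r') for the next block.
            int end = pendingCr ? count - 1 : count;
            carry.Append(buffer, start, end - start);
        }

        if (pendingCr) carry.Append('\r');
        if (carry.Length > 0) lines.Add(carry.ToString()); // last line without line break
        return lines;
    }
}
```

With a deliberately tiny buffer, for example BufferedLineSplitter.ReadLines(new StringReader("ab\r\ncd\nef\r\ngh"), 3), every line crosses a block boundary and goes through the StringBuilder path; with a realistic size such as 4096 most lines take the fast path. Both calls return the same three lines, and the loose '\n' stays inside the second one.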

Answered 2013-08-01T14:21:14.640