c# - Efficiently parsing a StreamReader with regular expressions

Question

I have these variables:

    StreamReader DebugInfo = GetDebugInfo();
    var text = DebugInfo.ReadToEnd();  // takes 10 seconds!!! because there are a lot of students

The text looks like this:

<student>
    <firstName>Antonio</firstName>
    <lastName>Namnum</lastName>
</student>
<student>
    <firstName>Alicia</firstName>
    <lastName>Garcia</lastName>
</student>
<student>
    <firstName>Christina</firstName>
    <lastName>SomeLattName</lastName>
</student>
... etc
.... many more students

What I am doing right now is:

  StreamReader DebugInfo = GetDebugInfo();
  var text = DebugInfo.ReadToEnd(); // takes 10 seconds!!!

  var mtch = Regex.Match(text , @"(?s)<student>.+?</student>");
  // keep parsing the file while there are more students
  while (mtch.Success)
  {
     AddStudent(mtch.Value); // parse text node into object and add it to corresponding node
     mtch = mtch.NextMatch();
  }

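The whole process takes about 25 seconds. Converting the StreamReader to text (var text = DebugInfo.ReadToEnd();) takes 10 seconds. The other part takes about 15 seconds. I was hoping I could do both parts at the same time...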

Edit

I would like something like this:

    const int bufferSize = 1024;

    var sb = new StringBuilder();

    Task.Factory.StartNew(() =>
    {
         Char[] buffer = new Char[bufferSize];
         int count = bufferSize;

         using (StreamReader sr = GetUnparsedDebugInfo())
         {

             while (count > 0)
             {
                 count = sr.Read(buffer, 0, bufferSize);
                 sb.Append(buffer, 0, count);
             }
         }

         var m = sb.ToString();
     });

     Thread.Sleep(100);

     // while the string is still being built, start adding items

     var mtch = Regex.Match(sb.ToString(), @"(?s)<student>.+?</student>"); 

     // keep parsing the file while there are more nodes
     while (mtch.Success)
     {
         AddStudent(mtch.Value);
         mtch = mtch.NextMatch();
     }

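Edit 2

Summary

Sorry, I forgot to mention: the text is very similar to XML, but it is not XML. That is why I have to use regular expressions. In short, I think I can save time, because what I am doing now is converting the stream to a string and then parsing the string. Why not parse the stream directly with a regex? Or, if that is not possible, why not take a chunk of the stream and parse that chunk on a separate thread?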

score 2 · Accepted Answer

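Update:

This basic code reads a (roughly) 20 megabyte file in 0.75 seconds. At that rate my machine should process roughly 53.33 megabytes in the 2 seconds you mention. Further, 20,000,000 / 2,048 = 9,765.625 reads, and 0.75 / 9,765.625 = 0.0000768. That means you are reading 2,048 characters roughly every 76.8 microseconds (7.68 x 10^-5 seconds). You need to weigh the cost of context switching against the time each iteration takes to decide whether the added complexity of multithreading is appropriate. At 7.68 x 10^-5 seconds per read, I see your reader thread sitting idle most of the time. That doesn't make sense to me. Just use a single loop on a single thread.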

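char[] buffer = new char[2048];
using (StreamReader sr = new StreamReader(@"C:\20meg.bin"))
{
    // Read returns the number of characters read; 0 means end of stream.
    while (sr.Read(buffer, 0, buffer.Length) != 0)
    {
        // do nothing; this only measures the raw read time
    }
}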

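For a large operation like this, you want a forward-only, non-cached reader. It looks like your data is XML, so an XmlTextReader is a good fit. Here is some sample code. Hope this helps.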

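string firstName;
string lastName;
// Assumes GetDebugInfo() returns a StreamReader, as in the question,
// so it is wrapped in an XmlTextReader here.
using (XmlTextReader reader = new XmlTextReader(GetDebugInfo()))
{
    while (reader.Read())
    {
        if (reader.IsStartElement() && reader.Name == "student")
        {
            reader.ReadToDescendant("firstName");
            reader.Read(); // move from <firstName> to its text node
            firstName = reader.Value;
            reader.ReadToFollowing("lastName");
            reader.Read(); // move from <lastName> to its text node
            lastName = reader.Value;
            AddStudent(firstName, lastName);
        }
    }
}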

I used the following XML:

<students>
    <student>
        <firstName>Antonio</firstName>
        <lastName>Namnum</lastName>
    </student>
    <student>
        <firstName>Alicia</firstName>
        <lastName>Garcia</lastName>
    </student>
    <student>
        <firstName>Christina</firstName>
        <lastName>SomeLattName</lastName>
    </student>
</students>

You may need to adjust it. This should run considerably faster.
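The sample XML above is wrapped in a <students> root, while the text in the question has none. If the input happens to be well-formed apart from the missing root, one possible variation is XmlReader with ConformanceLevel.Fragment, which accepts several top-level elements. This is only a sketch: the class name StudentFragmentReader and the method ReadStudents are made up for illustration, and it will not help if the data is not valid XML in other ways, which the question suggests may be the case.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of the original answer.
public static class StudentFragmentReader
{
    // Reads every <student> element from input that may lack a single root element.
    public static List<(string First, string Last)> ReadStudents(TextReader input)
    {
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment, // allow multiple top-level elements
            IgnoreWhitespace = true                       // skip indentation between elements
        };

        var students = new List<(string First, string Last)>();
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            // ReadToFollowing advances to the next <student> start tag, if any.
            while (reader.ReadToFollowing("student"))
            {
                string first = null;
                string last = null;
                if (reader.ReadToDescendant("firstName"))
                {
                    // Reads the text and moves past </firstName>.
                    first = reader.ReadElementContentAsString();
                    // Assumes <lastName> directly follows <firstName>, as in the sample.
                    if (reader.IsStartElement("lastName"))
                        last = reader.ReadElementContentAsString();
                }
                students.Add((first, last));
            }
        }
        return students;
    }
}
```

Because the reader is forward-only, memory use stays flat no matter how many students the stream contains; pass the StreamReader from GetDebugInfo() as the input.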

score 1 · Accepted Answer

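You can read line by line, but if reading the data itself takes 15 seconds, there is not much you can do to speed things up.

Before making any significant changes, try simply reading all lines of the file without doing any processing. If that alone takes longer than your goal, adjust the goal or change the file format. Otherwise, see how much you can expect to gain from optimizing the parsing. Regex is quite fast for uncomplicated patterns.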
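A rough way to take that baseline is to time the raw read and the regex pass separately with Stopwatch. The sketch below reuses the pattern from the question; the class name ReadVersusParseTimer and its method names are invented for illustration, and the timings it returns will depend entirely on the machine and the source of the stream.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper for measuring the two phases separately.
public static class ReadVersusParseTimer
{
    // Drains the reader in 2048-character chunks without processing anything.
    public static (long Chars, TimeSpan Elapsed) TimeRawRead(TextReader reader)
    {
        var buffer = new char[2048];
        long total = 0;
        int read;
        var stopwatch = Stopwatch.StartNew();
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            total += read;
        stopwatch.Stop();
        return (total, stopwatch.Elapsed);
    }

    // Counts matches of the question's pattern without calling AddStudent.
    public static (int Matches, TimeSpan Elapsed) TimeRegex(string text)
    {
        int count = 0;
        var stopwatch = Stopwatch.StartNew();
        for (Match m = Regex.Match(text, @"(?s)<student>.+?</student>"); m.Success; m = m.NextMatch())
            count++;
        stopwatch.Stop();
        return (count, stopwatch.Elapsed);
    }
}
```

If TimeRawRead alone accounts for most of the 25 seconds, the bottleneck is whatever produces the stream, and no parsing strategy will help much.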

score 1 · Accepted Answer

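Regex is not the fastest way to parse a string. You need a tailored parser, similar to XmlReader, that matches your data structure. It will let you read the file in parts and parse it faster than regex does.

Since you have a limited set of tags and limited nesting, an FSM approach (http://en.wikipedia.org/wiki/Finite-state_machine) will work for you.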
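As a much-simplified stand-in for such a parser, the sketch below reads the stream in chunks and only tracks two states: looking for an opening <student> tag, or looking for the matching closing tag. It is not a full state machine and knows nothing about nested tags. The names StudentRecordScanner and ReadRecords are hypothetical, and unlike the regex in the question it also accepts an empty <student></student> block.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, sketched under the assumptions stated above.
public static class StudentRecordScanner
{
    private const string OpenTag = "<student>";
    private const string CloseTag = "</student>";

    // Lazily yields each <student>...</student> block while reading the input in chunks.
    public static IEnumerable<string> ReadRecords(TextReader input, int chunkSize = 2048)
    {
        var buffer = new char[chunkSize];
        string pending = string.Empty; // unconsumed text carried over between chunks
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            pending += new string(buffer, 0, read);
            int pos = 0;
            while (true)
            {
                // State 1: look for the next opening tag.
                int start = pending.IndexOf(OpenTag, pos, StringComparison.Ordinal);
                if (start < 0) break;

                // State 2: look for the closing tag that ends this record.
                int end = pending.IndexOf(CloseTag, start + OpenTag.Length, StringComparison.Ordinal);
                if (end < 0)
                {
                    pos = start; // record is incomplete; keep it and wait for more input
                    break;
                }

                end += CloseTag.Length;
                yield return pending.Substring(start, end - start);
                pos = end;
            }
            // Keep only the unconsumed tail, which may hold a partial tag or record.
            pending = pending.Substring(pos);
        }
    }
}
```

Each yielded string could be passed to AddStudent as soon as it is complete, so parsing overlaps with reading without a second thread:

    foreach (string record in StudentRecordScanner.ReadRecords(GetDebugInfo()))
        AddStudent(record);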

score 1 · Accepted Answer

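This is the fastest so far (I may still try a few more things).

I created an array of arrays, char[][] listToProcess = new char[200000][];, where I place chunks of the stream. In a separate task I start processing each chunk. The code looks like this: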

   StreamReader sr = GetUnparsedDebugInfo(); // get streamReader                        

   var task1 = Task.Factory.StartNew(() =>
   {
       Thread.Sleep(500); // wait a little so there are items on list (listToProcess) to work with
       StartProcesingList();
   });

   int counter = 0;

   while (true)
   {
       char[] buffer = new char[2048]; // create a new buffer each time, because it is added to the list to process

       var charsRead = sr.Read(buffer, 0, buffer.Length);

       if (charsRead < 1) // if we reach the end then stop
       {
           break;
       }

       listToProcess[counter] = buffer;
       counter++;
   }

   task1.Wait();

The method StartProcesingList() basically iterates through the list until it reaches a null entry.

    void StartProcesingList()
    {
        int indexOnList = 0;

        while (true)
        {
            if (listToProcess[indexOnList] == null)
            {
                Thread.Sleep(100); // wait a little in case other thread is adding more items to the list

                if (listToProcess[indexOnList] == null)
                    break;
            }

            // Add the chunk to the dictionary. Recall that listToProcess[indexOnList] is a
            // char array, so this converts it to a string and splits it where appropriate.
            // There is more logic for the case where the end of one chunk has to be
            // joined with the start of the next chunk on the list.
            ProcessChunk(listToProcess[indexOnList]);

            indexOnList++;                
        }

    }

score 0 · Accepted Answer

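@kakridge was right. I could run into a race condition where, for example, one task is writing listToProcess[30] while another thread is parsing listToProcess[30]. To fix that, and to get rid of the inefficient Thread.Sleep calls, I ended up using a semaphore. Here is my new code: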

        StreamReader unparsedDebugInfo = GetUnparsedDebugInfo(); // get streamReader 
        listToProcess = new char[200000][];
        lastPart = null;
        matchLength = 0;

        // Used to signal events between thread that is reading text 
        // from readelf.exe and the thread that is parsing chunks
        Semaphore semaphore = new Semaphore(0, 1);

        // If task1 runs out of chunks to process, it waits for the semaphore to be released
        bool task1IsWaiting = false;

        // Used to note that there are no more chunks to add to listToProcess.
        bool mainTaskIsDone = false;

        int counter = 0; // keep track of which chunk we have added to the list

        // This task will be executed on a separate thread. Meanwhile the other thread adds nodes to  
        // "listToProcess" array this task will add those chunks to the dictionary. 
        var task1 = Task.Factory.StartNew(() =>
        {
            semaphore.WaitOne(); // wait until there are at least 1024 nodes to be processed

            int indexOnList = 0; // counter to identify the index of chunk[] we are adding to dictionary

            while (true)
            {
                if (indexOnList>=counter)   // if equal it might be dangerous! 
                {                           // chunk could be being written to and at the same time being parsed.
                    if (mainTaskIsDone)// if the main task is done executing stop
                        break;

                    task1IsWaiting = true; // otherwise wait until there are more chunks to be processed
                    semaphore.WaitOne();
                }

                ProcessChunk(listToProcess[indexOnList]); // add chunk to dictionary
                indexOnList++;
            }
        });


        // this block being executed on main thread  is responsible for placing the streamreader 
        // into chunks of char[] so that task1 can start processing those chunks
        {                
            // task1 is first released once 1024 chunks are queued; each time it reports that it is
            // waiting, at least 256 more chunks are queued before it is released again
            int waitCounter = 1024;

            while (true)
            {
                char[] buffer = new char[2048]; // buffer where we will place data read from stream

                var charsRead = unparsedDebugInfo.Read(buffer, 0, buffer.Length);

                if (charsRead < 1){
                    listToProcess[counter] = pattern;
                    break;
                }

                listToProcess[counter] = buffer;
                counter++; // add chunk to list to be processed by task1

                if (task1IsWaiting)
                {               // if task1 is waiting for more nodes process 256
                    waitCounter = counter + 256;    // more nodes then continue execution of task2
                    task1IsWaiting = false;
                }
                else if (counter == waitCounter)                    
                    semaphore.Release();                    
            }
        }

        mainTaskIsDone = true; // let other thread know that this task is done

        semaphore.Release(); // release all threads that might be waiting on this thread

        task1.Wait(); // wait for all nodes to finish processing
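A different way to get the same reader/parser hand-off, swapped in here instead of the hand-rolled semaphore and shared flags, is BlockingCollection<T> from System.Collections.Concurrent. It handles the blocking and the "no more items" signal itself. It also avoids one hazard in the code above: Semaphore(0, 1) caps the count at 1, so calling Release while the count is already 1 throws SemaphoreFullException. The sketch below is an assumption-laden illustration, not the code that was actually used: ChunkPipeline, Run, and the processChunk callback are hypothetical names, and error handling is minimal.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Hypothetical producer/consumer helper built on BlockingCollection.
public static class ChunkPipeline
{
    // Reads chunks on the calling thread and processes them, in order, on a background task.
    public static void Run(TextReader input, Action<char[], int> processChunk, int chunkSize = 2048)
    {
        // Bounded to 1024 chunks so the reader cannot run arbitrarily far ahead of the parser.
        using (var queue = new BlockingCollection<(char[] Buffer, int Count)>(1024))
        {
            Task consumer = Task.Run(() =>
            {
                // Blocks while the queue is empty; ends after CompleteAdding once the queue drains.
                foreach (var chunk in queue.GetConsumingEnumerable())
                    processChunk(chunk.Buffer, chunk.Count);
            });

            try
            {
                while (true)
                {
                    var buffer = new char[chunkSize]; // new buffer per chunk, since it is handed off
                    int read = input.Read(buffer, 0, buffer.Length);
                    if (read < 1) break;              // end of stream
                    queue.Add((buffer, read));        // blocks if the queue is full
                }
            }
            finally
            {
                queue.CompleteAdding(); // tell the consumer that no more chunks are coming
                consumer.Wait();        // wait for all queued chunks to be processed
            }
        }
    }
}
```

The callback receives the number of valid characters along with the buffer, because the final chunk is usually only partly filled. Joining a record that straddles two chunks is still the callback's job, as it was for ProcessChunk above.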
