c# - Reading a one-million-line CSV file in parallel in C#

Question

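I have a CSV file with more than 1 million rows of data. I plan to read them in parallel to improve efficiency. Can I do something like the following, or is there a more efficient approach?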

namespace ParallelData
{
public partial class ParallelData : Form
{
    public ParallelData()
    {
        InitializeComponent();
    }

    private static readonly char[] Separators = { ',', ' ' };

    private static void ProcessFile()
    {
        var lines = File.ReadLines("BigData.csv");
        var numbers = ProcessRawNumbers(lines);

        var rowTotal = new List<double>();
        var totalElements = 0;

        foreach (var values in numbers)
        {
            var sumOfRow = values.Sum();
            rowTotal.Add(sumOfRow);
            totalElements += values.Count;
        }
        MessageBox.Show(totalElements.ToString());
    }

    private static List<List<double>> ProcessRawNumbers(IEnumerable<string> lines)
    {
        var numbers = new List<List<double>>();
        /*System.Threading.Tasks.*/
        Parallel.ForEach(lines, line =>
        {
            lock (numbers)
            {
                numbers.Add(ProcessLine(line));
            }
        });
        return numbers;
    }

    private static List<double> ProcessLine(string line)
    {
        var list = new List<double>();
        foreach (var s in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            double i;
            if (Double.TryParse(s, out i))
            {
                list.Add(i);
            }
        }
        return list;
    }

    private void button2_Click(object sender, EventArgs e)
    {
        ProcessFile();
    }
}
}

score 13 · Accepted Answer

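I'm not sure this is a good idea. Depending on your hardware, the CPU will not be the bottleneck; the disk read speed will.

Another point: if your storage hardware is a magnetic hard disk, the read speed depends heavily on how the file is physically laid out on the disk. If the file is not fragmented (i.e. all of its blocks are stored sequentially on the disk), you will get better performance by reading it sequentially, line by line.

One solution is to read the whole file at once with File.ReadAllLines (if you have enough memory, which should be fine for 1 million lines), store all the lines in an array of strings, and then process them (i.e. parse them with string.Split and so on) in your Parallel.ForEach, provided the row order is not important.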
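A minimal sketch of that read-then-process approach, assuming the whole file fits in memory. The names `CsvSums`, `SumLine`, `SumRows` and `SumFile` are hypothetical, and parsing with the invariant culture is an assumption (the question's code uses the current culture). Because `Parallel.For` writes each result into its own array slot, no lock is needed and the row order is kept.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Threading.Tasks;

public static class CsvSums
{
    // Same separators as the question's code.
    private static readonly char[] Separators = { ',', ' ' };

    // Parses one line and returns the sum of its numeric fields.
    // Fields that do not parse as a double are skipped.
    public static double SumLine(string line)
    {
        double sum = 0;
        foreach (var s in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Assumption: numbers use '.' as the decimal separator.
            if (double.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture, out var value))
            {
                sum += value;
            }
        }
        return sum;
    }

    // Processes lines that are already in memory in parallel.
    // Each iteration writes only to rowTotals[i], so no locking is required.
    public static double[] SumRows(string[] lines)
    {
        var rowTotals = new double[lines.Length];
        Parallel.For(0, lines.Length, i =>
        {
            rowTotals[i] = SumLine(lines[i]);
        });
        return rowTotals;
    }

    // Reads the file sequentially on one thread, then parses in parallel.
    public static double[] SumFile(string path)
    {
        return SumRows(File.ReadAllLines(path));
    }
}
```

Calling `CsvSums.SumFile("BigData.csv")` would return one total per row. Whether this beats a plain `foreach` depends on how expensive the per-line work is, so it is worth measuring both.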

score 0 · Accepted Answer

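In general, you should try to avoid accessing the disk from multiple threads. The disk is the bottleneck and will block, so doing this may hurt performance.

If the size of the file isn't an issue, you should probably read the whole thing in first and then process it in parallel.

If the file is too large for that, or doing so is not practical, you can use a BlockingCollection to load it. Use one thread to read the file and fill the BlockingCollection, then use Parallel.ForEach to process the items in it. BlockingCollection lets you specify a maximum size for the collection, so it will only read more lines from the file as the items already in the collection are processed and removed.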

    static void Main(string[] args)
    {
        string filename = @"c:\vs\temp\test.txt";
        int maxEntries = 2;

        var c = new BlockingCollection<String>(maxEntries);
        
        var taskAdding = Task.Factory.StartNew(delegate
        {
            var lines = File.ReadLines(filename);
            foreach (var line in lines)
            {
                c.Add(line);    // when there are maxEntries items
                                // in the collection, this line 
                                // and thread will block until 
                                // the processing thread removes 
                                // an item
            }

            c.CompleteAdding(); // this tells the collection there's
                                // nothing more to be added, so the 
                                // enumerator in the other thread can 
                                // end
        });

        // No wait loop is needed here: GetConsumingEnumerable() blocks
        // until an item is available or CompleteAdding() has been called.
        // (A busy wait on c.Count would spin forever on an empty file.)

        Parallel.ForEach(c.GetConsumingEnumerable(), i =>
           {
               // NOTE: GetConsumingEnumerable() removes items from the 
               //   collection as it enumerates over it, this frees up
               //   the space in the collection for the other thread
               //   to write more lines from the file
               Console.WriteLine(i);  
           });

        Console.ReadLine();
    }

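As with some of the other answers, though, I have to ask: do you really need to optimize this through parallelization, or would a single-threaded solution perform well enough? Multithreading adds a lot of complexity, and sometimes it isn't worth it.

What kind of performance are you seeing that you want to improve on?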

score 0 · Accepted Answer

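I checked this code on my computer, and it looks like using Parallel to read a CSV file without any CPU-expensive computation makes no sense. Running it in parallel takes more time than running it on one thread. Here are my results. For the code above:

2699 ms, 2712 ms (measured twice just to confirm the result)

Then: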

private static IEnumerable<List<double>> ProcessRawNumbers2(IEnumerable<string> lines)
{
    var numbers = new List<List<double>>();
    foreach (var line in lines)
    {
        lock (numbers)
        {
            numbers.Add(ProcessLine(line));
        }
    }
    return numbers;
}

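gives me: 2075 ms, 2106 ms

So I think that unless the numbers in the CSV need to be processed in some way (through some extensive computation or the like) before being stored in the program, using parallelism makes no sense in this case, because it only adds overhead.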
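Timings like the ones above can be collected with `System.Diagnostics.Stopwatch`. The helper below is only a sketch, and `Timing.Measure` is a hypothetical name. A single run is noisy, so repeating the measurement, as was done above, is advisable.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs the action once and returns the elapsed wall-clock time
    // in milliseconds.
    public static long Measure(Action action)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }
}
```

For example, `Timing.Measure(() => ProcessRawNumbers(File.ReadLines("BigData.csv")))` would time the parallel version from the question, and the same call with `ProcessRawNumbers2` would time the sequential one.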
