c# - Multithreaded file I/O (reading) in C#

Question

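We have a situation where our application needs to process a series of files, and rather than performing this work synchronously, we would like to use multithreading to split the workload across different threads.

Each item of work is:
1. Open a file for read-only access
2. Process the data in the file
3. Write the processed data to a Dictionary

Should we execute each file's work on a new thread? Is this possible? Would we be better off using the ThreadPool or spawning new threads, keeping in mind that each item of "work" only takes 30 ms, although there may be hundreds of files to process?

Any ideas for making this more efficient are appreciated.

EDIT: At the moment we are using the ThreadPool to handle this. If we have 500 files to process, we loop through the files and assign each "unit of processing work" to the thread pool using QueueUserWorkItem.

Is it suitable to make use of the thread pool for this?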

score 8 · Accepted Answer

I would suggest using ThreadPool.QueueUserWorkItem(...). With it, the threads are managed by the system and the .NET Framework. The chances of messing things up with your own thread pool are much higher, so I would recommend the ThreadPool provided by .NET. It's very easy to use:

ThreadPool.QueueUserWorkItem(new WaitCallback(YourMethod), ParameterToBeUsedByMethod);

static void YourMethod(object o) { /* your code here */ }
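A fuller sketch of this approach for the question's scenario might look like the following. It assumes a hypothetical ProcessContent step (here it just returns the text length), guards the shared Dictionary with a lock because Dictionary is not thread-safe, and uses a CountdownEvent so the caller knows when every work item has finished. Error handling is left out: an exception thrown inside a pool callback would take the process down.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

static class ThreadPoolFileProcessor
{
    // Hypothetical stand-in for the real per-file processing step.
    static int ProcessContent(string content) => content.Length;

    public static Dictionary<string, int> ProcessAll(IList<string> paths)
    {
        var results = new Dictionary<string, int>();
        var gate = new object(); // protects 'results'

        // One signal is expected per queued file.
        using (var pending = new CountdownEvent(paths.Count))
        {
            foreach (string path in paths)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try
                    {
                        var file = (string)state;
                        int value = ProcessContent(File.ReadAllText(file));
                        lock (gate) { results[file] = value; }
                    }
                    finally
                    {
                        pending.Signal(); // count down even if processing failed
                    }
                }, path);
            }

            pending.Wait(); // block until every work item has signalled
        }

        return results;
    }

    static void Main(string[] args)
    {
        // Usage: pass a directory; every file in it is processed.
        var results = ProcessAll(Directory.GetFiles(args[0]));
        Console.WriteLine("Processed {0} files", results.Count);
    }
}
```

The lock is held only for the dictionary write, so the file reads and the processing still overlap across pool threads.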

For further reading, see http://msdn.microsoft.com/en-us/library/3dasc8as%28VS.80%29.aspx

Hope this helps.

score 2 · Accepted Answer

Instead of dealing with threads or managing thread pools directly, I would suggest using a higher-level library like Parallel Extensions (PEX):

// AsParallel() is applied to the source so that the reads and the
// processing run in parallel, not just the final ToDictionary step.
var filesContent = from file in enumerableOfFilesToProcess.AsParallel()
                   select new 
                   {
                       File=file, 
                       Content=File.ReadAllText(file)
                   };

var processedContent = from content in filesContent
                       select new 
                       {
                           content.File, 
                           ProcessedContent = ProcessContent(content.Content)
                       };

var dictionary = processedContent
           .ToDictionary(c => c.File);

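PEX will handle thread management according to the available cores and load, while you get to concentrate on the business logic at hand (wow, that sounded like a commercial!)

PEX is part of .NET Framework 4.0, but a backport to 3.5 is also available as part of the Reactive Framework.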

score 2 · Accepted Answer

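I suggest you have a limited number of threads (say 4) and then 4 pools of work. That is, if you have 400 files to process, split them evenly at 100 files per thread. You then spawn the threads, pass each its share of the work, and let them run until they have finished that work.

You only have a certain amount of I/O bandwidth, so having too many threads will not provide any benefit. Also remember that creating a thread takes a small amount of time.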
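As a rough sketch of this fixed-partition idea: the class name, the ProcessContent stand-in, and the choice to hand thread k every threadCount-th file (which still gives an even split) are all assumptions made for illustration. Each thread fills its own dictionary, so no locking is needed until the results are merged after Join.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

static class PartitionedFileProcessor
{
    // Hypothetical stand-in for the real per-file processing step.
    static int ProcessContent(string content) => content.Length;

    public static Dictionary<string, int> ProcessAll(IList<string> paths, int threadCount)
    {
        // One private dictionary per thread: no shared writes while working.
        var partials = new Dictionary<string, int>[threadCount];
        var threads = new Thread[threadCount];

        for (int t = 0; t < threadCount; t++)
        {
            int index = t; // copy the loop variable for the closure
            partials[index] = new Dictionary<string, int>();
            threads[index] = new Thread(() =>
            {
                // Thread 'index' handles files index, index + threadCount, ...
                for (int i = index; i < paths.Count; i += threadCount)
                {
                    partials[index][paths[i]] = ProcessContent(File.ReadAllText(paths[i]));
                }
            });
            threads[index].Start();
        }

        foreach (Thread thread in threads) thread.Join();

        // Merge on the calling thread once all workers are done.
        var merged = new Dictionary<string, int>();
        foreach (var partial in partials)
            foreach (var pair in partial)
                merged[pair.Key] = pair.Value;
        return merged;
    }

    static void Main(string[] args)
    {
        // Usage: pass a directory; its files are split across 4 threads.
        var results = ProcessAll(Directory.GetFiles(args[0]), 4);
        Console.WriteLine("Processed {0} files", results.Count);
    }
}
```

A static split like this works best when the files take roughly the same time to process. If they vary a lot, some threads will finish early and sit idle.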

score 1 · Accepted Answer

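I recommend using the CCR (Concurrency and Coordination Runtime), which will handle the low-level threading details for you. As for your strategy, one thread per work item may not be the best approach, depending on how you write to the dictionary: because Dictionary is not thread-safe, the writes have to be synchronized, and that can create heavy contention.

Here is some sample code using the CCR; an Interleave would work nicely here: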

Arbiter.Activate(dispatcherQueue, Arbiter.Interleave(
    new TeardownReceiverGroup(Arbiter.Receive<bool>(
        false, mainPort, new Handler<bool>(Teardown))),
    new ExclusiveReceiverGroup(Arbiter.Receive<object>(
        true, mainPort, new Handler<object>(WriteData))),
    new ConcurrentReceiverGroup(Arbiter.Receive<string>(
        true, mainPort, new Handler<string>(ReadAndProcessData)))));

public void WriteData(object data)
{
    // write data to the dictionary
    // this code is never executed in parallel so no synchronization code needed
}

public void ReadAndProcessData(string s)
{
    // this code gets scheduled to be executed in parallel
    // the CCR takes care of the task scheduling for you
}

public void Teardown(bool b)
{
    // clean up when all tasks are done
}

score 1 · Accepted Answer

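In the long run, I think you'll be happier if you manage your own threads. This will let you control how many are running and easily report status.

Build a worker class that does the processing, and give it a callback routine to return results and status.
For each file, create a worker instance and a thread to run it. Put the thread in a Queue.
Peel threads off the queue up to the maximum you want running simultaneously. As each thread completes, go get another. Adjust the maximum and measure throughput. I prefer to use a Dictionary to hold the running threads, keyed by their ManagedThreadId.
To stop early, just clear the queue.
Use locking around your thread collections to preserve your sanity.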
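A simplified variation on this design is sketched below. Note the departure from the steps above: it queues file paths and starts a fixed number of worker threads that pull from the queue, instead of creating one thread per file and queueing the threads. The class name, the ProcessContent stand-in, and the callback signature are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

sealed class FileWorkQueue
{
    private readonly Queue<string> _pending;
    private readonly object _gate = new object(); // protects _pending
    private readonly Action<string, int> _onResult;

    public FileWorkQueue(IEnumerable<string> paths, Action<string, int> onResult)
    {
        _pending = new Queue<string>(paths);
        _onResult = onResult;
    }

    // Hypothetical stand-in for the real per-file processing step.
    static int ProcessContent(string content) => content.Length;

    // Stop early: workers finish their current file and then find the queue empty.
    public void Cancel()
    {
        lock (_gate) { _pending.Clear(); }
    }

    // Starts up to maxThreads workers and blocks until they have all exited.
    public void Run(int maxThreads)
    {
        var threads = new List<Thread>();
        for (int i = 0; i < maxThreads; i++)
        {
            var thread = new Thread(Work);
            threads.Add(thread);
            thread.Start();
        }
        foreach (Thread thread in threads) thread.Join();
    }

    private void Work()
    {
        while (true)
        {
            string path;
            lock (_gate)
            {
                if (_pending.Count == 0) return; // nothing left: worker exits
                path = _pending.Dequeue();
            }
            int value = ProcessContent(File.ReadAllText(path));
            _onResult(path, value); // the callback runs on the worker thread
        }
    }
}
```

Because the callback is invoked from several worker threads, it has to do its own synchronization, for example by taking a lock before writing to a shared Dictionary. Changing maxThreads and timing Run is one way to do the throughput measurement the answer recommends.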

score 0 · Accepted Answer

Using the ThreadPool for each individual task is definitely a bad idea. In my experience this tends to hurt performance more than it helps. The first reason is that a considerable amount of overhead is required just to allocate a task for the ThreadPool to execute. By default, each application is assigned its own ThreadPool that is initialized with a capacity of ~100 threads. When you are executing 400 operations in parallel, it does not take long to fill the queue with requests, and now you have ~100 threads all competing for CPU cycles. Yes, the .NET Framework does a great job of throttling and prioritizing the queue; however, I have found that the ThreadPool is best left for long-running operations that probably won't occur very often (loading a configuration file, or random web requests). Using the ThreadPool to fire off a few operations at random is much more efficient than using it to execute hundreds of requests at once. Given the current information, the best course of action would be something similar to this:

Create a System.Threading.Thread (or use a SINGLE ThreadPool thread) with a queue that the application can post requests to
Use the FileStream's BeginRead and BeginWrite methods to perform the I/O operations. This will cause the .NET Framework to use native APIs to thread and execute the I/O (IOCP).

This gives you two advantages. The first is that your requests still get processed in parallel while the operating system manages file system access and threading. The second is that, because the bottleneck on the vast majority of systems will be the HDD, you can add custom priority sorting and throttling to your request thread to get greater control over resource usage.

I am currently writing a similar application, and this method is both efficient and fast. Without any threading or throttling my application was only using 10-15% CPU, which can be acceptable for some operations depending on the processing involved; however, it made my PC as slow as if an application were using 80%+ of the CPU. The cause was the file system access. The ThreadPool and IOCP functions do not care if they are bogging the PC down, so don't get confused: they are optimized for performance, even if that performance means your HDD is squealing like a pig.

The only problem I have had is that memory usage ran a little high (50+ MB) during the testing phase, with approximately 35 streams open at once. I am currently working on a solution similar to the MSDN recommendation for SocketAsyncEventArgs, using a pool to allow x requests to operate simultaneously, which ultimately led me to this forum post.
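One way to sketch asynchronous reads with a cap on simultaneous requests is shown below. It is not the BeginRead/EndRead pattern named above: it swaps in the Task-based ReadToEndAsync on a FileStream opened with FileOptions.Asynchronous, and uses a SemaphoreSlim as the throttle. The class name, the ProcessContent stand-in, and the UTF-8 encoding are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

static class ThrottledAsyncReader
{
    // Hypothetical stand-in for the real per-file processing step.
    static int ProcessContent(string content) => content.Length;

    static async Task<string> ReadAllTextAsync(string path)
    {
        // FileOptions.Asynchronous asks the OS for overlapped (asynchronous) I/O.
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                   FileShare.Read, 4096, FileOptions.Asynchronous))
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return await reader.ReadToEndAsync();
        }
    }

    public static async Task<Dictionary<string, int>> ProcessAllAsync(
        IEnumerable<string> paths, int maxConcurrentReads)
    {
        // The semaphore caps how many files are open at the same time.
        using (var throttle = new SemaphoreSlim(maxConcurrentReads))
        {
            var tasks = paths.Select(async path =>
            {
                await throttle.WaitAsync();
                try
                {
                    string text = await ReadAllTextAsync(path);
                    return new KeyValuePair<string, int>(path, ProcessContent(text));
                }
                finally
                {
                    throttle.Release();
                }
            }).ToList();

            // The dictionary is built once, after all reads finish, so no lock is needed.
            KeyValuePair<string, int>[] pairs = await Task.WhenAll(tasks);
            return pairs.ToDictionary(p => p.Key, p => p.Value);
        }
    }

    static void Main(string[] args)
    {
        // Usage: pass a directory; at most 8 files are read at once.
        var results = ProcessAllAsync(Directory.GetFiles(args[0]), 8).GetAwaiter().GetResult();
        Console.WriteLine("Processed {0} files", results.Count);
    }
}
```

Lowering maxConcurrentReads trades throughput for less disk pressure and fewer open streams, which is the kind of throttling control the answer describes.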

Hope this helps somebody with their decision making in the future :)

score 0 · Accepted Answer

Use ThreadPool.QueueUserWorkItem to execute each independent task. Definitely don't create hundreds of threads. That is likely to cause a serious headache.

score 0 · Accepted Answer

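The general rule for using the ThreadPool is: use it if you don't want to worry about when the threads finish (or use mutexes to track them), or about stopping the threads.

So do you need to worry about when the work is done? If not, the ThreadPool is the best option. If you want to track overall progress or stop threads, then your own collection of threads is best.

The ThreadPool is generally more efficient if you are reusing threads. This question will give you a more detailed discussion.

HTH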
