
I am working with a custom network protocol library. It is built on top of TCP/IP and is meant to be used for high-frequency messaging. It is a non-blocking library and uses callbacks as its interface to integrate with the caller.

I am no performance expert, which is why I am asking this question here. The custom library comes with a particular constraint, outlined below:

"The callee should not invoke any of the library's APIs in the context of the callback thread. If they attempt to do so, the thread will hang."

The only way to overcome this API restriction is for me to start another thread that processes the messages and invokes the library to send responses. The library thread and the processing thread share a common queue, protected by a mutex, and use wait_notify() calls to signal the presence of a message.
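For reference, a minimal sketch of that mutex-plus-condition-variable hand-off might look like the following; the Message type and all names are placeholders of mine, not part of the actual library:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

struct Message { std::string payload; };  // hypothetical message type

class MessageQueue {
public:
    // Called from the library's callback thread.
    void push(Message msg) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(msg));
        }
        cv_.notify_one();  // wake the processing thread
    }

    // Called from the processing thread; blocks until a message arrives.
    Message pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        Message msg = std::move(queue_.front());
        queue_.pop();
        return msg;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<Message> queue_;
};
```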

If I receive 80k messages per second, I would be putting threads to sleep and waking them up that often, performing roughly 80k thread context switches per second.

Also, since there are two threads, they will not share the message buffer in the L1 cache. The cache line containing the message will first be filled by the library's thread, then evicted and pulled into the L1 cache of the processing thread's core. Am I missing something, or is it possible that the library's design is simply not suited to high-performance use cases?

My questions are:

  1. I have seen warnings like "Do not use this API in a callback's context, as it can cause locking" across many libraries. What design choices commonly lead to such constraints? If it were a simple problem of the same thread taking a lock multiple times, they could use recursive locks. Is this a reentrancy problem? What challenges might drive an API owner to make their API non-reentrant?

  2. In the design model above, can the library thread and the processing thread share the same core, and consequently a cache line?

  3. How expensive is volatile sig_atomic_t as a mechanism for sharing data between two threads?

  4. Given the high-frequency scenario, what is a lightweight way to share information between the two threads?

The library and my application are built on C++ and Linux.


2 Answers


How would two threads share the same cache line?

Threads have nothing to do with cache lines, at least not explicitly. What can bite you is cache flushing on context switches and TLB invalidation, but given identical virtual address mappings for both threads, the caches should generally be oblivious to these things.

What design choices commonly lead to such constraints?

The implementers of the library do not want to deal with:

  1. Complicated locking schemes.
  2. Reentrancy logic (i.e., you call send(), the library calls you back with on_error(), and you call send() again - they would have to take special care with that; see the sketch after this list).
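A hypothetical sketch of that reentrancy hazard follows; none of this is the real library's code, it only illustrates why a non-recursive internal mutex hangs on a reentrant call:

```cpp
#include <functional>
#include <mutex>

class Library {
public:
    std::function<void()> on_error;  // user-supplied callback

    void send() {
        std::lock_guard<std::mutex> lock(mutex_);  // library's internal state lock
        // ... attempt to transmit; suppose it fails ...
        if (on_error)
            on_error();  // if the callback calls send() again, this thread
                         // tries to re-acquire mutex_: deadlock / undefined behavior
    }

private:
    std::mutex mutex_;  // non-recursive: a second lock from the same thread hangs
};

int main() {
    Library lib;
    lib.on_error = [&] { lib.send(); };  // reentrant call from the callback
    lib.send();                          // hangs here, as the warning describes
}
```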

Personally, I think that designing an API around callbacks is a very bad idea when high performance is involved, especially for anything network-related, even though it sometimes makes life simpler for both users and developers (just in terms of writing the code). The one exception might be CPU interrupt handling, but that is a different story, and you could hardly call it an API.

If it were a simple problem of the same thread taking a lock multiple times, they could use recursive locks.

Recursive mutexes are comparatively very expensive. People who care about run-time efficiency tend to avoid them wherever possible.

In the design model above, can the library thread and the processing thread share the same core, and consequently a cache line?

Yes. You would have to pin both threads to the same CPU core, for example by using sched_setaffinity(). But this also goes beyond a single program: the whole environment has to be configured correctly. For example, you may want to consider forbidding the OS from running anything on that core except your two threads (including interrupts), and forbidding the two threads from migrating to a different CPU.
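A minimal sketch of the pinning step, assuming glibc on Linux (the core number 2 is an arbitrary choice of mine):

```cpp
// g++ on Linux defines _GNU_SOURCE by default, which sched.h's
// CPU_* macros require.
#include <sched.h>   // sched_setaffinity, cpu_set_t, CPU_ZERO, CPU_SET
#include <cstdio>

// Pin the calling thread to a single core so both threads can live
// on the same L1/L2 domain.
bool pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // pid 0 means "the calling thread" for sched_setaffinity on Linux
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

int main() {
    if (!pin_to_core(2))
        std::perror("sched_setaffinity");
    // ... both threads would call pin_to_core(2) before doing work ...
}
```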

How expensive is volatile sig_atomic_t as a mechanism for sharing data between two threads?

It is not expensive in itself. In a multi-core environment, however, you may see cache invalidations, stalls, increased MESI traffic, and so on. Given that both threads are on the same core and nothing interferes, the only penalty is that the variable cannot be cached in a register, which is fine because it is not supposed to be (i.e., the compiler will always fetch it from memory, whether cache or main memory).
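For illustration, the classic volatile sig_atomic_t flag pattern looks like this (the names are mine; note that in C++11 and later, std::atomic is the portable choice, since volatile alone guarantees neither atomicity for wider types nor memory ordering between cores):

```cpp
#include <csignal>  // sig_atomic_t

volatile std::sig_atomic_t message_ready = 0;  // shared flag

void producer_side() {  // e.g., the library's callback thread
    // ... write the message into a shared buffer ...
    message_ready = 1;  // compiler must emit a real store, no register caching
}

void consumer_side() {  // e.g., the processing thread
    while (message_ready == 0) {
        // spin; each iteration re-reads the flag from memory
    }
    // ... read the message, then clear the flag ...
    message_ready = 0;
}
```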

Given the high-frequency scenario, what is a lightweight way to share information between the two threads?

Reading from and writing to the same memory, ideally without any system calls or blocking calls. For example, a ring buffer for two concurrent threads can be implemented on Intel architectures with nothing more than memory barriers, though you have to be extremely attentive to detail to pull it off. If something must be synchronized explicitly, atomic instructions are the next level up. Haswell also ships with transactional memory, which can be used for low-overhead synchronization. After that, nothing is fast.
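By way of illustration, here is a sketch of such a single-producer/single-consumer ring buffer, written with C++11 acquire/release atomics (which compile down to plain loads and stores plus compiler ordering on x86); the capacity and element type are arbitrary assumptions of mine:

```cpp
#include <atomic>
#include <cstddef>

// Single-producer/single-consumer ring buffer. N must be a power of two.
template <typename T, size_t N>
class SpscRing {
public:
    // Producer thread only.
    bool push(const T& item) {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;          // full
        buf_[head & (N - 1)] = item;
        head_.store(head + 1, std::memory_order_release);  // publish
        return true;
    }

    // Consumer thread only.
    bool pop(T& item) {
        size_t tail = tail_.load(std::memory_order_relaxed);
        size_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return false;              // empty
        item = buf_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    T buf_[N];
    std::atomic<size_t> head_{0};  // total pushed; written by producer only
    std::atomic<size_t> tail_{0};  // total popped; written by consumer only
};
```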

Also, take a look at Chapter 11 of the Intel Architectures Developer's Manual, on memory caching and cache control.

Answered 2013-01-17T04:39:25.467

An important thing to keep in mind here is that when working on network applications, the more important performance metric is "latency-per-task" and not the raw CPU cycle throughput of the entire application. To that end, thread message queues tend to be a very good method for responding to activity in the quickest possible fashion.

80k messages per second on today's server infrastructure (or even my Core i3 laptop) borders on insignificant territory -- especially insofar as L1 cache performance is concerned. If the threads are doing a significant amount of work, then it's not unreasonable at all to expect the CPU to flush through the L1 cache every time a message is processed; and if the messages are not doing very much work at all, then it just doesn't matter, because it's probably going to be less than 1% of the CPU load regardless of L1 policy.

At that rate of messaging I would recommend a passive threading model, e.g., one where threads are woken up to handle messages and then fall back asleep. That will give you the best latency-vs-performance trade-off. It's not the most performance-efficient method, but it will be the best at responding quickly to network requests (which is usually what you want to favor when doing network programming).

On today's architectures (2.8 GHz, 4+ cores), I wouldn't even begin to worry about raw performance unless I expected to be handling maybe 1 million queued messages per second. And even then, it would depend a bit on exactly how much Real Work the messages are expected to perform. If they aren't expected to do much more than prep and send some packets, then 1 million is definitely conservative.

Is there a way, in the above design model, for the library thread and processing thread to share the same core, and consequently share a cache line?

No. I mean, sure there is if you want to roll your own Operating System. But if you want to run in a multitasking environment with the expectation of sharing the CPU with other tasks, then "No." And locking threads to cores is something that is very likely to hurt your threads' avg response times, without providing much in the way of better performance. (and any performance gain would be subject to the system being used exclusively for your software and would probably evaporate on a system running multiple tasks)

Given a high frequency scenario, what is a light-weight way to share information between two threads?

Message queues. :) Seriously. I don't mean to sound silly, but that's what message queues are: they share information between two threads and they're typically light-weight about it. If you want to reduce context switches, only signal the worker to drain the queue after some number of messages have accumulated (or after some timeout period, in case of low activity) -- but be wary that this will increase your program's response time/latency.
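As a hedged sketch of that batching idea (all names and thresholds here are my own, not from any particular library), the producer only notifies the worker once a batch has accumulated, while the consumer uses a timed wait so low traffic still gets drained:

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>

constexpr size_t kBatchSize = 32;  // assumed batching threshold

std::mutex mtx;
std::condition_variable cv;
std::deque<std::string> queue;

void produce(std::string msg) {
    size_t pending;
    {
        std::lock_guard<std::mutex> lock(mtx);
        queue.push_back(std::move(msg));
        pending = queue.size();
    }
    if (pending >= kBatchSize)
        cv.notify_one();  // signal once per batch, not once per message
}

void consume() {
    for (;;) {
        std::unique_lock<std::mutex> lock(mtx);
        // Wake on a full batch, or after 1 ms so low activity still drains.
        cv.wait_for(lock, std::chrono::milliseconds(1),
                    [] { return queue.size() >= kBatchSize; });
        std::deque<std::string> batch;
        batch.swap(queue);  // drain everything under the lock
        lock.unlock();
        for (auto& m : batch) {
            // ... process m outside the lock ...
            (void)m;
        }
    }
}
```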

Answered 2013-01-21T07:20:16.160