cuda - __threadfence() 和 L1 缓存一致性

Question

这是我的理解（参见例如How can I enforce CUDA global memory coherence without declaring pointer as volatile?，GTS 250 和 Fermi 设备之间的 CUDA 块同步差异以及nvidia Developer Zone 中的这篇文章）__threadfence()保证全局写入将是可见的在线程继续之前到其他线程。但是，即使在__threadfence()返回后，另一个线程仍然可以从其 L1 缓存中读取陈旧值。

那是：

线程 A 将一些数据写入全局内存，然后调用__threadfence(). 然后，在返回后 的某个时间__threadfence()，写入对所有其他线程可见，线程 B 被要求从该内存位置读取。它发现它在 L1 中有数据，所以加载它。不幸的是，对于开发人员来说，线程 B 的 L1 中的数据是陈旧的（即与线程 A 更新此数据之前一样）。

首先：这是正确的吗？

假设是这样，那么在我看来，__threadfence()只有当任何一个人可以确定数据不会在 L1 中（有点不太可能？）或者例如读取总是绕过 L1（例如 volatile 或 atomics）时，它才有用。这个对吗？

我问是因为我有一个相对简单的用例 - 将数据传播到二叉树 - 使用原子设置的标志和__threadfence()：到达节点的第一个线程退出，第二个线程根据它的两个孩子（例如最少的数据）。这适用于大多数节点，但通常至少有一个节点失败。声明数据volatile会给出始终如一的正确结果，但会在 99% 以上没有从 L1 中获取过时值的情况下导致性能下降。我想确定这是该算法的唯一解决方案。下面给出一个简化的例子。请注意，节点数组是按广度优先排序的，叶子从索引开始start并且已经填充了数据。

__global__ void propagate_data(volatile Node *nodes,
                               const unsigned int n_nodes,
                               const unsigned int start,
                               unsigned int* flags)
{
    int tid, index, left, right;
    float data;
    bool first_arrival;

    tid = start + threadIdx.x + blockIdx.x*blockDim.x;

    while (tid < n_nodes)
    {
        // We start at a node with a full data section; modify its flag
        // accordingly.
        flags[tid] = 2;

        // Immediately move up the tree.
        index = nodes[tid].parent;
        first_arrival = (atomicAdd(&flags[index], 1) == 0);

        // If we are the second thread to reach this node then process it.
        while (!first_arrival)
        {
            left = nodes[index].left;
            right = nodes[index].right;

            // If Node* nodes is not declared volatile, this occasionally
            // reads a stale value from L1.
            data = min(nodes[left].data, nodes[right].data);

            nodes[index].data = data;

            if (index == 0) {
                // Root node processed, so all nodes processed.
                return;
            }

            // Ensure above global write is visible to all device threads
            // before setting flag for the parent.
            __threadfence();

            index = nodes[index].parent;
            first_arrival = (atomicAdd(&flags[index], 1) == 0);
        }
        tid += blockDim.x*gridDim.x;
    }
    return;
}

score 4 · Accepted Answer

首先：这是正确的吗？

是的，__threadfence()将数据推送到 L2 并推送到全局内存。它对其他SM 中的 L1 缓存没有影响。

这个对吗？

是的，如果您结合__threadfence()使用volatile全局内存访问，您应该确信这些值最终会对其他线程块可见。但是请注意，线程块之间的同步在 CUDA 中并不是一个明确定义的概念。没有明确的机制可以做到这一点，也不能保证线程块执行的顺序，所以仅仅因为你的代码在__threadfence()某个volatile项目上运行了某个地方，仍然不能真正保证另一个线程块可以获取什么数据。这也取决于执行顺序。

如果您使用，则应该绕过volatileL1（如果启用 -当前的 Kepler 设备并没有真正启用 L1 以进行一般全局访问）。如果您不使用volatile，那么当前正在执行__threadfence()操作的 SM 的 L1 应该在__threadfence()操作完成时与 L2（和全局）一致/连贯。

请注意，L2 缓存在整个设备中是统一的，因此始终是“连贯的”。对于您的用例，至少从设备代码的角度来看，L2 和全局内存之间没有区别，无论您在哪个 SM 上。

而且，正如您所指出的，（全局）原子始终在 L2/全局内存上运行。

cuda - __threadfence() 和 L1 缓存一致性

1 回答 1

Related

Reference