
Can you help me figure out whether a cache write takes longer to complete when more cores/caches hold a copy of that line? I would also like to measure/quantify how long it actually takes.

I couldn't find anything useful on Google, and I'm having trouble measuring it myself and interpreting what I measure, because so much can happen on a modern processor (reordering, prefetching, buffering and who knows what).

Details:

My basic procedure for measuring it is roughly as follows:

write something to the cacheline on processor 0
read it on processors 1 to n.

rdtsc
write it on processor 0
rdtsc
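
For reference, the kind of rdtsc wrapper I've been experimenting with (x86-64, gcc inline asm; the lfence placement is exactly one of the things I'm unsure about):

#include <stdint.h>

/* Read the time-stamp counter with lfence on both sides, so earlier
 * instructions retire before the sample is taken and later ones do
 * not start early. (rdtscp plus lfence is a common alternative.) */
static inline uint64_t rdtsc_serialized(void)
{
    uint64_t lo, hi;
    __asm__ __volatile__("lfence\n\trdtsc\n\tlfence"
                         : "=a"(lo), "=d"(hi) : : "memory");
    return (hi << 32) | lo;
}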

I'm not even sure which instructions to use for the read/write on processor 0 to make sure the write/invalidation has completed before the final time measurement.

At the moment I'm fiddling with an atomic exchange (__sync_fetch_and_add()), but it seems that the number of threads itself matters for the duration of this operation (not the number of threads whose copies have to be invalidated), which is probably not what I want to measure?!

I also tried a read, then a write, followed by a memory barrier (__sync_synchronize()). This looks more like what I expected to see, but here too I'm not sure whether the write has completed when the final rdtsc happens.
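
Concretely, the timed section I'm experimenting with looks roughly like this (a sketch only; line is a placeholder for a pointer to the shared cacheline, and rdtsc_serialized() is the wrapper from above):

uint64_t t0, t1;

t0 = rdtsc_serialized();
__sync_fetch_and_add(&line->value, 1); /* atomic RMW: must gain exclusive ownership */
__sync_synchronize();                  /* full barrier (mfence on x86) */
t1 = rdtsc_serialized();

/* t1 - t0 = cost in TSC ticks, including the barrier overhead */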

As you can guess, my knowledge of CPU internals is limited.

Any kind of help is appreciated!

PS:

  • I use Linux, gcc and pthreads for the measurements.
  • I want to know this in order to model my parallel algorithms.

Edit:

In about a week (I'm going on vacation tomorrow) I will do some more research, post my code and notes, and link them here (in case anyone is interested), since the time I can spend on this is limited.


2 Answers


I started writing a long answer describing exactly how this works, then realized that I may not know enough about the exact details. So I'll give a shorter answer....

So, when you write something on one processor, if it's not already in that processor's cache, it will have to be fetched in, and after the processor has read the data, it will perform the actual write. In doing so, it sends a cache-invalidate message to ALL other processors in the system, and those will then throw away their copies. If another processor holds a "dirty" copy of the line, it will itself write out the data and request the invalidation - in which case the first processor will have to RELOAD the data before finishing its write (otherwise, some other element in the same cacheline may get destroyed).

Reading it back into the cache will be required on every other processor that is interested in that cache-line.

The __sync_fetch_and_add() will use a "lock" prefix [on x86; other processors may vary, but the general idea on processors that support "per instruction" locks is roughly the same] - this issues a "I want this cacheline EXCLUSIVELY, everyone else please give it up and invalidate it". Just like in the first case, the processor may well have to re-read anything that another processor may have made dirty.
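
As a concrete illustration (assuming x86-64 and gcc; the exact instruction chosen can vary):

static int counter;

int bump(void)
{
    /* gcc typically compiles this builtin to a single "lock xadd"
     * instruction; the lock prefix forces the cache line into an
     * exclusive state before the read-modify-write executes. */
    return __sync_fetch_and_add(&counter, 1);
}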

A memory barrier will not ensure that data is updated "safely" - it will just make sure that "whatever happened (to memory) before now is visible to all processors by the time this instruction finishes".

The best way to optimize the use of processors is to share as little as possible, and in particular to avoid "false sharing". In a benchmark many years ago, there was a structure like this [simplified]:

struct stuff {
    int x[2];
    ... other data ... total data a few cachelines. 
} data;

void thread1()
{
    for( ... big number ...)
        data.x[0]++;
}

void thread2()
{
    for( ... big number ...)
        data.x[1]++;
}

int main()
{
    start = timenow();

    create(thread1);
    create(thread2);

    /* wait for both threads to finish, so the measurement
       covers the full run */
    join(thread1);
    join(thread2);

    end = timenow() - start;
}

Since EVERY time thread1 wrote to x[0], thread2's processor had to get rid of its copy of x[1], and vice versa, the result was that the SMP test [vs just running thread1] ran about 15 times slower. By altering the struct like this:

struct stuff {
    int x;
    ... other data ... 
} data[2];

and

void thread1()
{
    for( ... big number ...)
        data[0].x++;
}

we got 200% of the single-thread variant's performance [give or take a few percent].
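
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same experiment (hypothetical iteration count; assumes 64-byte cache lines and Linux with pthreads; compile with gcc -O2 -pthread):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

/* two counters on the SAME cache line: false sharing */
static struct { volatile long x[2]; } same_line;

/* two counters padded onto SEPARATE cache lines (64 bytes assumed) */
static struct { volatile long x; char pad[64 - sizeof(long)]; } separate[2];

static void *bump_same(void *arg)
{
    long i, idx = (long)arg;
    for (i = 0; i < (long)ITERS; i++)
        same_line.x[idx]++;     /* volatile keeps the loop honest */
    return NULL;
}

static void *bump_separate(void *arg)
{
    long i, idx = (long)arg;
    for (i = 0; i < (long)ITERS; i++)
        separate[idx].x++;
    return NULL;
}

static double run(void *(*fn)(void *))
{
    struct timespec t0, t1;
    pthread_t a, b;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&a, NULL, fn, (void *)0);
    pthread_create(&b, NULL, fn, (void *)1);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("same cache line:      %.2f s\n", run(bump_same));
    printf("separate cache lines: %.2f s\n", run(bump_separate));
    return 0;
}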

Right, so the processor has queues of buffers where write operations are stored while the processor is writing to memory. A memory barrier (mfence, sfence or lfence) instruction is there to ensure that any outstanding read/write, write, or read type operation has completely finished before the processor proceeds to the next instruction. Normally, the processor would just continue on its jolly way through any following instructions, and eventually the memory operation gets fulfilled some way or another. Since modern processors have a lot of parallel operations and buffers all over the place, it can take quite some time before something ACTUALLY trickles through to where it will eventually end up.

So a barrier is for when it's CRITICAL to make sure that something has ACTUALLY been done before proceeding. For example, if we have written a bunch of instructions to video memory and now want to kick off the run of those instructions, we need to make sure the 'instruction' writes have actually finished and no other part of the processor is still working on them - so we use an sfence to make sure the writes have really happened. That may not be a very realistic example, but I think you get the idea.
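
In code, that pattern could look something like this (a sketch, assuming x86 SSE2 intrinsics; sfence matters most for non-temporal/write-combining stores like these, which bypass the usual store ordering):

#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */

/* Fill a buffer with non-temporal stores, then fence before raising
 * the flag, so a consumer never sees the flag before the data.
 * buf must be 16-byte aligned and n a multiple of 4. */
void publish(int *buf, int n, volatile int *ready)
{
    int i;
    for (i = 0; i < n; i += 4)
        _mm_stream_si128((__m128i *)&buf[i], _mm_set1_epi32(i));
    _mm_sfence();   /* order the streaming stores before the flag write */
    *ready = 1;
}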

Answered 2012-12-26T18:29:16.270

Cache writes have to get line-ownership before dirtying the cache line. Depending on the cache coherence model implemented in the processor architecture, the time taken for this step varies. The most common coherence protocols that I know are:

  • Snooping coherence protocol: all caches monitor the address lines for cached memory lines, i.e. every memory request has to be broadcast to all cpus, which does not scale as the number of cpus increases.
  • Directory-based coherence protocol: the set of cpus sharing each cache line is kept in a directory, so invalidating/gaining ownership is a point-to-point cpu request rather than a broadcast, i.e. more scalable, but latency suffers because the directory is a single point of contention (see the sketch below).
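
A toy illustration of the directory idea (not any real hardware format, just a sketch assuming at most 64 cpus, with one presence bit per cpu):

#include <stdint.h>

enum line_state { INVALID, SHARED, MODIFIED };

/* one directory entry per memory line */
struct dir_entry {
    uint64_t sharers;       /* bit i set => cpu i holds a copy */
    enum line_state state;  /* coherence state of the line */
};

/* to gain write ownership, only the cpus whose presence bits are
 * set need an invalidation, instead of broadcasting to everyone */
static int invalidation_targets(const struct dir_entry *e,
                                int requester, int cpus[64])
{
    int i, n = 0;
    for (i = 0; i < 64; i++)
        if (((e->sharers >> i) & 1) && i != requester)
            cpus[n++] = i;
    return n;
}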

Most cpu architectures support something called a PMU (performance monitoring unit). This unit exports counters for many things, such as cache hits, misses, cache write latency, read latency, tlb hits, etc. Please consult the cpu manual to see whether this info is available.
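
On Linux you can read such counters through the perf_event_open(2) interface; a minimal sketch (hardware cache-miss counter, counting only the code between enable and disable):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* glibc provides no wrapper for this syscall */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    uint64_t count;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    fd = perf_event_open(&attr, 0, -1, -1, 0);  /* this thread, any cpu */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the code you want to measure goes here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &count, sizeof(count)) == sizeof(count))
        printf("cache misses: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}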

Answered 2012-12-27T00:07:08.117