Title: Yet Another False-Sharing Test
Source: Felix021
Date: Sun, 03 Mar 2013 23:31:51 +0000
Author: felix021
URL: https://www.felix021.com/blog/read.php?2106

Content:
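The day before yesterday I read the translated article on the Disruptor concurrency framework on coolshell and was marveling at how temperamental CPUs can be. Shortly afterwards I came across Dutor's "A False-Sharing Test" and was surprised by how large the gap was (a little under 4x with 4 threads, more than 8x with 16 threads; when I ran dutor's code myself, the gap with 16 threads was close to 20x). So I wrote a small program to test it too. Unlike dutor, I implemented mine in C, so it may be less readable.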

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>
#include <limits.h>

void *tester(void *arg)
{
    long *nloop = (long *)arg; // this was mistakenly written as int in an earlier version
    while ( (*nloop)-- );
    return NULL;
}

int driver(int nthread, int nloop, int npad)
{
    size_t size = npad + sizeof(long); // each thread takes sizeof(long) + npad bytes
    char buff[size * nthread];
    pthread_t th[nthread];

    struct timeval s, e;
    gettimeofday(&s, NULL);

    for (int i = 0; i < nthread; i++) {
        long *arg = (long *)(buff + size * i); // must be long: tester() reads and decrements a long
        *arg = nloop;
        pthread_create(&th[i], NULL, tester, (void *)arg);
    }

    void *pret;
    for (int i = 0; i < nthread; i++) 
        pthread_join(th[i], &pret);

    gettimeofday(&e, NULL);

    return (e.tv_sec - s.tv_sec) * 1000000 + e.tv_usec - s.tv_usec;
}

int main()
{
    int nloop = 1024 * 1024 * 128, nthread = 16, npad, best_padding = 0, best_usage = INT_MAX;
    printf("nloop = %d, nthread = %d\n\n", nloop, nthread);
    for (npad = 64; npad >= 0; npad -= 8) { // step by 8 to avoid any penalty from a long that is not 8-byte aligned
        int i, usage = 0;
        for (i = 0; i < 3; i++)
            usage += driver(nthread, nloop, npad);
        usage /= 3;
        if (usage < best_usage) {
            best_usage = usage;
            best_padding = npad;
        }
        printf("padding: %2d, time usage: %12d\n", npad, usage);
    }
    printf("\nbest padding: %2d, time usage: %12d\n", best_padding, best_usage);
    return 0;
}

Quote
$ gcc false_sharing.c -lpthread -std=c99
$ ./a.out
nloop = 134217728, nthread = 16

padding: 64, time usage:       491395
padding: 56, time usage:       477760
padding: 48, time usage:       853594
padding: 40, time usage:       834318
padding: 32, time usage:       905200
padding: 24, time usage:       940989
padding: 16, time usage:       991595
padding:  8, time usage:      1040412
padding:  0, time usage:      1112716

best padding: 56, time usage:       477760


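The machine has four Xeon E7520 @ 1.87GHz CPUs, each with 4 cores and 8 threads (16 physical cores and 32 logical cores in total), and 64GB of RAM. The cache_alignment reported in /proc/cpuinfo is 64.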
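The line size can also be queried from a program instead of being read from /proc/cpuinfo. The following is only a sketch, assuming Linux with glibc: `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension rather than part of POSIX, and `cache_line_size_or` is a hypothetical helper name that is not part of the original test.

```c
#include <assert.h>
#include <unistd.h>

/* Returns the L1 data cache line size in bytes, or `fallback` when the
 * platform does not report one (sysconf may return 0 or -1). */
long cache_line_size_or(long fallback)
{
    assert(fallback > 0);
    long n = -1;
#ifdef _SC_LEVEL1_DCACHE_LINESIZE   /* glibc extension, not in POSIX */
    n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif
    return n > 0 ? n : fallback;
}
```

Calling `cache_line_size_or(64)` falls back to the value reported for this machine when the query is unavailable.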

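As you can see, padding=56 (where each counter slot is exactly one cache line, 8 + 56 = 64 bytes) is the most efficient, more than twice as fast as no padding at all. The difference is obvious, but clearly nowhere near as dramatic as in dutor's test.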
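The effect of each padding value can be checked with a little arithmetic. The sketch below is an illustration under two assumptions: the buffer starts on a cache-line boundary (the stack array in the test does not guarantee this), and the stride is a multiple of 8 so that no counter straddles two lines. `counters_sharing_a_line` is a hypothetical helper, not part of the test program.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how many of `nthread` counters share a cache line with at least one
 * other counter, when counter i starts at byte offset i * stride.
 * In the test program, stride = sizeof(long) + npad. */
int counters_sharing_a_line(int nthread, size_t stride, size_t line)
{
    assert(nthread > 0 && stride > 0 && line > 0);
    int shared = 0;
    for (int i = 0; i < nthread; i++) {
        size_t li = (stride * (size_t)i) / line;   /* line index of counter i */
        for (int j = 0; j < nthread; j++) {
            if (j != i && (stride * (size_t)j) / line == li) {
                shared++;                          /* counter i has a neighbour */
                break;
            }
        }
    }
    return shared;
}
```

With an 8-byte long, padding 56 gives a stride of 64 and padding 64 gives a stride of 72; in both cases no two counters land on the same line, which matches the two fastest rows in the output. With padding 0 (stride 8), eight counters are packed into every line.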

I modified dutor's code slightly: set s[ith].n = NLOOP, pass (void *)&(s[ith].n) as the argument to pthread_create, and change the hook function to:
size_t *n = (size_t *)args;
while ( (*n)-- );
return NULL;

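With this change it runs significantly faster: about 10% faster at padding=56, and as much as 7 times faster at padding=0, so the final performance gap shrinks to roughly 3x. This shows that dutor's method does not measure the bare performance difference and introduces a certain amount of error.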

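Since most CPUs today have a shared L2 or L3 cache, the cost of cache line invalidation has been reduced considerably, but it still needs attention when the threads run on different physical CPUs.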
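Where it does still matter, one way to rule out sharing regardless of where the buffer happens to start is to let the compiler align each slot. This is a minimal sketch, assuming GCC or Clang (the `aligned` attribute is a compiler extension) and a 64-byte line; `padded_counter`, `counters` and `on_distinct_lines` are illustrative names that do not appear in the original test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumption: the cache_alignment value of this machine */

/* One counter per cache line: the aligned attribute makes every array
 * element start on a CACHE_LINE boundary. */
struct padded_counter {
    long n;
    char pad[CACHE_LINE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE)));

/* Hypothetical per-thread slots, one for each of the 16 threads in the test. */
struct padded_counter counters[16];

/* Returns 1 if the two counters start in different cache lines. */
int on_distinct_lines(const struct padded_counter *a,
                      const struct padded_counter *b)
{
    assert(a != NULL && b != NULL);
    return (uintptr_t)a / CACHE_LINE != (uintptr_t)b / CACHE_LINE;
}
```

Each thread would then receive `&counters[i].n` as its pthread_create argument, in place of a pointer into the char buffer.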

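There is one thing I do not understand, though: why does this modification affect the two cases so differently? What is going on here? #UPDATE: Later, following dutor's test, I removed the loop variable i used in the for loop, and the performance gap immediately dropped to about 2x. Changing the direction of the loop or replacing for with while made no difference, so this is most likely a problem caused by branch misprediction.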

