
I'm using Intel IPP for multiplication of 2 Images (Arrays).
I'm using Intel IPP 8.2 which comes with Intel Composer 2015 Update 6.

I created a simple function to multiply two large images (the whole project is attached, see below).
I wanted to see the gains from using the Intel IPP multi-threaded library.

Here is the simple project (I also attached the complete project from Visual Studio):

#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"

#include <ctime>
#include <iostream>

using namespace std;

const int height = 6000;
const int width  = 6000;
Ipp32f mInput_image [1 * width * height];
Ipp32f mOutput_image[1 * width * height] = {0};

int main()
{
    IppiSize size = {width, height};

    double start = clock();

    // The step arguments are row strides in bytes: width * sizeof(Ipp32f)
    for (int i = 0; i < 200; i++)
        ippiMul_32f_C1R(mInput_image, width * 4, mInput_image, width * 4, mOutput_image, width * 4, size);

    double end = clock();
    double duration = (end - start) / static_cast<double>(CLOCKS_PER_SEC);

    cout << duration << endl;
    cin.get();

    return 0;
}

I compiled this project once against the Intel IPP single-threaded libraries and once against the multi-threaded ones.

I tried different array sizes, and in all of them the multi-threaded version yields no gain (sometimes it is even slower).
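
For reference, here is a minimal sketch of how one could check that the threaded IPP layer is actually dispatching worker threads; it assumes the ippGetNumThreads / ippSetNumThreads control functions from ippcore.h, which the multi-threaded configuration of IPP 8.2 is expected to expose:

#include "ippcore.h"
#include <iostream>

// Minimal sketch: query (and optionally force) the number of threads the
// multi-threaded IPP layer will use. Assumes the ippGetNumThreads /
// ippSetNumThreads control functions from ippcore.h; not part of the
// original post.
int main()
{
    int numThreads = 0;
    if (ippGetNumThreads(&numThreads) == ippStsNoErr)
        std::cout << "IPP will use " << numThreads << " threads" << std::endl;

    // Force a specific thread count before calling ippiMul_32f_C1R, if desired.
    ippSetNumThreads(4);

    return 0;
}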

I wonder, how come there is no gain from multi-threading in this task?
I know Intel IPP uses AVX, so I thought maybe the task becomes memory bound?

I tried another approach: using OpenMP manually to get a multi-threaded version on top of the Intel IPP single-threaded implementation.
This is the code:

#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"

#include <ctime>
#include <iostream>

using namespace std;

#include <omp.h>

const int height = 5000;
const int width  = 5000;
Ipp32f mInput_image [1 * width * height];
Ipp32f mOutput_image[1 * width * height] = {0};

int main()
{
    IppiSize size = {width, height};

    double start = clock();

    IppiSize blockSize = {width, height / 4};

    const int NUM_BLOCK = 4;
    omp_set_num_threads(NUM_BLOCK);

    Ipp32f*  in;
    Ipp32f*  out;

    //  ippiMul_32f_C1R(mInput_image, width * 4, mInput_image, width * 4, mOutput_image, width * 4, size);

    #pragma omp parallel            \
    shared(mInput_image, mOutput_image, blockSize) \
    private(in, out)
    {
        int id   = omp_get_thread_num();
        // Each thread handles a horizontal strip of height / NUM_BLOCK rows.
        int step = blockSize.width * blockSize.height * id;
        in       = mInput_image  + step;
        out      = mOutput_image + step;
        ippiMul_32f_C1R(in, width * 4, in, width * 4, out, width * 4, blockSize);
    }

    double end = clock();
    double duration = (end - start) / static_cast<double>(CLOCKS_PER_SEC);

    cout << duration << endl;
    cin.get();

    return 0;
}

The results were the same: again, no performance gain.

Is there a way to benefit from multi-threading in this kind of task?
How can I validate whether a task is memory bound and therefore gains nothing from being parallelized? Is there any benefit to parallelizing the task of multiplying 2 arrays on a CPU with AVX?

The computer I tried this on is based on a Core i7 4770K (Haswell).

Here is a link to the Project in Visual Studio 2013.

Thank You.


3 Answers


Your images occupy 200 MB in total (2 x 5000 x 5000 x 4 bytes), so each block contains 50 MB of data. That is more than six times the size of your CPU's L3 cache (see here). Each AVX vector multiplication operates on 256 bits of data, i.e. half a cache line, which means one cache line is consumed per vector instruction (half a cache line per argument). A vectorised multiplication on Haswell has a latency of 5 cycles and the FPU can retire two such instructions per cycle (see here). The memory bus of the i7-4770K is rated at 25.6 GB/s (theoretical maximum!), or no more than about 430 million cache lines per second. The nominal speed of the CPU is 3.5 GHz; the AVX part is clocked slightly lower, say 3.1 GHz. At that speed, an order of magnitude more cache lines per second would be needed to fully feed the AVX engine.

Under these conditions, a single thread of vectorised code almost completely saturates the CPU's memory bus. Adding a second thread might bring a very slight improvement; adding more threads only causes contention and extra overhead. You can confirm this by estimating the bandwidth the loop actually achieves (see the sketch after this list). The only way to speed up this kind of computation is to increase the memory bandwidth:

  • run on a NUMA system with more memory controllers and therefore higher aggregate memory bandwidth, e.g. a multi-socket server board;
  • switch to a different architecture with higher memory bandwidth, e.g. Intel Xeon Phi or a GPGPU.
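
A minimal sketch of that bandwidth check, built on the timing loop from the question; the 25.6 GB/s constant is the data-sheet peak, and the byte count ignores any write-allocate traffic on the destination:

#include "ippi.h"
#include <ctime>
#include <iostream>

// Minimal sketch: estimate the memory bandwidth the multiply loop achieves
// and compare it with the i7-4770K's 25.6 GB/s theoretical peak.
// Same globals and loop as in the question.
const int height = 5000;
const int width  = 5000;
Ipp32f mInput_image [width * height];
Ipp32f mOutput_image[width * height] = {0};

int main()
{
    IppiSize size = {width, height};
    const int iterations = 200;

    double start = clock();
    for (int i = 0; i < iterations; i++)
        ippiMul_32f_C1R(mInput_image, width * 4, mInput_image, width * 4,
                        mOutput_image, width * 4, size);
    double seconds = (clock() - start) / static_cast<double>(CLOCKS_PER_SEC);

    // Per iteration: read one 100 MB source (both sources alias the same array)
    // and write one 100 MB destination; write-allocate traffic is ignored.
    double bytesPerIter = 2.0 * width * height * sizeof(Ipp32f);
    double gbPerSec     = iterations * bytesPerIter / seconds / 1e9;

    std::cout << "Achieved ~" << gbPerSec << " GB/s (peak 25.6 GB/s)" << std::endl;
    return 0;
}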
Answered on 2016-05-02T13:56:48.843

If you are running with Hyper-Threading enabled, you should try the OpenMP version of IPP with one thread per core, setting OMP_PLACES=cores if IPP doesn't do that automatically. If you use the Cilk_ IPP, try varying the number of cilk workers. You might try a test case large enough to span multiple 4 KB pages; then other factors come into play. Ideally, IPP will put the threads to work on different pages. On Linux (or Mac?), transparent huge pages should kick in. On Windows, the Haswell CPUs introduced hardware page prefetch, which should reduce but not eliminate the importance of THP.
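
A minimal sketch of the one-thread-per-physical-core setup suggested above; OMP_PLACES and OMP_PROC_BIND are standard OpenMP 4.0 affinity controls, and the thread count of 4 is specific to the i7-4770K:

#include <omp.h>
#include <iostream>

// Minimal sketch: run with one OpenMP thread per physical core on an
// i7-4770K (4 cores, 8 hardware threads). The affinity itself is set in
// the environment before launching, e.g.:
//   OMP_NUM_THREADS=4 OMP_PLACES=cores OMP_PROC_BIND=close ./mul_test
int main()
{
    omp_set_num_threads(4);  // one thread per physical core, not per hyperthread

    #pragma omp parallel
    {
        #pragma omp single
        std::cout << "Running with " << omp_get_num_threads() << " threads" << std::endl;
    }
    return 0;
}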

Answered on 2016-05-02T11:24:49.280

From some research of my own, it seems that your total CPU cache is around 8 MB. 6000*4/4 (6000 floats split into 4 blocks) is 6 MB. Multiply that by 2 (input and output) and you are outside the cache.

I have not tested this, but increasing the number of blocks should improve performance. Try 8 to start with (your CPU hyper-threads to 8 virtual cores).

Currently, each of the different processes spawned by OpenMP is experiencing cache conflicts and has to (re)load data from main memory. Reducing the block size helps with this (see the sketch below). Having distinct caches would effectively increase the total cache size, but that does not seem to be an option.
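
A minimal sketch of that smaller-blocks idea, splitting the image into many row strips and letting OpenMP schedule them; the strip count of 40 is an illustrative value chosen only because it divides the 5000 rows evenly, not a tuned one:

#include "ippi.h"
#include <omp.h>

const int height = 5000;
const int width  = 5000;
Ipp32f mInput_image [width * height];
Ipp32f mOutput_image[width * height] = {0};

int main()
{
    // Split the image into many small row strips so each strip's working set
    // fits the cache better. 40 strips of 125 rows each is only an example.
    const int NUM_BLOCKS = 40;
    IppiSize blockSize = {width, height / NUM_BLOCKS};

    #pragma omp parallel for
    for (int b = 0; b < NUM_BLOCKS; b++)
    {
        int offset = b * blockSize.height * width;  // first pixel of this strip
        ippiMul_32f_C1R(mInput_image  + offset, width * 4,
                        mInput_image  + offset, width * 4,
                        mOutput_image + offset, width * 4,
                        blockSize);
    }

    return 0;
}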

If you are only doing this as a proof of principle, you may want to test it by running it on your graphics card, although that can be even harder to implement properly.

Answered on 2016-05-01T18:10:02.903