I'm using Intel IPP for multiplication of 2 Images (Arrays).
I'm using Intel IPP 8.2 which comes with Intel Composer 2015 Update 6.
I created a simple function to multiply too large images (The whole project is attached, see below).
I wanted to see the gains using Intel IPP Multi Threaded Library.
Here is the simple project (I also attached the complete project form Visual Studio):
#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"
#include <ctime>
#include <iostream>
using namespace std;
const int height = 6000;
const int width = 6000;
Ipp32f mInput_image [1 * width * height];
Ipp32f mOutput_image[1 * width * height] = {0};
int main()
{
IppiSize size = {width, height};
double start = clock();
for (int i = 0; i < 200; i++)
ippiMul_32f_C1R(mInput_image, 6000 * 4, mInput_image, 6000 * 4, mOutput_image, 6000 * 4, size);
double end = clock();
double douration = (end - start) / static_cast<double>(CLOCKS_PER_SEC);
cout << douration << endl;
cin.get();
return 0;
}
I compiled this project once using Intel IPP Single Threaded and once using Intel IPP Multi Threaded.
I tried different sizes of arrays and in all of them the Multi Threaded version yields no gains (Sometimes it is even slower).
I wonder, how come there is no gain in this task with multi threading?
I know Intel IPP uses the AVX and I thought maybe the task becomes Memory Bounded?
I tried another approach by using OpenMP manually to have Multi Threaded approach using Intel IPP Single Thread implementation.
This is the code:
#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"
#include <ctime>
#include <iostream>
using namespace std;
#include <omp.h>
const int height = 5000;
const int width = 5000;
Ipp32f mInput_image [1 * width * height];
Ipp32f mOutput_image[1 * width * height] = {0};
int main()
{
IppiSize size = {width, height};
double start = clock();
IppiSize blockSize = {width, height / 4};
const int NUM_BLOCK = 4;
omp_set_num_threads(NUM_BLOCK);
Ipp32f* in;
Ipp32f* out;
// ippiMul_32f_C1R(mInput_image, width * 4, mInput_image, width * 4, mOutput_image, width * 4, size);
#pragma omp parallel \
shared(mInput_image, mOutput_image, blockSize) \
private(in, out)
{
int id = omp_get_thread_num();
int step = blockSize.width * blockSize.height * id;
in = mInput_image + step;
out = mOutput_image + step;
ippiMul_32f_C1R(in, width * 4, in, width * 4, out, width * 4, blockSize);
}
double end = clock();
double douration = (end - start) / static_cast<double>(CLOCKS_PER_SEC);
cout << douration << endl;
cin.get();
return 0;
}
The results were the same, again, no gain of performance.
Is there a way to benefit from Multi Threading in this kind of task?
How can I validate whether a task becomes memory bounded and hence no benefit in parallelize it?
Are there benefit to parallelize task of multiplying 2 arrays on CPU with AVX?
The Computers I tried it on is based on Core i7 4770k (Haswell).
Here is a link to the Project in Visual Studio 2013.
Thank You.