谈性能?
代码可以运行得更快~ +4x (!) for 1E6 loops
不管是否使用 OpenMP 工具,让我们从这些开始。OpenMP 线程管理(实例化、任务分发、结果收集 -(a smart reduction(+:sum2)
)**都需要一些附加成本 - 请参阅花费在此上的汇编指令的数量**(比例))。
鉴于您的#pragma
-decorated 代码已经支付了所有这些附加成本(它按照指示做了),您几乎没有得到任何回报1E6 doubles
以换取烧毁的费用 - 但是( 1E6的减少总和几乎只是一个语法 -糖,如果与无附加成本的纯 [SERIAL] 代码执行相比,如果聪明(甚至小于如果不是),那么与不消耗线程管理和任务的附加费用相比,它的总和是相同的~ 18 [ms]
~ 70 [ms]
分发/结果收集开销(此处~ 400 [ms]
用于2 核沙盒演示测试),
0.01000
1000000
the true value is 0.500000 the number of threads we use is 2
OpenMP as-is Time: 0.467055
SERIAL as-is Time: 0.069820 <~~+ 70 [ms] @ 1E6 loops
OpenMP Re-DEF'd Time: 0.328225 | !!
SERIAL Re-DEF'd Time: 0.017899 <~~+~ 6x FASTER 18 [ms] @ 1E6 loops
勘误:mea culpa - 代码避免了一个fDIV
用于底部情况的代码(重新测试显示 ~ +10% 加速 - 见代码)
测试与 1E6 一样少的循环(@-a-few-GHz-CPU-cores ...)会产生比事实更多的噪音。无论如何,我们可以在任何一种策略中变得更快:
OpenMP as-is Time: 0.375825 the estimate value is 0.5000982178
SERIAL as-is Time: 0.062920 the estimate value is 0.5004906150
|||
~300 [us]FASTER--v
OpenMP Re-DEF'd Time: 0.062613 the estimate value is 0.4992401088
~2 [ms]SLOWER--^
||||
SERIAL Re-DEF'd Time: 0.060253 the estimate value is 0.4995912559
公平地注意到,对于更长的循环,循环递增的开销将产生更多的总计算时间,即使使用-O3
,所以重构的代码将表现出所有时间最快的结果,但加速因子将变得更小到~ 1.16x
for25E6
循环
核心缺陷:
非常糟糕的成本不平衡:影响
会损害任何计算的效率
实际上,在最昂贵的语法构造函数(更不用说随机数)中几乎没有什么需要计算(几个fADD
-s, fMUL
, fNEG
, fDIV
-s, ),这至少无法证明那些花费在构建 OpenMP 代码上的成本是合理的- 执行生态系统(然而,我们将展示它甚至可以减少 6 倍以更快的运行)。fLOG
为什么要重新计算,百万+倍的常数值越少?
因此,让我们清除不应该进入任何以绩效为导向的部分的东西:
double C_LogRAND_MAX = log( (double) RAND_MAX );
double C_1divLENMUTA = -1 / lenmuta;
double C_2sub = C_LogRAND_MAX * C_1divLENMUTA;
和 :
#pragma omp parallel for private( i, y ) reduction( +:sum2 )
for( i = 0; i < N; i++ )
{
sum2 += cos( C_1divLENMUTA // fMUL
* log( (double) rand() ) // +costs of keeping the seed-state management
- C_2sub // fSUB
);
}
最后但并非最不重要的一点是,随机序列的并行来源值得再次仔细研究,因为这些工具试图保持其内部状态,这可能会“跨”线程产生麻烦。好消息是,Stack Overflow 可以为解决这个性能问题提供很多帮助。
w/o -O3: =:_____________________________________________________________________:[ns]
SERIAL NOP Time: 3.867352 DIV( 2000000000 ) ~ 0.000000002 ~ 2 [ns]:||:for(){...}loop-overhead :
SERIAL RAND() Time: 10.845900 DIV( 1000000000 ) ~ 0.000000011 ~ +9 [ns]: |||||||||:rand() :
SERIAL (double) RAND() Time: 10.923597 DIV( 1000000000 ) ~ 0.000000011 ~ +0 [ns]: :(double) :
SERIAL LOG( (double) RAND() ) Time: 37.156017 DIV( 1000000000 ) ~ 0.000000037 ~ +27 [ns]: |||||||||||||||||||||||||||:log() :
SERIAL COS( LOG( (double) RAND() ) ) Time: 54.472115 DIV( 800000000 ) ~ 0.000000068 ~ +31 [ns]: |||||||||||||||||||||||||||||||:cos()
SERIAL COS( LOG( (double) RAND() ) ) Time: 55.320798 DIV( 800000000 ) ~ 0.000000069 : w/o -O3 :
w/-O3: :::( :::( (::::::) ::::() ) ) !!. :::( ::::::::: ) !! =:____________: :! :
SERIAL COS( LOG( (double) RAND() ) ) Time: 9.305908 DIV( 800000000 ) ~ 0.000000012 ~ 12 [ns]:|||||||||||| with -O3 :
SERIAL COS( LOG( (double) RAND() ) ) Time: 2.135143 DIV( 200000000 ) ~ 0.000000011 : :
SERIAL LOG( (double) RAND() ) Time: 2.082406 DIV( 200000000 ) ~ 0.000000010 : :
SERIAL (double) RAND() Time: 2.244600 DIV( 200000000 ) ~ 0.000000011
SERIAL RAND() Time: 2.101538 DIV( 200000000 ) ~ 0.000000011
SERIAL NOP Time: 0.000000 DIV( 200000000 ) ~ 0.000000000
^^
||
!||
vvv
OpenMP COS( LOG( (double) RAND() ) ) Time: 33.336248 DIV( 100000000 ) ~ 0.000000333 #pragma omp parallel num_threads( 2 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 0.388479 DIV( 1000000 ) ~ 0.000000388 #pragma omp parallel num_threads( 2 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 37.636114 DIV( 100000000 ) ~ 0.000000376 #pragma omp parallel num_threads( 2 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 38.876272 DIV( 100000000 ) ~ 0.000000389 #pragma omp parallel num_threads( 2 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 44.226391 DIV( 100000000 ) ~ 0.000000442 #pragma omp parallel num_threads( 2 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 0.333573 DIV( 1000000 ) ~ 0.000000334 #pragma omp parallel num_threads( 4 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 35.624111 DIV( 100000000 ) ~ 0.000000356 #pragma omp parallel num_threads( 4 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 37.820558 DIV( 100000000 ) ~ 0.000000378 #pragma omp parallel num_threads( 4 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 38.625498 DIV( 100000000 ) ~ 0.000000386 #pragma omp parallel num_threads( 4 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 39.782386 DIV( 100000000 ) ~ 0.000000398 #pragma omp parallel num_threads( 4 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 0.317120 DIV( 1000000 ) ~ 0.000000317 #pragma omp parallel num_threads( 8 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 34.692555 DIV( 100000000 ) ~ 0.000000347 #pragma omp parallel num_threads( 8 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 0.360407 DIV( 1000000 ) ~ 0.000000360 #pragma omp parallel num_threads( 8 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 3.737517 DIV( 10000000 ) ~ 0.000000374 #pragma omp parallel num_threads( 8 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 0.380087 DIV( 1000000 ) ~ 0.000000380 #pragma omp parallel num_threads( 8 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 0.354283 DIV( 1000000 ) ~ 0.000000354 #pragma omp parallel num_threads( 16 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 35.984292 DIV( 100000000 ) ~ 0.000000360 #pragma omp parallel num_threads( 16 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 3.654442 DIV( 10000000 ) ~ 0.000000365 #pragma omp parallel num_threads( 16 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 37.233374 DIV( 100000000 ) ~ 0.000000372 #pragma omp parallel num_threads( 16 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 4.112637 DIV( 10000000 ) ~ 0.000000411 #pragma omp parallel num_threads( 16 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 37.813872 DIV( 100000000 ) ~ 0.000000378 #pragma omp parallel num_threads( 32 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 0.412896 DIV( 1000000 ) ~ 0.000000413 #pragma omp parallel num_threads( 32 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 34.098855 DIV( 100000000 ) ~ 0.000000341 #pragma omp parallel num_threads( 32 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 35.372660 DIV( 100000000 ) ~ 0.000000354 #pragma omp parallel num_threads( 32 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 39.256430 DIV( 100000000 ) ~ 0.000000393 #pragma omp parallel num_threads( 32 ) with -O3
/////////////////////////////////////////////////////////////////////////////////////////////
//
// -O3
// warning: iteration 2147483647 invokes undefined behavior [-Waggressive-loop-optimizations]
// for( i = 0; i < N; i++ )
// ^~~
********************************************************************************************/