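Q: Is it because the communication time (1) between the two Jetson Nano boards is really large, or is my implementation (2) not taking advantage of the way CyclicDist works?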
The second option is a sure bet: performance on small data sizes was ~100x worse with CyclicDist.
The documentation explicitly warns about this:
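The Cyclic distribution maps indices to locales in a round-robin pattern starting at a given index.
...
Limitations
This distribution has not been tuned for performance.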
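The round-robin rule quoted above can be sketched for the 1-D case as a plain function. This is only an illustration: cyclicOwner is a hypothetical helper, not part of the CyclicDist module, and the real distribution deals multi-dimensional indices over a grid of target locales.

```chapel
// Hypothetical helper (NOT a CyclicDist API): the locale ID that would own
// index `idx` when a 1-D index set is dealt out round-robin from `startIdx`.
proc cyclicOwner(idx: int, startIdx: int, nLocales: int): int {
  const r = (idx - startIdx) % nLocales;   // may be negative when idx < startIdx
  return if r < 0 then r + nLocales else r;
}

// Illustration: indices 1..8 dealt over two locales (e.g. two boards).
for i in 1..8 do
  writeln("index ", i, " -> locale ", cyclicOwner(i, 1, 2));
```

With nLocales = 1, as on a single-locale platform, every index maps to locale 0, so no index is ever owned by a remote locale.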
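The adverse effect on processing efficiency was demonstrated on a single-locale platform, where all data reside in the locale-local memory space, so no NUMA inter-board communication add-on costs are ever incurred. Even so, ~100x worse performance was achieved, compared with Vass' single-forall{} D3-iterated sum-product.
(I had not noticed until now that Vass' performance-motivated change went from the original forall-in-D3-do-{} to another configuration, a forall-in-D2-do-for{} tandem-iterator revision. So far, the small-size tests run with --fast --ccflags -O3 show the forall-in-D2-do-for{} iterator-in-iterator results to be worse by almost half, even worse than the O/P's original triple-forall{} proposal, except for sizes below 512x512 once the -O3 optimisation took place. Yet for the smallest size, 128x128, the original Vass-D3 single-iterator achieved the top performance of ~850 [ns] per cell, surprisingly without --ccflags -O3. This may obviously change for larger --size={ 1024 | 2048 | 4096 | 8192 } data layouts, all the more so if wider-NUMA multi-locale and higher-parallelism devices are put into the race.)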
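For readers unfamiliar with the variant names, here is a minimal sketch of what forall-in-D2-do-for{} is taken to mean here: one parallel forall over the 2-D result domain with a serial for-loop over the summation index. The names size, D2 and sumProductD2 are assumptions made for illustration; this is not Vass' code, and the other variants are not reproduced.

```chapel
config const size = 4;                 // illustrative; the measurements used 128..640
const D2 = {1..size, 1..size};

// "forall-in-D2-do-for{}" as read here: parallel over result cells,
// serial over k. Each iteration owns exactly one C[i, j], so no two
// tasks update the same cell.
proc sumProductD2(A: [D2] real, B: [D2] real): [D2] real {
  var C: [D2] real;                    // zero-initialised
  forall (i, j) in D2 with (ref C) do
    for k in 1..size do
      C[i, j] += A[i, k] * B[k, j];
  return C;
}

// Tiny demo: multiplying by the identity returns the other operand.
var Ident, M: [D2] real;
for (i, j) in D2 {
  Ident[i, j] = if i == j then 1.0 else 0.0;
  M[i, j] = 10 * i + j;
}
writeln(sumProductD2(Ident, M));
```

The explicit with (ref C) intent is used so that the loop body is allowed to modify the outer array C regardless of the default task intent for arrays.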
TiO.run platform uses 1 numLocales,
having 2 physical CPU-cores accessible (numPU-s)
with 2 maxTaskPar parallelism limit
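The three figures above can be queried at run time on any platform. A minimal sketch, using the built-in numLocales and the locale queries here.numPUs() and here.maxTaskPar; the output depends on the machine, so none is shown.

```chapel
// Report the execution environment the measurements depend on.
writeln("numLocales = ", numLocales);        // locales the program runs on
writeln("numPUs     = ", here.numPUs());     // accessible physical cores on this locale
writeln("maxTaskPar = ", here.maxTaskPar);   // tasks this locale can run in parallel
```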
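The use of CyclicDist affects the DATA-to-memory layout, doesn't it?
This was validated by measurements on small sizes, --size={128 | 256 | 512 | 640}, with and without the minor effects of --ccflags -O3: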
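As a hedged illustration of what "dmapped Cyclic Space" in the measurement labels refers to, the sketch below declares an array over a Cyclic-distributed domain and records which locale owns each element. It assumes the Chapel 1.x spelling, dmapped Cyclic(startIdx=...); later releases are understood to rename the distribution to cyclicDist, so the declaration line may need adapting. The names Space, CyclicSpace and grid are illustrative and are not taken from the tested program.

```chapel
use CyclicDist;

config const size = 4;                       // illustrative size
const Space       = {1..size, 1..size};

// ASSUMPTION: Chapel 1.x syntax. Newer releases spell this with cyclicDist.
const CyclicSpace = Space dmapped Cyclic(startIdx=Space.low);

var grid: [CyclicSpace] int;

// Each element stores the ID of the locale that owns it.
forall g in grid do
  g = g.locale.id;

writeln(grid);
```

On a single-locale run every element reports locale 0: the data stay local, yet every access still goes through the distribution's index-mapping machinery.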
// --------------------------------------------------------------------------------------------------------------------------------
// --fast
// ------
//
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 255818 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 3075 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 3040 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 2198 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D3 orig sum-product took 1974 [us] excl. fillRandom()-ops <-- 127x SLOWER with CyclicDist dmapped DATA
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2122 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 252439 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2141444 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 27095 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 25339 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 23493 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D3 orig sum-product took 21631 [us] excl. fillRandom()-ops <-- 98x SLOWER than w/o CyclicDist dmapped data
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 21971 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2122417 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 16988685 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17448207 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 268111 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 270289 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 250896 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D3 orig sum-product took 239898 [us] excl. fillRandom()-ops <-- 71x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 257479 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17391049 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 16932503 [us] excl. fillRandom()-ops <~~ ~2e5 [us] faster without --ccflags -O3
//
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35136377 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 362205 [us] incl. fillRandom()-ops <-- 97x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 367651 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 345865 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the Vass-D3 orig sum-product took 337896 [us] excl. fillRandom()-ops <-- 103x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 351101 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35052849 [us] excl. fillRandom()-ops <~~ ~3e4 [us] faster without --ccflags -O3
//
// --------------------------------------------------------------------------------------------------------------------------------
// --fast --ccflags -O3
// --------------------
//
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 250372 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 3189 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2966 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 2284 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D3 orig sum-product took 1949 [us] excl. fillRandom()-ops <-- 126x FASTER than with dmapped CyclicDist DATA
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2072 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 246965 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2114615 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 37775 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 38866 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 32384 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D3 orig sum-product took 29264 [us] excl. fillRandom()-ops <-- 71x FASTER than with dmapped CyclicDist DATA
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 33973 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2098344 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17136826 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17081273 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 251786 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 266766 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 239301 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D3 orig sum-product took 233003 [us] excl. fillRandom()-ops <~~ ~6e3 [us] faster with --ccflags -O3
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 253642 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17025339 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17081352 [us] excl. fillRandom()-ops <~~ ~2e5 [us] slower with --ccflags -O3
//
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35164630 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 363060 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 489529 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 345742 [us] excl. fillRandom()-ops <-- 104x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D3 orig sum-product took 353353 [us] excl. fillRandom()-ops <-- 102x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 471213 [us] excl. fillRandom()-ops <~~ ~1.2e5 [us] slower with --ccflags -O3
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35075435 [us] excl. fillRandom()-ops
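The slowdown factors annotated in the listing are simply ratios of the measured times. A small sketch that recomputes them from the --fast, excl. fillRandom() rows; slowdown is a hypothetical helper, and for 512x512 the second of the two Cyclic runs is used.

```chapel
// Hypothetical helper: how many times slower the distributed run was.
proc slowdown(distUs: int, baseUs: int): real {
  return (distUs: real) / (baseUs: real);
}

// Times in [us], copied from the --fast rows, excl. fillRandom()-ops.
const sizes    = [128, 256, 512, 640];
const cyclicUs = [252439, 2122417, 16932503, 35052849];   // dmapped Cyclic Space
const vassD3Us = [1974, 21631, 239898, 337896];           // Vass-D3 orig

for (n, c, v) in zip(sizes, cyclicUs, vassD3Us) do
  writeln(n, "x", n, ": ~", slowdown(c, v), "x slower with CyclicDist");
```

The ratios land roughly between 70x and 130x, in line with the ~100x figure quoted in the text.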
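The Chapel team's insights (both design-wise and testing-wise) are important in any case. @Brad was asked to kindly provide similar test coverage and comparisons, mostly for larger sizes, --size={1024 | 2048 | 4096 | 8192 | ...}, and for "wider"-NUMA platforms with multi-locale and many-locale solutions, available at Cray for the Chapel team's R&D, which would not be subject to the hardware limits and the ~60 [s] limit of the public, sponsored, shared TiO.RUN platform.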