gpu - Coprocessor accelerators compared to GPUs

Question

Are coprocessors like the Intel Xeon Phi meant to be used much like GPUs, i.e., by offloading a large number of blocks that all execute a single kernel, so that any speedup comes only from the coprocessor's overall throughput? Or can offloading independent threads (tasks) also improve efficiency?

score 2 · Accepted Answer

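Xeon Phi requires a high degree of both functional parallelism (different threads) and vector parallelism (SIMD). Because its cores are essentially enhanced Pentium processors, serial code runs slowly. This will change with the next generation, which will use faster, more modern cores. Like any coprocessor, the current Xeon Phi also suffers from an I/O bottleneck, because it must communicate over the PCIe bus.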
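As a rough illustration of needing both kinds of parallelism at once, here is a minimal sketch using OpenMP (an assumption: OpenMP is one common way to program Xeon Phi, but the answer does not prescribe it). The `parallel for` part spreads iterations across hardware threads, and the `simd` part asks the compiler to vectorize each thread's chunk; with 512-bit vectors that is 16 single-precision floats per instruction. The function name `saxpy` is just an illustrative choice. Compiled without `-fopenmp`, the pragma is ignored (possibly with a warning) and the loop runs serially, so the result is the same either way.

```c
#include <stddef.h>

/* y = a*x + y over n elements.
 * "parallel for": thread-level parallelism across cores/hardware threads.
 * "simd": vector-level parallelism inside each thread's chunk
 * (a 512-bit register holds 16 floats on Xeon Phi).
 * Without OpenMP support the pragma is ignored and the loop runs serially. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Offloading this function to the card (rather than running it natively there) would additionally need an offload construct, such as Intel's offload extensions or OpenMP 4.0 `target` directives, which are not shown here.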

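So although you can offload kernels to each processor and exploit 512-bit vectorization (similar to GPGPU), you can also split your code into many different functional blocks (i.e., different code/kernels) and run them on different sets of Intel Xeon Phi cores. Again, each of these code blocks must also make use of the 512-bit SIMD vectors.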
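The second style, running different functional blocks concurrently while each one still vectorizes, might look like the sketch below. It assumes OpenMP `sections` as the mechanism; `scale_kernel`, `sum_kernel` and `run_independent_blocks` are hypothetical names standing in for two unrelated pieces of work. Each section can be picked up by a different thread; steering those threads onto specific sets of cores would be done through thread-affinity settings (for example the standard `OMP_PLACES` / `OMP_PROC_BIND` environment variables), which this sketch does not set.

```c
#include <stddef.h>

/* Hypothetical kernel 1: scale a vector in place (vectorizable loop). */
static void scale_kernel(size_t n, float *v, float s)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        v[i] *= s;
}

/* Hypothetical kernel 2: sum a vector (vectorized reduction). */
static float sum_kernel(size_t n, const float *v)
{
    float acc = 0.0f;
    #pragma omp simd reduction(+:acc)
    for (size_t i = 0; i < n; ++i)
        acc += v[i];
    return acc;
}

/* Runs two *different* kernels as independent OpenMP sections, so they
 * can execute at the same time on different threads. Without OpenMP the
 * pragmas are ignored and the two calls simply run one after the other. */
void run_independent_blocks(size_t n, float *a, const float *b, float *total)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        scale_kernel(n, a, 2.0f);

        #pragma omp section
        *total = sum_kernel(n, b);
    }
}
```

The two sections touch disjoint data (`a` versus `b` and `total`), which is what makes it safe to run them concurrently without extra synchronization.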

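Xeon Phi can also run as a native processor, so you can access other resources by mounting an NFS directory tree, communicate between the card and other processors in the cluster over TCP/IP, use MPI, and so on. Note that this is not "offload" but native execution. The PCIe bus is still a significant bottleneck that limits I/O, however.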
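Whether you offload or run natively, the PCIe bottleneck can be reasoned about with a back-of-the-envelope cost model: moving work to the card only helps if the transfer time plus the on-card compute time is less than the host compute time. The sketch below encodes that inequality; the function names and all numbers fed to it are hypothetical estimates you would supply yourself, not measured Xeon Phi figures.

```c
#include <assert.h>
#include <stdbool.h>

/* Estimated time to move `bytes` over the bus: fixed latency plus
 * size divided by effective bandwidth (all values are caller estimates). */
double pcie_transfer_time(double bytes, double bandwidth_bytes_per_s,
                          double latency_s)
{
    assert(bandwidth_bytes_per_s > 0.0);
    return latency_s + bytes / bandwidth_bytes_per_s;
}

/* Offloading pays off only if transfer + device compute beats the host. */
bool offload_pays_off(double host_compute_s, double device_compute_s,
                      double bytes_moved, double bandwidth_bytes_per_s,
                      double latency_s)
{
    double t_offload = pcie_transfer_time(bytes_moved, bandwidth_bytes_per_s,
                                          latency_s)
                       + device_compute_s;
    return t_offload < host_compute_s;
}
```

For example, with a hypothetical 1 GB of data and an assumed effective bandwidth of 6 GB/s, the transfer alone takes about 0.167 s, so a job the host finishes in 0.1 s can never benefit from offloading, however fast the card computes.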

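To summarize:

You can use an offload model similar to the one used with GPGPUs.
Xeon Phi itself can also support functional parallelism (multiple kernels), but each kernel must also make use of 512-bit SIMD.
You can also write native code and use MPI, treating the Xeon Phi as a conventional (non-offload) node (always keeping the PCIe I/O bottleneck in mind).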
