arm - Will a Cortex A9 outperform a C6000 DSP for FP?

Question

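I'm currently using an OMAP L138 processor, which has no hardware FPU. We will be processing spectral data with FP-intensive algorithms, so the ARM side won't be adequate. I'm not the algorithm person, but one of them is "Dynamic Time Warping" (no, I don't know what that means). The initial performance numbers are:

Core i7 laptop @ 2.9GHz: 1 second
Raspberry Pi ARM1176 @ 700MHz: 12 seconds
OMAP L138 ARM926 @ 300MHz: 193 seconds

To make matters worse, the Pi costs about 30% of the price of the board I'm using!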

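I do have a TI C674x, which is the other processor in the OMAP L138. The question is whether I would be better off spending several weeks trying to:

learn DSPLINK, the interop libraries and the toolchain, not to mention paying for Code Composer, or
throw away the L138 and move to a dual Cortex A9 such as the Pandaboard, possibly taking a power hit in the process.

(When I looked at FPU performance on the A8, it was not an improvement over the Rasp Pi, but the Cortex A9 seems to be.)

I understand that the answer is "it depends". Others here have said "you unlock an incredibly fast DSP that can easily outperform the Cortex-A8 if assigned the right job", but for a defined set of work, would I be better off jumping to the A9, even if I have to buy an external DSP later?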
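For readers unfamiliar with the dynamic time warping mentioned in the question, here is a minimal sketch of the textbook algorithm. It is not the asker's implementation; the function name `dtw_distance` and the absolute-difference local cost are assumptions chosen for illustration. It shows why the workload is floating-point heavy: every cell of an n-by-m table needs a subtraction, an absolute value, a three-way minimum and an addition.

```python
def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) dynamic time warping distance.

    Hypothetical helper for illustration only; the local cost is assumed
    to be the absolute difference between two samples.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between samples
            # Extend the cheapest of the three neighbouring alignments.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]
```

Calling `dtw_distance([1, 2, 3], [1, 2, 2, 3])` gives a distance of zero because the second sequence is just a time-stretched copy of the first, which is the kind of mismatch the algorithm is designed to absorb.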

score 5 · Accepted Answer

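That question can't be answered without knowing the clock rates of the DSP and the ARM.

Here is some background:

I just checked the floating point multiplication cycles on the c674x DSP:

It can issue two multiplications per cycle, and each multiplication has a result latency of three cycles (that means you have to wait three additional cycles before the result shows up in the destination register).

You can, however, start two multiplications every cycle because the DSP does not wait for the result. The compiler/assembler will do the required scheduling for you.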
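To see why not waiting for the result matters, here is a rough cycle-count model. It is a sketch under simplifying assumptions, not a description of the real C674x pipeline: it assumes all multiplications are independent, two issue per cycle, and each result becomes visible three cycles after its issue cycle. The function names are hypothetical.

```python
import math


def pipelined_cycles(n_mults, issue_per_cycle=2, extra_latency=3):
    """Cycles until the last result is visible when a batch issues every cycle."""
    if n_mults == 0:
        return 0
    issue_cycles = math.ceil(n_mults / issue_per_cycle)
    # Only the final batch pays the latency; earlier latencies overlap.
    return issue_cycles + extra_latency


def stalled_cycles(n_mults, issue_per_cycle=2, extra_latency=3):
    """Cycles if each batch had to wait for its result before the next issues."""
    batches = math.ceil(n_mults / issue_per_cycle)
    return batches * (1 + extra_latency)


if __name__ == "__main__":
    n = 1000
    print("pipelined:", pipelined_cycles(n), "cycles")
    print("stalled:  ", stalled_cycles(n), "cycles")
```

In this model the latency is paid once at the end of a long run of multiplications instead of once per pair, which is why the sustained rate approaches two multiplications per cycle.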

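This uses only two of the eight available function units of the DSP, so while you do the two multiplications you can, in the same cycle, also do:

two loads/stores (64 bits wide)
four floating point add/subtract instructions (or integer instructions)

Loop control and branching are free and cost you nothing on the DSP.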

That makes a total of six floating point operations per cycle with parallel loads/stores and loop control.

ARM-NEON on the other hand can, in floating point mode:

Issue two multiplications per cycle. Latency is comparable, and the instructions are also pipeline-able like on the DSP. Loading/storing takes extra time as does add/subtract stuff. Loop control and branching will very likely go for free in well written code.

So in summary the DSP does three times as much work per cycle as the Cortex-A9 NEON unit.

Now you can check the clock-rates of DSP and the ARM and see what is faster for your job.
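As a rough illustration of that comparison, the sketch below multiplies the per-cycle figures from this answer by a clock rate. The clock rates are assumptions, not measurements: 300 MHz for the DSP is borrowed from the ARM926 figure in the question, and 1000 MHz for the Cortex-A9 is a hypothetical value. The result is a theoretical peak that ignores memory stalls and how well the code is scheduled.

```python
DSP_FLOPS_PER_CYCLE = 6    # figure quoted in this answer for the C674x
NEON_FLOPS_PER_CYCLE = 2   # figure quoted in this answer for NEON

ASSUMED_DSP_MHZ = 300      # assumption: same clock as the ARM926 in the question
ASSUMED_A9_MHZ = 1000      # assumption: hypothetical Cortex-A9 clock


def peak_mflops(flops_per_cycle, clock_mhz):
    """Theoretical peak in MFLOPS; real code will achieve less."""
    return flops_per_cycle * clock_mhz


def break_even_a9_mhz(dsp_mhz):
    """A9 clock at which its NEON peak equals the DSP peak."""
    return dsp_mhz * DSP_FLOPS_PER_CYCLE / NEON_FLOPS_PER_CYCLE


if __name__ == "__main__":
    print("DSP peak:", peak_mflops(DSP_FLOPS_PER_CYCLE, ASSUMED_DSP_MHZ), "MFLOPS")
    print("A9 peak: ", peak_mflops(NEON_FLOPS_PER_CYCLE, ASSUMED_A9_MHZ), "MFLOPS")
    print("A9 break-even clock:", break_even_a9_mhz(ASSUMED_DSP_MHZ), "MHz")
```

With these assumed clocks the two peaks land close together, so the outcome depends on the actual clock rates of your parts and on how much of the peak your code can sustain.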

Oh, one thing: With well-written DSP code you will almost never see a cache miss during loads because you move the data from RAM to the cache using DMA before you access the data. This gives impressive speed advantages for the DSP as well.

score 0 · Accepted Answer

It does depend on the application but, generally speaking, it is rare these days for special-purpose processors to beat general-purpose processors. General-purpose processors now have higher clock rates and multimedia acceleration. Even for a numerically intensive algorithm where a DSP may have an edge, the increased engineering complexity of dealing with a heterogeneous multi-processor environment makes this type of solution problematic from an ROI perspective.
