arm - Vectorizing a simple low-pass filter with ARM NEON

Question

I have a simple one-pole low-pass filter (used for parameter smoothing) that can be described by the following formula:

y[n] = (1-a) * y[n-1] + a * x[n]

How can I vectorize this efficiently on ARM NEON, using intrinsics? Is it even possible? The problem is that every computation needs the previous result.

score 3 · Accepted Answer

Assuming that you perform vector operations M elements at a time (I think NEON is 128 bits wide, so that would be M=4 32-bit elements), you can unroll the difference equation by a factor of M pretty easily for the simple single-pole filter. Assume that you have already calculated all outputs up to y[n]. Then, you can calculate the next four as follows:

y[n+1] = (1-a)*y[n] + a*x[n+1]
y[n+2] = (1-a)*y[n+1] + a*x[n+2] = (1-a)*((1-a)*y[n] + a*x[n+1]) + a*x[n+2]
       = (1-a)^2*y[n] + a*(1-a)*x[n+1] + a*x[n+2]
...

In general, you can write y[n+k] as:

y[n+k] = (1-a)^k*y[n] + sum_{i=1}^k a*(1-a)^{k-i}*x[n+i]

I know the above is difficult to read (maybe we can migrate this question over to Signal Processing and I can re-typeset in LaTeX). But, given an initial condition y[n] (which is assumed to be the last output calculated on the previous vectorized iteration), you can calculate the next M outputs in parallel, as the rest of the unrolled filter has an FIR-like structure.
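
As a rough sketch of the block recursion described above for M=4 (not the answerer's own code): the function name `lowpass_block4`, its signature, and the way the last output is passed in and out through `y_state` are assumptions made for illustration. The NEON path uses standard `arm_neon.h` intrinsics and is only compiled when the compiler reports NEON support; otherwise a scalar loop evaluates the same coefficients.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define LP_HAVE_NEON 1
#endif

/* Hypothetical helper: y[n] = (1-a)*y[n-1] + a*x[n], four outputs per step.
 * *y_state holds the previous output on entry and the last output on return.
 * x and y must not overlap. */
void lowpass_block4(const float *x, float *y, size_t n, float a, float *y_state)
{
    const float b = 1.0f - a;
    /* p[k] scales the carried-in output for lane k: (1-a)^(k+1). */
    const float p[4] = { b, b * b, b * b * b, b * b * b * b };
    /* c[j][k] scales the j-th input of the block in output lane k:
     * a*(1-a)^(k-j) for k >= j, zero otherwise (lower-triangular). */
    const float c[4][4] = {
        { a,    a * b, a * b * b, a * b * b * b },
        { 0.0f, a,     a * b,     a * b * b     },
        { 0.0f, 0.0f,  a,         a * b         },
        { 0.0f, 0.0f,  0.0f,      a             },
    };
    float prev = *y_state;
    size_t i = 0;

#ifdef LP_HAVE_NEON
    const float32x4_t vp  = vld1q_f32(p);
    const float32x4_t vc0 = vld1q_f32(c[0]);
    const float32x4_t vc1 = vld1q_f32(c[1]);
    const float32x4_t vc2 = vld1q_f32(c[2]);
    const float32x4_t vc3 = vld1q_f32(c[3]);
    for (; i + 4 <= n; i += 4) {
        float32x4_t acc = vmulq_n_f32(vp, prev);   /* p * y[n]            */
        acc = vmlaq_n_f32(acc, vc0, x[i]);         /* += c0 * x[n+1]      */
        acc = vmlaq_n_f32(acc, vc1, x[i + 1]);
        acc = vmlaq_n_f32(acc, vc2, x[i + 2]);
        acc = vmlaq_n_f32(acc, vc3, x[i + 3]);
        vst1q_f32(y + i, acc);
        prev = vgetq_lane_f32(acc, 3);             /* carry last output   */
    }
#else
    for (; i + 4 <= n; i += 4) {
        for (int k = 0; k < 4; ++k) {
            float acc = p[k] * prev;
            for (int j = 0; j < 4; ++j)
                acc += c[j][k] * x[i + j];
            y[i + k] = acc;
        }
        prev = y[i + 3];
    }
#endif

    /* Remaining 0..3 samples: plain first-order recursion. */
    for (; i < n; ++i) {
        prev = b * prev + a * x[i];
        y[i] = prev;
    }
    *y_state = prev;
}
```

Each block of four outputs costs five vector multiplies or multiply-accumulates, and the next block still has to wait for lane 3 of the previous one, so the serial dependency is shortened rather than removed.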

There are some caveats to this approach: if M becomes large, then you end up multiplying a bunch of numbers together in order to get the effective FIR coefficients for the unrolled filters. Depending upon your number format and the value of a, this could have numerical precision implications. Also, you don't get an M-fold speedup with this approach: you end up calculating y[n+k] with what amounts to a k-tap FIR filter. Although you're calculating M outputs in parallel, the fact that you have to do k multiply-accumulate operations instead of the simple first-order recursive implementation diminishes some of the benefit to vectorization.

score 0 · Accepted Answer

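In general, you can only vectorize sets of computations that are completely independent. But in your IIR low-pass, every output depends on another one (except the first), so it cannot be vectorized.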

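If your variable "a" is large enough that (1-a)^n decays quickly below your required noise floor or allowed error, you can replace your IIR with a short FIR filter approximation and vectorize that convolution. But this is unlikely to be any faster.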
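
To get a feel for how long such an FIR approximation would have to be, here is a small sketch. It assumes the filter's impulse response is h[k] = a*(1-a)^k, so that truncating after N taps drops a tail whose sum is (1-a)^N. The helper names `fir_taps_needed` and `fir_taps` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: smallest N with (1-a)^N <= tol.
 * Requires 0 < a <= 1 and tol > 0, otherwise the loop never ends. */
size_t fir_taps_needed(float a, float tol)
{
    double tail = 1.0;   /* (1-a)^n, the sum of the taps dropped after n */
    size_t n = 0;
    while (tail > tol) {
        tail *= 1.0 - a;
        ++n;
    }
    return n;
}

/* Hypothetical helper: fill h[k] = a*(1-a)^k for k = 0 .. n_taps-1. */
void fir_taps(float a, float *h, size_t n_taps)
{
    float w = a;
    for (size_t k = 0; k < n_taps; ++k) {
        h[k] = w;
        w *= 1.0f - a;
    }
}
```

With a tolerance of 1e-3, a = 0.5 needs 10 taps, while a = 0.01 (a more typical value for parameter smoothing) needs 688, which illustrates why the FIR route is unlikely to pay off for slow smoothers.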

score 0 · Accepted Answer

How about expanding the equation to 4 steps and using a matrix multiplication? a is constant, so the matrix can be precomputed.
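
Written out for four steps, that idea would look roughly like this, where b = 1 - a is a shorthand introduced here and not part of the original post:

```latex
\begin{bmatrix} y[n+1] \\ y[n+2] \\ y[n+3] \\ y[n+4] \end{bmatrix}
=
y[n] \begin{bmatrix} b \\ b^{2} \\ b^{3} \\ b^{4} \end{bmatrix}
+
a \begin{bmatrix}
1     & 0     & 0 & 0 \\
b     & 1     & 0 & 0 \\
b^{2} & b     & 1 & 0 \\
b^{3} & b^{2} & b & 1
\end{bmatrix}
\begin{bmatrix} x[n+1] \\ x[n+2] \\ x[n+3] \\ x[n+4] \end{bmatrix},
\qquad b = 1 - a
```

Both the vector of powers and the lower-triangular matrix depend only on a, so they can be computed once up front. Only y[n] and the four inputs change from block to block.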

score 0 · Accepted Answer

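You can only really vectorize this if you want to apply the same filter to multiple signals, e.g. if it is a stereo audio signal, then you could process the left and right channels in parallel. Four or eight channels in parallel would obviously be better.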
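
A minimal sketch of the multi-channel case, assuming four channels stored as interleaved frames of four floats. The function name `lowpass_4ch` and the data layout are assumptions for illustration. Each NEON lane carries one channel's state, so no lane depends on another. The update is written as y += a*(x - y), which is algebraically the same as (1-a)*y + a*x.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define LP4_HAVE_NEON 1
#endif

/* Hypothetical helper: the same one-pole filter on 4 interleaved channels.
 * x and y hold `frames` frames of 4 floats each; state[c] is the previous
 * output of channel c on entry and the last output on return. */
void lowpass_4ch(const float *x, float *y, size_t frames, float a,
                 float state[4])
{
#ifdef LP4_HAVE_NEON
    float32x4_t vy = vld1q_f32(state);
    for (size_t n = 0; n < frames; ++n) {
        const float32x4_t vx = vld1q_f32(x + 4 * n);
        vy = vmlaq_n_f32(vy, vsubq_f32(vx, vy), a);  /* y += a*(x - y) */
        vst1q_f32(y + 4 * n, vy);
    }
    vst1q_f32(state, vy);
#else
    /* Scalar fallback with identical arithmetic, one channel at a time. */
    for (size_t n = 0; n < frames; ++n) {
        for (int c = 0; c < 4; ++c) {
            state[c] += a * (x[4 * n + c] - state[c]);
            y[4 * n + c] = state[c];
        }
    }
#endif
}
```

For example, with a = 0.5, zero initial state, and the frame {1, 2, 3, 4} fed in twice, the outputs are {0.5, 1, 1.5, 2} and then {0.75, 1.5, 2.25, 3}.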
