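I have a simple single-pole low-pass filter (used for parameter smoothing) that can be described by the following formula:
y[n] = (1-a) * y[n-1] + a * x[n]
How can I vectorize this efficiently on ARM NEON using intrinsics? Is it even possible? The problem is that every computation needs the previous result.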
Assuming that you perform vector operations M elements at a time (NEON's quad registers are 128 bits wide, so that would be M=4 32-bit elements), you can unroll the difference equation by a factor of M pretty easily for the simple single-pole filter. Assume that you have already calculated all outputs up to y[n]. Then, you can calculate the next four as follows:
y[n+1] = (1-a)*y[n] + a*x[n+1]
y[n+2] = (1-a)*y[n+1] + a*x[n+2] = (1-a)*((1-a)*y[n] + a*x[n+1]) + a*x[n+2]
= (1-a)^2*y[n] + a*(1-a)*x[n+1] + a*x[n+2]
...
In general, you can write y[n+k] as:
y[n+k] = (1-a)^k*y[n] + sum_{i=1}^k a*(1-a)^{k-i}*x[n+i]
I know the above is difficult to read (maybe we can migrate this question over to Signal Processing and I can re-typeset in LaTeX). But, given an initial condition y[n] (which is assumed to be the last output calculated on the previous vectorized iteration), you can calculate the next M outputs in parallel, as the rest of the unrolled filter has an FIR-like structure.
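As a rough illustration of the unrolling described above, here is a minimal sketch in portable C for M=4. It is not NEON intrinsics code; the loops over `k` are the part that would map onto the four vector lanes. The function names `onepole_scalar` and `onepole_block4` and the `state` parameter are made up for this example.

```c
#include <stddef.h>

/* Reference (hypothetical name): y[n] = (1-a)*y[n-1] + a*x[n].
 * *state carries the last output in and out of the call. */
void onepole_scalar(const float *x, float *y, size_t n, float a, float *state)
{
    float prev = *state;
    for (size_t i = 0; i < n; ++i) {
        prev = (1.0f - a) * prev + a * x[i];
        y[i] = prev;
    }
    *state = prev;
}

/* Unrolled by 4 (hypothetical name): every output in a block depends only
 * on the last output of the previous block and on the block's own inputs. */
void onepole_block4(const float *x, float *y, size_t n, float a, float *state)
{
    const float b = 1.0f - a;
    /* decay[k] = (1-a)^(k+1): weight of the carried-in output for lane k */
    const float decay[4] = { b, b * b, b * b * b, b * b * b * b };
    /* tap[j] = a*(1-a)^j: weight of an input that lies j samples back */
    const float tap[4] = { a, a * b, a * b * b, a * b * b * b };
    float prev = *state;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        float out[4];
        for (int k = 0; k < 4; ++k)          /* lane-parallel */
            out[k] = decay[k] * prev;
        for (int j = 0; j < 4; ++j)          /* FIR-like part */
            for (int k = j; k < 4; ++k)      /* lane k only sees inputs up to x[i+k] */
                out[k] += tap[j] * x[i + k - j];
        for (int k = 0; k < 4; ++k)
            y[i + k] = out[k];
        prev = out[3];                       /* initial condition for the next block */
    }
    for (; i < n; ++i) {                     /* leftover samples: plain recursion */
        prev = b * prev + a * x[i];
        y[i] = prev;
    }
    *state = prev;
}
```

Both functions should agree up to floating-point rounding, which makes the scalar version a convenient check for any vectorized variant.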
There are some caveats to this approach: if M becomes large, then you end up multiplying a bunch of numbers together in order to get the effective FIR coefficients for the unrolled filters. Depending upon your number format and the value of a, this could have numerical precision implications. Also, you don't get an M-fold speedup with this approach: you end up calculating y[n+k] with what amounts to a k-tap FIR filter. Although you're calculating M outputs in parallel, the fact that you have to do k multiply-accumulate operations instead of the simple first-order recursive implementation diminishes some of the benefit to vectorization.
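In general, you can only vectorize sets of computations that are completely independent of each other. In your IIR low-pass, however, every output (except the first) depends on a previous one, so it cannot be vectorized directly.
If your coefficient a is large enough that (1-a)^n quickly decays below your required noise floor or allowed error, you can replace the IIR with a short FIR approximation and vectorize that convolution. This is unlikely to be faster, though.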
How about expanding the equation over 4 steps and using a matrix multiplication? a is constant, so the matrix can be precomputed.
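One way to write that 4-step expansion in matrix form, as a sketch with the shorthand b = 1-a (a symbol introduced only for this example), is:

```latex
\begin{bmatrix} y[n+1] \\ y[n+2] \\ y[n+3] \\ y[n+4] \end{bmatrix}
=
y[n]
\begin{bmatrix} b \\ b^{2} \\ b^{3} \\ b^{4} \end{bmatrix}
+
a
\begin{bmatrix}
1     & 0     & 0 & 0 \\
b     & 1     & 0 & 0 \\
b^{2} & b     & 1 & 0 \\
b^{3} & b^{2} & b & 1
\end{bmatrix}
\begin{bmatrix} x[n+1] \\ x[n+2] \\ x[n+3] \\ x[n+4] \end{bmatrix},
\qquad b = 1 - a .
```

Both the column vector and the lower-triangular matrix depend only on a, so they can be computed once; y[n+4] then serves as the initial condition for the next block of four.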
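You can only really vectorize this if you want to apply the same filter to multiple signals; for example, with a stereo audio signal you can process the left and right channels in parallel. Four or eight channels in parallel would obviously be better.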
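A minimal sketch of the multi-channel idea, assuming four channels stored interleaved (frame by frame) and the same coefficient a for all of them. The function name `onepole_4ch` and the data layout are assumptions made for this example; the NEON path is only compiled when the compiler defines `__ARM_NEON`, otherwise a scalar fallback is used.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: x and y hold `frames` frames of 4 interleaved
 * channels; state[c] holds the previous output of channel c. */
void onepole_4ch(const float *x, float *y, size_t frames, float a, float state[4])
{
#if defined(__ARM_NEON)
    const float32x4_t va = vdupq_n_f32(a);
    float32x4_t s = vld1q_f32(state);            /* one lane per channel */
    for (size_t n = 0; n < frames; ++n) {
        const float32x4_t in = vld1q_f32(x + 4 * n);
        /* s + a*(in - s) is algebraically (1-a)*s + a*in */
        s = vmlaq_f32(s, vsubq_f32(in, s), va);
        vst1q_f32(y + 4 * n, s);
    }
    vst1q_f32(state, s);
#else
    for (size_t n = 0; n < frames; ++n) {
        for (int c = 0; c < 4; ++c) {
            state[c] = (1.0f - a) * state[c] + a * x[4 * n + c];
            y[4 * n + c] = state[c];
        }
    }
#endif
}
```

The recursion still runs sample by sample in time; the parallelism comes entirely from the independent channels, so each lane does exactly the work of the scalar filter.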