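I have built a matrix wrapper in Java that is optimized with intrinsics (through JNI). I would like some confirmation of the approach: can you give me some hints about matrix optimization? What I am going to implement is: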
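A matrix can be represented as four sets of buffers/arrays: one for horizontal access, one for vertical access, one for diagonal access, and a command buffer so that matrix elements are computed only when they are needed. Here is an illustration.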
Matrix signature:
0 1 2 3
4 5 6 7
8 9 1 3
3 5 2 9
First (horizontal) set:
horSet[0]={0,1,2,3} horSet[1]={4,5,6,7} horSet[2]={8,9,1,3} horSet[3]={3,5,2,9}
Second(vertical) set:
verSet[0]={0,4,8,3} verSet[1]={1,5,9,5} verSet[2]={2,6,1,2} verSet[3]={3,7,3,9}
Third (optional), a diagonal set:
diagS={0,5,1,9} //just in case some calculation needs this
Fourth (calculation list, in a "one calculation, one data" fashion) set:
calc={0,2,1,3,2,5} --->0 means multiply by the next element
1 means add the next element
2 means divide by the next element
so this list means
( (a[i]*2)+3 ) / 5 when only a[i] is needed.
Example for fourth set:
A.mult(2), A.sum(3), A.div(5), A.mult(B)
(to list) (to list) (to list) (calculate *+/ just in time when A is needed )
so only one memory access for four operations.
loop start
a[i] = b[i] * ( ( a[i]*2) +3 ) / 5 only for A.mult(B)
loop end
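The deferred-calculation list above can be sketched in plain Java roughly as follows. This is only an illustration under assumptions: the class name `LazyVector` and its method names are hypothetical, the opcodes follow the list above (0 = multiply, 1 = add, 2 = divide), and no JNI or intrinsics are involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: scalar operations are recorded, not executed,
// and folded into a single pass when the data is actually needed.
final class LazyVector {
    // Opcodes as in the post: 0 = multiply, 1 = add, 2 = divide.
    static final int MUL = 0, ADD = 1, DIV = 2;

    private final double[] data;
    private final List<double[]> calc = new ArrayList<>(); // {opcode, operand} pairs

    LazyVector(double[] data) { this.data = data.clone(); }

    LazyVector mult(double s) { calc.add(new double[]{MUL, s}); return this; }
    LazyVector sum(double s)  { calc.add(new double[]{ADD, s}); return this; }
    LazyVector div(double s)  { calc.add(new double[]{DIV, s}); return this; }

    // Applies all pending operations to one element value.
    private double resolve(double x) {
        for (double[] op : calc) {
            switch ((int) op[0]) {
                case MUL: x *= op[1]; break;
                case ADD: x += op[1]; break;
                default:  x /= op[1]; break;
            }
        }
        return x;
    }

    // Element-wise product with b: each a[i] is read once, resolved,
    // multiplied by b[i], and written back once.
    LazyVector mult(double[] b) {
        for (int i = 0; i < data.length; i++) {
            data[i] = b[i] * resolve(data[i]);
        }
        calc.clear(); // the pending operations are now baked into data
        return this;
    }

    // Returns a copy with any still-pending operations applied.
    double[] toArray() {
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) out[i] = resolve(data[i]);
        return out;
    }
}
```

With this sketch, `new LazyVector(a).mult(2).sum(3).div(5).mult(b)` touches each element of `a` only inside the final `mult(b)` loop.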
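As shown above, when the elements of a column need to be accessed, the second set provides contiguous access, with no strides. The first set does the same for horizontal (row) access.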
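A minimal sketch of how the horizontal, vertical and diagonal sets could be built from an ordinary 2D array. The class name `DualLayoutMatrix` is hypothetical; the field names follow the illustration, and `diagS` is assumed to hold the main diagonal.

```java
// Hypothetical container that stores the same matrix twice:
// once row by row and once column by column.
final class DualLayoutMatrix {
    final double[][] horSet; // horSet[r] is row r, contiguous
    final double[][] verSet; // verSet[c] is column c, contiguous
    final double[] diagS;    // main diagonal

    DualLayoutMatrix(double[][] rows) {
        int n = rows.length, m = rows[0].length;
        horSet = new double[n][];
        verSet = new double[m][n];
        diagS = new double[Math.min(n, m)];
        for (int r = 0; r < n; r++) {
            horSet[r] = rows[r].clone();
            for (int c = 0; c < m; c++) {
                verSet[c][r] = rows[r][c]; // strided write once, contiguous reads later
            }
        }
        for (int i = 0; i < diagS.length; i++) diagS[i] = rows[i][i];
    }
}
```

The strided access is paid once at construction, which is also why initialization takes longer and memory use doubles.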
This should make some things easier and some things harder:
Easier:
**Matrix transposing operation.
Just swapping the pointers horSet[x] and verSet[x] is enough.
**Matrix * Matrix multiplication.
One matrix gives one of its horizontal set and other matrix gives vertical buffer.
Dot product of these must be highly parallelizable for intrinsics/multithreading.
If the multiplication order is inverse, then horizontal and verticals are switched.
**Matrix * vector multiplication.
Same as above, just a vector can be taken as horizontal or vertical freely.
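The "easier" operations can be sketched as follows: transposition by swapping the two references, and multiplication as dot products between a row of the left matrix and a column of the right matrix. The class `TwoSetMatrix` and its methods are hypothetical, and the inner loop is plain scalar Java rather than AVX intrinsics.

```java
// Hypothetical sketch: transpose by reference swap, multiply by
// pairing contiguous rows (left operand) with contiguous columns (right operand).
final class TwoSetMatrix {
    double[][] horSet; // rows
    double[][] verSet; // columns

    TwoSetMatrix(double[][] rows) {
        int n = rows.length, m = rows[0].length;
        horSet = new double[n][];
        verSet = new double[m][n];
        for (int r = 0; r < n; r++) {
            horSet[r] = rows[r].clone();
            for (int c = 0; c < m; c++) verSet[c][r] = rows[r][c];
        }
    }

    // O(1) transpose: the rows of the transpose are the columns of the original.
    void transpose() {
        double[][] tmp = horSet;
        horSet = verSet;
        verSet = tmp;
    }

    // Returns this * other; every entry is a dot product of two contiguous arrays.
    TwoSetMatrix times(TwoSetMatrix other) {
        int n = horSet.length, m = other.verSet.length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++) {
            double[] row = horSet[i];
            for (int j = 0; j < m; j++) {
                double[] col = other.verSet[j];
                double sum = 0;
                for (int k = 0; k < row.length; k++) sum += row[k] * col[k];
                c[i][j] = sum;
            }
        }
        return new TwoSetMatrix(c); // rebuilds both sets for the result
    }
}
```

Each `(i, j)` dot product is independent of the others, so the two outer loops are the natural place to split work across threads.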
Harder:
** Doubling memory requirement is bad for many cases.
** Initializing a matrix takes longer.
** When a matrix is multiplied from the left, it needs a vertical-->horizontal
update of its sets if it is going to be multiplied from the right afterwards (and vice versa).
(If a transposition is taken in between, this does not count.)
Neutral:
** Same matrix can be multiplied with two other matrices to get two different
results such as A=A*B(saved in horizontal sets) A=C*A(saved in vertical sets)
then A=A*A gives A*B*C*A(in horizontal) and C*A*A*B (in vertical) without
copying A.
** If a matrix is always multiplied from the left or always from the right, no access
or multiplication will need an update, and every access will be contiguous in RAM.
** Only using horizontals before transposing, only using verticals after,
should not break any rules.
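One way to handle the vertical<-->horizontal update described above is to mark the other set as stale on every write and refresh it only when it is actually read. The sketch below assumes a square matrix; the class `SyncedMatrix`, its methods and the `syncCount` counter are all hypothetical names used for illustration.

```java
// Hypothetical sketch: each write marks the opposite set as stale;
// the copy between sets happens lazily, on the first read that needs it.
final class SyncedMatrix {
    private final int n;
    private final double[][] horSet, verSet;
    private boolean horStale, verStale;
    int syncCount; // number of full set-to-set copies performed

    SyncedMatrix(int n) {
        this.n = n;
        horSet = new double[n][n];
        verSet = new double[n][n];
    }

    // Row view; refreshed from the column set only if it is stale.
    double[][] rows() {
        if (horStale) {
            for (int r = 0; r < n; r++)
                for (int c = 0; c < n; c++) horSet[r][c] = verSet[c][r];
            horStale = false;
            syncCount++;
        }
        return horSet;
    }

    // Column view; refreshed from the row set only if it is stale.
    double[][] cols() {
        if (verStale) {
            for (int c = 0; c < n; c++)
                for (int r = 0; r < n; r++) verSet[c][r] = horSet[r][c];
            verStale = false;
            syncCount++;
        }
        return verSet;
    }

    void writeRow(int r, double[] values) {
        rows(); // make sure the row set is current before modifying it
        System.arraycopy(values, 0, horSet[r], 0, n);
        verStale = true;
    }

    void writeCol(int c, double[] values) {
        cols(); // make sure the column set is current before modifying it
        System.arraycopy(values, 0, verSet[c], 0, n);
        horStale = true;
    }
}
```

A matrix that is only ever written and read through one side never triggers a copy, which matches the "always from the left or always from the right" case. Note that the returned arrays are the internal buffers, so callers must go through `writeRow`/`writeCol` to keep the flags correct.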
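The main goal is to have matrices of size (multiple of 8, multiple of 8) and to apply AVX intrinsics with multiple threads (each thread working on one set concurrently).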
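So far I have only implemented the vector * vector dot product. If you programming gurus can point me in a direction, I will go into this.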
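The dot product I wrote (using intrinsics) is 6x faster than the loop-unrolled version (which is itself twice as fast as the plain multiplication version). When multithreading is enabled in the wrapper, it also gets stuck at the memory bandwidth cap (8x --> it uses nearly 20 GB/s, which is close to the limit of my DDR3). I have already tried OpenCL; it is somewhat slow on the CPU but great on the GPU.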
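For reference, a multithreaded dot product can be sketched in plain Java by splitting the vectors into contiguous chunks and summing the partial results. This is not the JNI/AVX version described above; the class `ParallelDot` and the `chunks` parameter are assumptions made for illustration.

```java
import java.util.stream.IntStream;

// Hypothetical sketch: each chunk is a contiguous slice processed by one task;
// the partial sums are combined at the end.
final class ParallelDot {
    static double dot(double[] a, double[] b, int chunks) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        int n = a.length;
        return IntStream.range(0, chunks).parallel().mapToDouble(t -> {
            // Chunk boundaries; long arithmetic avoids overflow for large n.
            int from = (int) ((long) n * t / chunks);
            int to = (int) ((long) n * (t + 1) / chunks);
            double sum = 0;
            for (int i = from; i < to; i++) sum += a[i] * b[i];
            return sum;
        }).sum();
    }
}
```

A dot product performs one multiply-add per two loaded values, so once several threads run it, memory bandwidth rather than arithmetic is the likely limit, consistent with the measurement above.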
Thank you.
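Edit: How would a "block matrix" buffer perform? When multiplying large matrices, small blocks are multiplied in a special way, and the cache is probably used to reduce main-memory accesses. But this would need even more updates between matrix multiplications, among the vertical, horizontal and diagonal sets and this block layout.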
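To make the block idea concrete, here is a minimal sketch of a blocked (tiled) multiplication over flat row-major arrays. It is only one possible arrangement, not a tuned implementation: the class `BlockedMultiply`, the parameter names and the block size `bs` are assumptions, and a good `bs` depends on the cache sizes of the target machine.

```java
// Hypothetical sketch: C (n x m) = A (n x k) * B (k x m), all stored row-major
// in flat arrays. The loops are tiled so a bs x bs working set is reused
// while it is still likely to be in cache.
final class BlockedMultiply {
    static double[] multiply(double[] a, double[] b, int n, int k, int m, int bs) {
        double[] c = new double[n * m];
        for (int i0 = 0; i0 < n; i0 += bs) {
            int iMax = Math.min(i0 + bs, n);
            for (int k0 = 0; k0 < k; k0 += bs) {
                int kMax = Math.min(k0 + bs, k);
                for (int j0 = 0; j0 < m; j0 += bs) {
                    int jMax = Math.min(j0 + bs, m);
                    // Multiply one tile of A by one tile of B into one tile of C.
                    for (int i = i0; i < iMax; i++) {
                        for (int kk = k0; kk < kMax; kk++) {
                            double aik = a[i * k + kk];
                            for (int j = j0; j < jMax; j++) {
                                c[i * m + j] += aik * b[kk * m + j];
                            }
                        }
                    }
                }
            }
        }
        return c;
    }
}
```

In this i-k-j ordering the innermost loop walks both `b` and `c` along rows, so this particular sketch reads only row-major data and does not need a separate vertical copy.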