While I realize this is a niche question, does anyone know of a matrix-matrix multiplication algorithm that performs really well (meaning it reaches a high fraction of the peak FLOPS of the CPU, or possibly the GPU) for matrices of sizes between 100x100 and 500x500?
I know xgemm and xgemm3m are nice, but unfortunately they only reach high FLOPS for matrices larger than 1000x1000.
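To make "high FLOPS" concrete, here is a minimal sketch of how the achieved rate could be measured for a given size. It assumes the conventional count of 2*m*n*k floating-point operations for an (m x k) by (k x n) product. The names gemm_flops, naive_matmul and measure_gflops are made up for illustration and are not part of any library; naive_matmul is only a slow placeholder for whichever routine is actually being compared.

```python
import time


def gemm_flops(m, n, k):
    """Conventional operation count for an (m x k) by (k x n) product:
    one multiply and one add per inner-loop step."""
    return 2 * m * n * k


def naive_matmul(a, b):
    """Reference product of two lists of lists.
    Placeholder only: swap in the routine under test."""
    cols = list(zip(*b))  # columns of b, so each dot product is row-by-column
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def measure_gflops(matmul, size, repeats=3):
    """Time matmul on two size x size matrices and return the best
    observed rate in GFLOP/s (assumed 2*size**3 operations per call)."""
    a = [[float(i + j) for j in range(size)] for i in range(size)]
    b = [[float(i - j) for j in range(size)] for i in range(size)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        matmul(a, b)
        best = min(best, time.perf_counter() - start)
    return gemm_flops(size, size, size) / best / 1e9
```

Calling measure_gflops with the routine of interest for each size from 100 to 500 gives the rate to compare against what the same routine reaches above 1000x1000. Taking the best of several repeats is a design choice to reduce timing noise, which matters more for small matrices because each call is short.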
Thanks for the help :)