Do you iterate the matrix by row or by colum or both? Do you always access nearby elements or do you do random accesses on the matrix.
If there is some locality in your accesses but you're not accessing it sequential (typical in matrix multiplication for example) then you can get a huge performance difference by storing your matrix in a more cache-friendly way.
A pretty easy way to do that is to write a little access function to turn your row/colum indices into an index and work on a one dimensional matrix, the cache-friendy way.
The function should group nearby coordinates into nearby indices. The morton-order can be used if you work on power of two sizes. For non-power sizes you can often bring just the lowest 4 bits into morton order and use normal index-arithmetic for the upper bits. You'll still get a significant speed-up, even if the coordinate to index conversion looks seems to be a costly operation.
http://en.wikipedia.org/wiki/Z-order_(curve) <-- sorry, can't link that SO does not like URL's with a dash in it. You have to cut'n'paste.
A speed up of factor 10 and more are realistic btw. It depends on the algorithm you ron over your matrices though.