You picked a poor example, because Tudor covered that point well: spinning-disk hardware is physically constrained by its moving platters and heads, and the most efficient way to read is to fetch each block in sequence, which minimizes head movement and the time spent waiting for the platter to rotate into position.
That said, some operating systems do not always store data contiguously on disk, and for those who remember, defragmenting could improve disk performance when your OS/filesystem did not do that work for you.
Since you mentioned wanting a program that benefits, let me suggest a simple one: matrix addition.
Assuming you create one thread per core, you can easily divide any two matrices to be added into N sets of rows (one set per thread). Matrix addition (in case you've forgotten) works like this:
A + B = C
or
[ a11, a12, a13 ]   [ b11, b12, b13 ]   [ (a11+b11), (a12+b12), (a13+b13) ]
[ a21, a22, a23 ] + [ b21, b22, b23 ] = [ (a21+b21), (a22+b22), (a23+b23) ]
[ a31, a32, a33 ]   [ b31, b32, b33 ]   [ (a31+b31), (a32+b32), (a33+b33) ]
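In code, this is just an element-by-element sum. Below is a minimal single-threaded sketch in C++; the function name add and the use of nested std::vector are my own choices for illustration, not anything prescribed above.

#include <cstddef>
#include <vector>

// Add two equally sized matrices element by element: c[i][j] = a[i][j] + b[i][j].
std::vector<std::vector<int>> add(const std::vector<std::vector<int>>& a,
                                  const std::vector<std::vector<int>>& b) {
    std::vector<std::vector<int>> c(a.size(), std::vector<int>(a[0].size()));
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < a[i].size(); ++j)
            c[i][j] = a[i][j] + b[i][j];
    return c;
}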
So, to distribute this across N threads, we simply take each row number modulo the number of threads to get the "thread ID" that will add that row. For example:
matrix with 20 rows across 3 threads
row % 3 == 0 (for rows 0, 3, 6, 9, 12, 15, and 18)
row % 3 == 1 (for rows 1, 4, 7, 10, 13, 16, and 19)
row % 3 == 2 (for rows 2, 5, 8, 11, 14, and 17)
// row 20 doesn't exist, because we number rows from 0
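Here is a rough sketch of that row-modulo split with one worker per thread ID; the names addRows and parallelAdd and the choice of std::thread are my own assumptions for the sake of the example.

#include <cstddef>
#include <thread>
#include <vector>

// Each worker handles only the rows where row % numThreads == threadId,
// so no two threads ever write to the same row of the result.
void addRows(const std::vector<std::vector<int>>& a,
             const std::vector<std::vector<int>>& b,
             std::vector<std::vector<int>>& c,
             std::size_t threadId, std::size_t numThreads) {
    // Starting at threadId and stepping by numThreads visits exactly those rows.
    for (std::size_t row = threadId; row < a.size(); row += numThreads)
        for (std::size_t col = 0; col < a[row].size(); ++col)
            c[row][col] = a[row][col] + b[row][col];
}

void parallelAdd(const std::vector<std::vector<int>>& a,
                 const std::vector<std::vector<int>>& b,
                 std::vector<std::vector<int>>& c,
                 std::size_t numThreads) {
    std::vector<std::thread> workers;
    for (std::size_t id = 0; id < numThreads; ++id)
        workers.emplace_back([&, id] { addRows(a, b, c, id, numThreads); });
    for (auto& w : workers)
        w.join();  // once every worker has finished, the result matrix is complete
}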
Now each thread "knows" which rows it is responsible for, and the result for each row can be computed trivially, because no result crosses into another thread's domain of computation.
All that's needed now is a "result" data structure that tracks when each value has been computed; when the last value is set, the computation is complete. In this "fake" illustration of a matrix-addition result using two threads, computing the answer takes roughly half the time.
// For illustration only, the following assumes that threads don't get rescheduled to
// different cores. Real threads are scheduled across cores based on availability,
// and the scheduler tries to avoid unnecessary core migration of a running thread.
[ done, done, done ] // filled in at about the same time as row 2 (runs on core 3)
[ done, done, done ] // filled in at about the same time as row 1 (runs on core 1)
[ done, done, .... ] // filled in at about the same time as row 4 (runs on core 3)
[ done, ...., .... ] // filled in at about the same time as row 3 (runs on core 1)
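One possible way to implement that "done when the last value is set" bookkeeping is an atomic counter of completed rows. This is only a sketch of the idea (the ResultTracker name is made up for this example); in practice, simply joining the worker threads, as in the parallelAdd sketch above, gives the same guarantee with less machinery.

#include <atomic>
#include <cstddef>

// Tracks how many rows of the result have been filled in; the addition is
// complete once the last row has been reported as done.
class ResultTracker {
public:
    explicit ResultTracker(std::size_t totalRows) : totalRows_(totalRows) {}

    // Called by a worker thread after it finishes a row.
    void rowDone() { rowsDone_.fetch_add(1, std::memory_order_release); }

    // True once every row has been set.
    bool complete() const {
        return rowsDone_.load(std::memory_order_acquire) == totalRows_;
    }

private:
    const std::size_t totalRows_;
    std::atomic<std::size_t> rowsDone_{0};
};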
More complicated problems can also be solved with multithreading, and different problems call for different techniques. I deliberately chose one of the simplest examples.