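Which is faster for processing a 1TB file: a single machine or 5 networked machines? ("Processing" here means finding the single UTF-16 character with the most occurrences in that 1TB file.) The data transfer rate is 1Gbit/sec, the entire 1TB file resides on one computer, and each computer has a quad-core CPU.
Below is my attempt at the problem, using an array of longs (array size 2^16) to keep track of the character counts. This should fit in a single machine's memory, since 2^16 x 2^3 (size of a long) = 2^19 bytes = 0.5MB. Any help (links, comments, suggestions) would be much appreciated. I used the latency numbers quoted by Jeff Dean and did my best to use the best approximations I know of. The final answers are:
Single machine: 5.8 hours (due to the slowness of reading from disk)
5 networked machines: 7.64 hours (due to reading from disk and the network)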
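For reference, the counting approach described above could be sketched roughly as follows. This is only an illustration under some assumptions that are not in the question: the data is little-endian UTF-16, each chunk has an even number of bytes, and "character" means a 16-bit code unit (surrogate pairs are counted as two separate units). The helper names `new_counts`, `count_code_units` and `most_frequent` are made up for this sketch.

```python
import sys
from array import array

NUM_UNITS = 1 << 16  # one counter per possible 16-bit code unit


def new_counts():
    # "Q" is an unsigned 64-bit integer: 2^16 counters x 8 bytes = 0.5 MB.
    return array("Q", [0] * NUM_UNITS)


def count_code_units(chunk, counts):
    """Add the UTF-16LE code units in `chunk` (even-length bytes) to `counts`."""
    units = array("H")          # unsigned 16-bit values
    units.frombytes(chunk)      # interpreted in the host's byte order
    if sys.byteorder == "big":  # data is assumed little-endian
        units.byteswap()
    for unit in units:
        counts[unit] += 1


def most_frequent(counts):
    """Return (code unit, count) for the largest counter."""
    unit = max(range(NUM_UNITS), key=counts.__getitem__)
    return unit, counts[unit]


counts = new_counts()
count_code_units("abracadabra".encode("utf-16-le"), counts)
unit, n = most_frequent(counts)
print(chr(unit), n)
```

In a real run the file would be fed through `count_code_units` one chunk at a time, so only the 0.5MB count array has to stay in memory.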
1) Single Machine
a) Time to Read File from Disk --> 5.8 hrs
-If it takes 20ms to read 1MB seq from disk,
then to read 1TB from disk takes:
20ms/1MB x 1024MB/GB x 1024GB/TB = 20,972 secs
= 350 mins = 5.8 hrs
b) Time needed to fill array w/complete count data
--> 0 sec since it is computed while doing step 1a
-At 0.5 MB, the count array fits into L2 cache.
Since L2 cache takes only 7 ns to access,
the CPU can read & write to the count array
while waiting for the disk read.
Time: 0 sec since it is computed while doing step 1a
c) Iterate thru entire array to find max count --> 0.00625ms
-Since it takes 0.0125ms to read & write 1MB from
L2 cache and array size is 0.5MB, then the time
to iterate through the array is:
0.0125ms/MB x 0.5MB = 0.00625ms
d) Total Time
Total=a+b+c=~5.8 hrs (due to slowness of reading from disk)
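The single-machine arithmetic above can be reproduced with a short script. All input figures are the ones used in the post; the only addition is the per-update cost of two L2 accesses (one read plus one write), which is an assumption used to check the claim in 1b that the CPU keeps up with the disk.

```python
# Figures as used in the post (latency numbers attributed to Jeff Dean).
DISK_MS_PER_MB = 20.0     # sequential read of 1 MB from disk
L2_MS_PER_MB = 0.0125     # read & write 1 MB in L2 cache, per the post
L2_ACCESS_NS = 7.0        # one L2 cache access
MB_PER_TB = 1024 * 1024
COUNT_ARRAY_MB = 0.5

# 1a) read the whole file from disk
disk_read_s = DISK_MS_PER_MB * MB_PER_TB / 1000

# 1b) time available per 2-byte code unit while the disk streams data,
#     versus the assumed cost of one counter update (read + write in L2)
units_per_mb = 1024 * 1024 // 2
ns_per_unit_from_disk = DISK_MS_PER_MB * 1e6 / units_per_mb
ns_per_update = 2 * L2_ACCESS_NS

# 1c) one pass over the count array to find the maximum
scan_ms = L2_MS_PER_MB * COUNT_ARRAY_MB

total_h = (disk_read_s + scan_ms / 1000) / 3600

print(f"disk read: {disk_read_s:,.0f} s")
print(f"per unit: {ns_per_unit_from_disk:.1f} ns available, {ns_per_update:.0f} ns needed")
print(f"array scan: {scan_ms} ms")
print(f"total: {total_h:.1f} h")
```

With these inputs the disk delivers a code unit roughly every 38 ns, while an update costs about 14 ns, which is consistent with treating step 1b as free.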
2) 5 Networked Machines
a) Time to transfer 1TB over 1Gbit/s --> 6.48 hrs
1TB x 1024GB/TB x 8bits/B x 1s/Gbit
= 8,192s = 137m = 2.3hr
But since the original machine keeps a fifth of the data, it
only needs to send (4/5)ths of data, so the time required is:
2.3 hr x 4/5 = 1.84 hrs
*But to send the data, the data needs to be read, which
is (4/5)(answer 1a) = (4/5)(5.8 hrs) = 4.64 hrs
So total time = 1.84hrs + 4.64 hrs = 6.48 hrs
b) Time to fill array w/count data from original machine --> 1.16 hrs
-The original machine (that had the 1TB file) still needs to
read the remainder of the data in order to fill the array with
count data. So this requires (1/5)(answer 1a)=1.16 hrs.
The CPU time to read & write to the array is negligible, as
shown in 1b.
c) Time to fill the other machines' arrays w/counts --> not counted
-As the file is being transferred, the count array can be
computed. This time is not counted.
d) Time required to receive 4 arrays --> (2^-6)s
-Each count array is 0.5MB
0.5MB x 4 arrays x 8bits/B x 1s/Gbit
= 2^20B/2 x 2^2 x 2^3 bits/B x 1s/2^30bits
= 2^25/2^31s = (2^-6)s
e) Time to merge arrays
--> 0 sec (since they can be merged while receiving)
f) Total time
Total=a+b+c+d+e =~ a+b =~ 6.48 hrs + 1.16 hrs = 7.64 hrs
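The 5-machine estimate can be checked the same way. The sketch below follows the post's own model: disk reads and network sends are added up sequentially, and 1 Gbit is treated as 2^30 bits, which is what the 8,192 s figure implies. The variable names are only for illustration.

```python
GBIT = 2 ** 30            # bits per second on the link (binary Gbit, as in the post)
TB_BYTES = 2 ** 40
COUNT_ARRAY_BYTES = 2 ** 19   # 0.5 MB per count array
DISK_MS_PER_MB = 20.0

disk_read_s = DISK_MS_PER_MB * 1024 * 1024 / 1000   # step 1a: whole file from disk
link_s = TB_BYTES * 8 / GBIT                        # whole file over the link

# 2a) read 4/5 of the file from disk, then send it to the other 4 machines
send_s = link_s * 4 / 5
read_for_send_s = disk_read_s * 4 / 5
# 2b) the original machine reads and counts its own 1/5
local_s = disk_read_s / 5
# 2d) receive the 4 count arrays from the other machines
arrays_s = 4 * COUNT_ARRAY_BYTES * 8 / GBIT

total_h = (send_s + read_for_send_s + local_s + arrays_s) / 3600

print(f"send: {send_s / 3600:.2f} h, read for send: {read_for_send_s / 3600:.2f} h")
print(f"local count: {local_s / 3600:.2f} h, receive arrays: {arrays_s} s")
print(f"total: {total_h:.2f} h")
```

Without intermediate rounding the total comes out slightly above 7.64 hrs (about 7.65), so the difference from the figure above is purely a rounding effect.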