python - What is the most efficient way to process very large amounts of data from disk using Python?

Question

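I am writing a simple Python script to read through and reconstruct data from a failed RAID5 array that I have been unable to rebuild in any other way. My script works, but it is slow. The original version ran at about 80 MB/minute. I have since improved it, and it now runs at 550 MB/minute, but that still seems a bit low. The Python process sits at 100% CPU, so it appears to be CPU-bound rather than disk-bound, which means there is room for optimization. Because the script is not very long at all, I have not been able to profile it effectively, so I don't know what is eating up the time. Here is my script as it stands right now (or at least the important parts):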

disk0chunk = disk0.read(chunkSize)
#disk1 is missing, bad firmware
disk2chunk = disk2.read(chunkSize)
disk3chunk = disk3.read(chunkSize)
if (parityDisk % 4 == 1): #if the parity stripe is on the missing drive
  output.write(disk0chunk + disk2chunk + disk3chunk)
else: #we need to rebuild the data in disk1
  # disk0num = map(ord, disk0chunk) #inefficient, old code
  # disk2num = map(ord, disk2chunk) #inefficient, old code
  # disk3num = map(ord, disk3chunk) #inefficient, old code
  # "16384l" covers a 64 KB chunk only where a native long is 4 bytes; "=16384l" forces 4-byte longs
  disk0num = struct.unpack("16384l", disk0chunk) #more efficient new code
  disk2num = struct.unpack("16384l", disk2chunk) #more efficient new code
  disk3num = struct.unpack("16384l", disk3chunk) #more efficient new code
  magicpotato = zip(disk0num,disk2num,disk3num)
  disk1num = map(takexor, magicpotato)
  # disk1bytes = map(chr, disk1num) #inefficient, old code
  # disk1chunk = ''.join(disk1bytes) #inefficient, old code
  disk1chunk = struct.pack("16384l", *disk1num) #more efficient new code

  #output the non-parity chunks in an order based on parityDisk

def takexor(magicpotato):
  return magicpotato[0]^magicpotato[1]^magicpotato[2]

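Bold marks the actual questions in this huge block of text:

Is there anything I can do to make this faster/better? If nothing comes to mind, is there anything I can do to better investigate what is making this slow? (Is there a way to profile Python at a per-line level?) Am I even handling this the right way, or is there a better way to handle massive amounts of binary data?

The reason I ask is that I have a 3TB drive to rebuild, and even though the script works correctly (I can mount the image ro,loop and browse files fine), it is taking a long time. With the old code I measured that it would run until mid-January; now it will run until Christmas (so it is much better, but still slower than I expected).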

Before you ask: this is an mdadm RAID5 (64 KB chunk size, left-symmetric layout), but the mdadm metadata is missing somehow. mdadm does not allow you to reconfigure a RAID5 without rewriting the metadata to the disk, which I am trying to avoid at all costs. I don't want to risk screwing something up and losing data, however remote the possibility may be.
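
On the profiling question, here is a minimal sketch using the standard-library cProfile module. It assumes Python 3 (the script above is Python 2), and `xor_columns` and `profile_call` are hypothetical helper names used only for illustration. cProfile reports per function rather than per line; per-line timing needs a third-party tool such as line_profiler, which is not shown here.

```python
import cProfile
import io
import pstats


def takexor(words):
    # Same XOR helper as in the script above.
    return words[0] ^ words[1] ^ words[2]


def xor_columns(a, b, c):
    # Mirrors the map/zip pattern used in the script.
    return list(map(takexor, zip(a, b, c)))


def profile_call(func, *args):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


# Stand-in data: three sequences of 16384 words each.
result, report = profile_call(
    xor_columns, range(16384), range(16384), range(16384)
)
print(report)
```

Even for a short script, the call counts in the report make it visible when one small function is being invoked once per word.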

score 3 · Accepted Answer

map(takexor, magicpotato): this is probably better done with direct iteration. AFAIK map isn't efficient when it has to call other Python code, since it needs to construct and destroy 16384 frame objects to perform the calls.
Use the array module instead of struct.
If it's still too slow, compile it with Cython and add some static types (that will probably make it 2-3 orders of magnitude faster).
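
The first two suggestions might look roughly like the sketch below. It is an illustration, not the original script: it assumes Python 3 (the question's code is Python 2) and chunk lengths that are a multiple of 8 bytes, and `rebuild_chunk` is a hypothetical helper name.

```python
from array import array


def rebuild_chunk(chunk_a, chunk_b, chunk_c):
    """Return the byte-wise XOR of three equal-length chunks.

    In RAID5, XOR-ing the surviving data chunks with the parity chunk
    yields the chunk that was stored on the missing disk.
    """
    # "Q" is an unsigned 8-byte integer, so a 64 KB chunk becomes 8192 words.
    # frombytes() raises ValueError if the length is not a multiple of 8.
    words_a = array("Q")
    words_a.frombytes(chunk_a)
    words_b = array("Q")
    words_b.frombytes(chunk_b)
    words_c = array("Q")
    words_c.frombytes(chunk_c)

    # Direct iteration: the XOR is written inline instead of calling a
    # Python-level helper function once per word via map().
    rebuilt = array(
        "Q", (a ^ b ^ c for a, b, c in zip(words_a, words_b, words_c))
    )
    return rebuilt.tobytes()
```

With the names from the question, the call would be something like `disk1chunk = rebuild_chunk(disk0chunk, disk2chunk, disk3chunk)`. Because XOR works bit by bit, the word size and byte order do not affect the result.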

score 0 · Accepted Answer

Google for: widefinder python. Some of the techniques discussed in the Python entries might be of use, such as memory-mapped I/O.
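
As a rough idea of what memory-mapped reading could look like, here is a minimal sketch using the standard-library mmap module. It assumes Python 3 and a regular, non-empty file such as a disk image; mapping a raw block device this way may not work, since the mapping length is taken from the reported file size. `iter_chunks` is a hypothetical helper name.

```python
import mmap


def iter_chunks(path, chunk_size):
    """Yield successive chunk_size-byte pieces of a file via a read-only mmap."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; this raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, len(mm), chunk_size):
                # Slicing copies the bytes out of the mapping; the last
                # piece may be shorter than chunk_size.
                yield mm[offset:offset + chunk_size]
```

Whether this beats plain `read(chunkSize)` calls depends on the workload; in the question the bottleneck appears to be the CPU, so any gain from this alone may be small.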
