python - 使用python向后读取二进制文件

Question

我尝试向后读取文件（从头到尾）。下面的例子就是这样做的，但我想问问社区——我的问题有更优雅的解决方案吗？

import os, binascii

CHUNK = 10 #read file by blocks (big size)
src_file_path = 'd:\\src\\python\\test\\main.zip' 
src_file_size = os.path.getsize(src_file_path)
src_file = open(src_file_path, 'rb') #open in binary mode
while src_file_size > 0:
    #read file from last byte to first :)
    if src_file_size > CHUNK:
        src_file.seek(src_file_size - CHUNK)
        byte_list = src_file.read(CHUNK)
    else:
        src_file.seek(0)
        byte_list = src_file.read(src_file_size)
    s = binascii.hexlify(byte_list) #convert '\xFB' -> 'FB'
    byte_list = [(chr(s[i]) + chr(s[i+1])) for i in range(0, len(s), 2)] #split, note below
    print(byte_list[::-1]) #output reverse list
    src_file_size = src_file_size - CHUNK
src_file.close() #close file

UPD我想知道专家的意见 - 作为Python新手我需要注意什么？这段代码是否存在潜在缺陷？

提前致谢。

我正在使用 Python 3.3.1 注意：从这里按字节分割！

score 2 · Accepted Answer

在这里用 mmap 详细说明 tim-hoffman 的最佳答案。（对不起，我会评论而不是回答，但我还没有足够的 stackfoo 来评论）。

import mmap
# Reverses a binary byte-wise in an efficient manner
with open("out.bin","wb") as w:
    with open("in.bin,"rb") as f:
        # read-only access or you get an access-denied or need to use r+b permissions
        mm = mmap.mmap(f.fileno(),0,access=mmap.ACCESS_READ)
        w.write(mm[::-1])

score 1 · Accepted Answer

另一种方法是使用 mmap。

http://docs.python.org/2/library/mmap.html

在此示例中，文本文件的内容为“0987654321\n”

>>> import mmap
>>> f = open("x.txt","r+b")
>>> mm = mmap.mmap(f.fileno(), 0)
>>> mm[0:]
'0987654321\n'
>>> 
>>> for i in range(len(mm),0,-1):
...     if i == 1:
...          print i,repr(mm[0:1])
...     else:
...          print i,repr(mm[i-1:i-2:-1])
... 
11 '\n'
10 '1'
9 '2'
8 '3'
7 '4'
6 '5'
5 '6'
4 '7'
3 '8'
2 '9'
1 '0'

然后，您可以使用范围和切片更改您的块大小。让我们以 3 个为一组向后退一步。

>>> for i in range(len(mm)-1,-1,-3):
...   if i < 3:
...      print i,repr(mm[0:i+1])
...   else:
...      print i,repr(mm[i:i-3:-1])
... 
10 '\n12'
7 '345'
4 '678'
1 '09'
>>>

一个很大的优势是你不需要做任何倒车等......

score 1 · Accepted Answer

我可以从问题的代码中看到几处需要改进的地方。首先，循环在 Python 中很少使用，因为使用循环或使用一些内置函数while几乎总是有更好的表达方式。for

我猜代码纯粹是出于培训目的。否则，我会先问真正的目标是什么（因为知道问题，更好的解决方案可能与第一个想法大不相同）。

这里的目标是获得seek. 你知道大小，你知道块大小，你想倒退。在 Python 中有一个用于此目的的内置生成器，名为range. 主要使用单个参数；但是，range(start, stop, step)是完整的形式。生成器可以在for循环中迭代，或者您可以使用这些值来构建它们的列表（但您通常不需要后一种情况）。的位置seek可以这样生成：

chunk = 10
sz = 235

lst = list(range(sz - chunk, 0, -chunk))
print(lst)

即，您从sz - chunk位置开始，在零处停止（不经常），使用负值作为下一个生成值。这里list()迭代所有值并构建它们的列表。但是您可以直接遍历生成的值：

for pos in range(sz - chunk, 0, -chunk):
    print('seek({}) and read({})'.format(pos, chunk))

if pos > 0:
    print('seek({}) and read({})'.format(0, pos))

最后生成的位置是或零或正。这样，if当最后一部分短于时，last 会处理最后一部分chunk。将上面的代码放在一起，它会打印：

c:\tmp\_Python\wikicsm\so16443185>py a.py
[225, 215, 205, 195, 185, 175, 165, 155, 145, 135, 125, 115, 105, 95,
85, 75, 65, 55, 45, 35, 25, 15, 5]
seek(225) and read(10)
seek(215) and read(10)
seek(205) and read(10)
seek(195) and read(10)
seek(185) and read(10)
seek(175) and read(10)
seek(165) and read(10)
seek(155) and read(10)
seek(145) and read(10)
seek(135) and read(10)
seek(125) and read(10)
seek(115) and read(10)
seek(105) and read(10)
seek(95) and read(10)
seek(85) and read(10)
seek(75) and read(10)
seek(65) and read(10)
seek(55) and read(10)
seek(45) and read(10)
seek(35) and read(10)
seek(25) and read(10)
seek(15) and read(10)
seek(5) and read(10)
seek(0) and read(5)

我个人会print通过调用将获取文件对象、pos 和块大小的函数来替换 's。这里伪造的身体产生相同的印刷品：

#!python3
import os

def processChunk(f, pos, chunk_size):
    print('faked f: seek({}) and read({})'.format(pos, chunk_size))


fname = 'a.txt'
sz = os.path.getsize(fname)     # not checking existence for simplicity
chunk = 16

with open(fname, 'rb') as f:
    for pos in range(sz - chunk, 0, -chunk):
        processChunk(f, pos, chunk)

    if pos > 0:
        processChunk(f, 0, pos)

构造with是另一个好学的东西。（警告，与 Pascal 的 . 没有任何相似之处with。）它在块结束后自动关闭文件对象。请注意，下面的代码with更具可读性，以后无需更改。将processChunk进一步发展：

def processChunk(f, pos, chunk_size):
    f.seek(pos)
    s = binascii.hexlify(f.read(chunk_size))
    print(s)

或者您可以稍微更改它，使其结果是反向的 hexdump（在我的计算机上测试的完整代码）：

#!python3

import binascii
import os

def processChunk(f, pos, chunk_size):
    f.seek(pos)
    b = f.read(chunk_size)
    b1 = b[:8]                  # first 8 bytes
    b2 = b[8:]                  # the rest
    s1 = ' '.join('{:02x}'.format(x) for x in b1)
    s2 = ' '.join('{:02x}'.format(x) for x in b2)
    print('{:08x}:'.format(pos), s1, '|', s2)


fname = 'a.txt'
sz = os.path.getsize(fname)     # not checking existence for simplicity
chunk = 16

with open(fname, 'rb') as f:

    for pos in range(sz - chunk, 0, -chunk):
        processChunk(f, pos, chunk)

    if pos > 0:
        processChunk(f, 0, pos)

a.txt最后一个代码的副本是什么时候，它会产生：

c:\tmp\_Python\wikicsm\so16443185>py d.py
00000274: 75 6e 6b 28 66 2c 20 30 | 2c 20 70 6f 73 29 0d 0a
00000264: 20 20 20 20 20 20 20 70 | 72 6f 63 65 73 73 43 68
00000254: 20 20 69 66 20 70 6f 73 | 20 3e 20 30 3a 0d 0a 20
00000244: 6f 73 2c 20 63 68 75 6e | 6b 29 0d 0a 0d 0a 20 20
00000234: 72 6f 63 65 73 73 43 68 | 75 6e 6b 28 66 2c 20 70
00000224: 75 6e 6b 29 3a 0d 0a 20 | 20 20 20 20 20 20 20 70
00000214: 20 2d 20 63 68 75 6e 6b | 2c 20 30 2c 20 2d 63 68
00000204: 20 70 6f 73 20 69 6e 20 | 72 61 6e 67 65 28 73 7a
000001f4: 61 73 20 66 3a 0d 0a 0d | 0a 20 20 20 20 66 6f 72
000001e4: 65 6e 28 66 6e 61 6d 65 | 2c 20 27 72 62 27 29 20
000001d4: 20 3d 20 31 36 0d 0a 0d | 0a 77 69 74 68 20 6f 70
000001c4: 69 6d 70 6c 69 63 69 74 | 79 0d 0a 63 68 75 6e 6b
000001b4: 20 65 78 69 73 74 65 6e | 63 65 20 66 6f 72 20 73
000001a4: 20 20 23 20 6e 6f 74 20 | 63 68 65 63 6b 69 6e 67
00000194: 65 74 73 69 7a 65 28 66 | 6e 61 6d 65 29 20 20 20
00000184: 0d 0a 73 7a 20 3d 20 6f | 73 2e 70 61 74 68 2e 67
00000174: 0a 66 6e 61 6d 65 20 3d | 20 27 61 2e 74 78 74 27
00000164: 31 2c 20 27 7c 27 2c 20 | 73 32 29 0d 0a 0d 0a 0d
00000154: 27 2e 66 6f 72 6d 61 74 | 28 70 6f 73 29 2c 20 73
00000144: 20 20 70 72 69 6e 74 28 | 27 7b 3a 30 38 78 7d 3a
00000134: 66 6f 72 20 78 20 69 6e | 20 62 32 29 0d 0a 20 20
00000124: 30 32 78 7d 27 2e 66 6f | 72 6d 61 74 28 78 29 20
00000114: 32 20 3d 20 27 20 27 2e | 6a 6f 69 6e 28 27 7b 3a
00000104: 20 78 20 69 6e 20 62 31 | 29 0d 0a 20 20 20 20 73
000000f4: 7d 27 2e 66 6f 72 6d 61 | 74 28 78 29 20 66 6f 72
000000e4: 20 27 20 27 2e 6a 6f 69 | 6e 28 27 7b 3a 30 32 78
000000d4: 65 20 72 65 73 74 0d 0a | 20 20 20 20 73 31 20 3d
000000c4: 20 20 20 20 20 20 20 20 | 20 20 20 20 23 20 74 68
000000b4: 62 32 20 3d 20 62 5b 38 | 3a 5d 20 20 20 20 20 20
000000a4: 73 74 20 38 20 62 79 74 | 65 73 0d 0a 20 20 20 20
00000094: 20 20 20 20 20 20 20 20 | 20 20 20 23 20 66 69 72
00000084: 31 20 3d 20 62 5b 3a 38 | 5d 20 20 20 20 20 20 20
00000074: 75 6e 6b 5f 73 69 7a 65 | 29 0d 0a 20 20 20 20 62
00000064: 20 20 20 62 20 3d 20 66 | 2e 72 65 61 64 28 63 68
00000054: 20 20 66 2e 73 65 65 6b | 28 70 6f 73 29 0d 0a 20
00000044: 63 68 75 6e 6b 5f 73 69 | 7a 65 29 3a 0d 0a 20 20
00000034: 73 73 43 68 75 6e 6b 28 | 66 2c 20 70 6f 73 2c 20
00000024: 20 6f 73 0d 0a 0d 0a 64 | 65 66 20 70 72 6f 63 65
00000014: 62 69 6e 61 73 63 69 69 | 0d 0a 69 6d 70 6f 72 74
00000004: 74 68 6f 6e 33 0d 0a 0d | 0a 69 6d 70 6f 72 74 20
00000000: 23 21 70 79 |

对于，您可以像在 Windows 中src_file_path = 'd:\\src\\python\\test\\main.zip'一样使用正斜杠。src_file_path = 'd:/src/python/test/main.zip'或者您可以使用原始字符串，例如 src_file_path = r'd:\src\python\test\main.zip'。当您需要避免双反斜杠时使用最后一种情况 - 通常在编写常规表达式时。

python - 使用python向后读取二进制文件

3 回答 3

Related

Reference