python - Faster solution for random read/write of text in Python

Question

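I need a fast solution for random write/read of text snippets in Python. What I want to do is this:

Write a snippet and record a pointer to it
Use the pointer to retrieve the snippet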

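The snippets are of arbitrary length, and I chose not to store them in a database, only the pointers. Simply replacing the Python file methods with C functions (solution 1) was already quite fast, and the pointer consisted only of the "where" and the "how long" of the snippet. After that, I tried what I thought was the real thing, the way Berkeley DB works. I don't know what to call it, maybe "paging"?

The thing is, this code does work and is 1.5 to 2 times faster than solution 1, but that is not much faster, and it needs a 4-part pointer. Perhaps this approach is not worth it, but is there any room to improve it significantly?

The following is the code: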

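from collections import namedtuple
from ctypes import cdll,c_char_p,\
     c_void_p,c_size_t,c_long,\
     c_int,create_string_buffer
libc = cdll.msvcrt # Windows C runtime, so this code is Windows-only
fopen = libc.fopen
fread = libc.fread
fwrite = libc.fwrite
fseek = libc.fseek
ftell = libc.ftell
fflush = libc.fflush
fclose = libc.fclose

# ctypes assumes int arguments and return values unless told otherwise,
# which would truncate the FILE* pointer on 64-bit builds.
fopen.restype = c_void_p
fopen.argtypes = [c_char_p, c_char_p]
fread.restype = c_size_t
fread.argtypes = [c_void_p, c_size_t, c_size_t, c_void_p]
fwrite.restype = c_size_t
fwrite.argtypes = [c_void_p, c_size_t, c_size_t, c_void_p]
fseek.argtypes = [c_void_p, c_long, c_int]
ftell.restype = c_long
ftell.argtypes = [c_void_p]
fflush.argtypes = [c_void_p]
fclose.argtypes = [c_void_p]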

#######################################################
# The following is how to write a snippet into the SnippetBase file

# Named Pointer so the instance created below does not shadow the class
Pointer = namedtuple('pointer','blk1, start, nblk, length')
snippet = '''
blk1: the first blk where the snippet is
start: the start of this snippet
nblk: number of blocks this snippet takes
length: length of this snippet
'''
bsize = 4096 # bsize: block size

fh = fopen('.\\SnippetBase.txt','ab') # append mode; 'wb' would truncate the file on every open
fseek(fh,0,2) # 2 = SEEK_END
pos1 = divmod(ftell(fh),bsize) # (block index, offset in block) before the write
fwrite(snippet,c_size_t(len(snippet)),1,fh)
fflush(fh)
pos2 = divmod(ftell(fh),bsize) # (block index, offset in block) after the write
ptr = Pointer(pos1[0],pos1[1],pos2[0]-pos1[0]+1,len(snippet))
fclose(fh)


#######################################################
# The following is how to read the snippet from the SnippetBase file

fh = fopen('.\\SnippetBase.txt','rb')
fseek(fh,c_long(ptr.blk1*bsize),0) # 0 = SEEK_SET: absolute offset of the first block
buff = create_string_buffer(ptr.nblk*bsize)
fread(buff,c_size_t(ptr.nblk*bsize),1,fh) # read whole blocks into the buffer
print buffer(buff,ptr.start,ptr.length) # Python 2: slice out the snippet without copying
fclose(fh)
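For comparison, here is a minimal sketch of the same 4-part pointer scheme using only the standard library's file objects, written for Python 3. The names Pointer, BSIZE, write_snippet and read_snippet are hypothetical and chosen for illustration; they are not from the original post, and no claim is made about how its speed compares with the ctypes version.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical names for this sketch; not part of the original code.
Pointer = namedtuple('Pointer', 'blk1 start nblk length')
BSIZE = 4096  # block size in bytes


def write_snippet(path, data):
    """Append data (bytes) to the file and return a 4-part Pointer."""
    with open(path, 'ab') as fh:
        fh.seek(0, os.SEEK_END)
        # Block index and offset within that block where the snippet starts.
        blk1, start = divmod(fh.tell(), BSIZE)
        fh.write(data)
        # Index of the block holding the last byte written.
        last_blk = (fh.tell() - 1) // BSIZE if data else blk1
    return Pointer(blk1, start, last_blk - blk1 + 1, len(data))


def read_snippet(path, ptr):
    """Read the whole blocks named by ptr and slice the snippet out."""
    with open(path, 'rb') as fh:
        fh.seek(ptr.blk1 * BSIZE)
        blocks = fh.read(ptr.nblk * BSIZE)
    return blocks[ptr.start:ptr.start + ptr.length]


if __name__ == '__main__':
    demo_path = os.path.join(tempfile.mkdtemp(), 'SnippetBase.txt')
    p = write_snippet(demo_path, b'some snippet text')
    print(p, read_snippet(demo_path, p))
```

Unlike the ctypes code, this version counts blocks from the last byte actually written, so a snippet that ends exactly on a block boundary does not claim an extra block.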

score 1 · Accepted Answer

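This looks like a hard and non-portable way to optimize one thing: the memory allocation performed by the Python wrappers file.read and os.read. All the other pieces can easily be done with functions already in the Python standard library. There is even an easy way to allocate a read/write buffer, namely bytearray. The io module happens to contain a method, readinto, which exists on its file types; I highly suspect that it does in fact avoid the allocation.

On the most popular operating systems, however, we can go one step further and use the OS disk buffers directly instead of allocating process-local memory for our own. This is done with mmap (but it gets tricky to use when your files are too large to fit in your address space). For a non-allocating way of reading data from an mmapped file, simply use buffer(mm, offset, size).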
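The two suggestions can be sketched as follows, assuming Python 3, where memoryview takes the role that buffer had in Python 2. The helper names read_into and demo and the file name snippets.bin are hypothetical. The sketch shows the calls involved, not a benchmark.

```python
import mmap
import os
import tempfile


def read_into(path, offset, length, buf):
    """Read up to `length` bytes at `offset` into the preallocated
    bytearray `buf` and return a memoryview of the bytes read."""
    view = memoryview(buf)[:length]
    with open(path, 'rb', buffering=0) as fh:  # raw, unbuffered file object
        fh.seek(offset)
        n = fh.readinto(view)  # fills the existing buffer in place
    return view[:n]


def demo():
    path = os.path.join(tempfile.mkdtemp(), 'snippets.bin')
    with open(path, 'wb') as fh:
        fh.write(b'hello world, this is a snippet store')

    # 1) readinto: one bytearray can be reused for every read.
    buf = bytearray(4096)
    first = read_into(path, 6, 5, buf).tobytes()

    # 2) mmap: slice the mapped file through a memoryview.
    with open(path, 'rb') as fh:
        mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            with memoryview(mm) as view:
                sub = view[6:11]        # no copy is made here
                second = sub.tobytes()  # copied only to outlive the mapping
                sub.release()
        finally:
            mm.close()
    return first, second


if __name__ == '__main__':
    print(demo())
```

Every memoryview taken from the mapping has to be released before mm.close() is called, otherwise close raises BufferError; that is why the slice is released explicitly.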
