python - numpy.memmap for string arrays?

Question

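Is it possible to use numpy.memmap to map a large, disk-based array of strings into memory?

I know it can be done for floats and the like, but this question is specifically about strings.

I am interested in solutions for both fixed-length and variable-length strings.

The solution is free to dictate any reasonable file format.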

score 5 · Accepted Answer

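If all the strings have the same length, as the term "array" suggests, this is easy:

import numpy

a = numpy.memmap("data", dtype="S10")

This maps the file as an array of byte strings that are each exactly 10 bytes long.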
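As a minimal sketch of the fixed-length case, the following writes a few records through a writable memmap and then reopens the file read-only. The file name and the sample words are assumptions made for illustration. Note that with dtype "S10" shorter strings are NUL-padded on disk and longer ones are silently truncated to 10 bytes.

```python
import os
import tempfile

import numpy as np

# Hypothetical file location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "strings.dat")

words = [b"alpha", b"beta", b"gamma", b"delta"]

# Create the file: every record occupies exactly 10 bytes.
out = np.memmap(path, dtype="S10", mode="w+", shape=(len(words),))
out[:] = words
out.flush()
del out

# Reopen read-only; the number of records is inferred from the file size.
a = np.memmap(path, dtype="S10", mode="r")
third = a[2]            # trailing NUL padding is stripped on access
n_records = a.shape[0]
```

Because every record has the same size, item i always lives at byte offset 10 * i, which is what makes O(1) access possible without a separate index.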

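Edit: Since the strings apparently have different lengths, you need to index the file to allow O(1) item access. This requires reading the whole file once and storing the start offsets of all strings in memory. Unfortunately, I don't think there is a pure NumPy way of building the index without first creating an in-memory array as large as the file. That array can be discarded once the offsets have been extracted, though.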
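One way the indexing pass might look, assuming the strings are stored one per line and the file ends with a newline (both are assumptions, as are the file name and sample data). The comparison against the newline byte is the step that temporarily allocates a boolean array as large as the file.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample file: one string per line, newline-terminated.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
lines = [b"Yep", b"This is a somewhat longer string!", b"last one"]
with open(path, "wb") as f:
    for line in lines:
        f.write(line + b"\n")

# Map the raw bytes; nothing is read until the data is touched.
raw = np.memmap(path, dtype=np.uint8, mode="r")

# This comparison builds a temporary boolean array the size of the file.
ends = np.flatnonzero(raw == ord("\n"))

# Each string starts right after the previous newline; the first starts at 0.
starts = np.concatenate(([0], ends[:-1] + 1))


def get_line(i):
    """Return string i as bytes, reading only its own slice of the file."""
    return raw[starts[i]:ends[i]].tobytes()
```

After this pass only `starts` and `ends` stay in memory, one integer each per string, and every lookup touches just the bytes of the requested item.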

score 2 · Accepted Answer

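The most flexible option would be to switch to a database or some other more complex on-disk file structure.

However, you may well have a good reason for keeping things as a plain text file...

Because you control how the file is created, one option is simply to write out a second file that contains nothing but the starting position (in bytes) of each string in the other file.

This takes a bit more work, but you could do something along these lines: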

class IndexedText(object):
    """A text file with one string per line, plus a second file of offsets."""

    def __init__(self, filename, mode='r'):
        if mode not in ['r', 'w', 'a']:
            raise ValueError('Only read, write, and append are supported')
        self._mainfile = open(filename, mode)
        # Append mode must also read the existing index, hence 'a+'
        self._idxfile = open(filename+'idx', 'a+' if mode == 'a' else mode)

        if mode != 'w':
            # Load the start position of every line written so far
            self._idxfile.seek(0)
            self.indices = [int(line.strip()) for line in self._idxfile]
        else:
            self.indices = []

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self._mainfile.close()
        self._idxfile.close()

    def __getitem__(self, idx):
        # Jump straight to the stored position instead of scanning the file
        position = self.indices[idx]
        self._mainfile.seek(position)
        # You might want to remove the automatic stripping...
        return self._mainfile.readline().rstrip('\n')

    def write(self, line):
        if not line.endswith('\n'):
            line += '\n'
        # Record where this line starts before writing it
        position = self._mainfile.tell()
        self.indices.append(position)
        self._idxfile.write(str(position)+'\n')
        self._mainfile.write(line)

    def writelines(self, lines):
        for line in lines:
            self.write(line)


def main():
    with IndexedText('test.txt', 'w') as outfile:
        outfile.write('Yep')
        outfile.write('This is a somewhat longer string!')
        outfile.write('But we should be able to index this file easily')
        outfile.write('Without needing to read the entire thing in first')

    with IndexedText('test.txt', 'r') as infile:
        print(infile[2])
        print(infile[0])
        print(infile[3])

if __name__ == '__main__':
    main()
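A possible variation on the same two-file idea, sketched here as an assumption rather than as part of the answer above: store the offsets as a binary int64 array so that the index itself can be opened with numpy.memmap and never has to be parsed or loaded in full. The helper names (`write_indexed`, `open_indexed`, `get_item`) and the ".offsets" suffix are hypothetical.

```python
import os
import tempfile

import numpy as np


def write_indexed(path, strings):
    """Write byte strings back to back, plus a binary file of offsets."""
    # offsets[i] is where string i starts; offsets[-1] is the total size.
    offsets = np.zeros(len(strings) + 1, dtype=np.int64)
    with open(path, "wb") as f:
        for i, s in enumerate(strings):
            f.write(s)
            offsets[i + 1] = offsets[i] + len(s)
    offsets.tofile(path + ".offsets")


def open_indexed(path):
    """Memory-map both the data and the offsets (data must be non-empty)."""
    data = np.memmap(path, dtype=np.uint8, mode="r")
    offsets = np.memmap(path + ".offsets", dtype=np.int64, mode="r")
    return data, offsets


def get_item(data, offsets, i):
    """Return string i as bytes using two index lookups and one slice."""
    return data[offsets[i]:offsets[i + 1]].tobytes()


# Hypothetical sample data for the sketch.
path = os.path.join(tempfile.mkdtemp(), "strings.bin")
items = [b"Yep", b"", b"But we should be able to index this file easily"]
write_indexed(path, items)

data, offsets = open_indexed(path)
first = get_item(data, offsets, 0)
```

Since lengths come from the offsets rather than from a delimiter, the strings may contain newlines or be empty. The offsets file is written in the machine's native byte order, so this sketch assumes it is read back on a machine with the same byte order.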
