python - How to pipe binary data into a numpy array without tmp storage?

Question

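There are several similar questions, but none of them answers this simple question directly:

How can I capture a command's output and stream that content into a numpy array without creating a temporary string object to read from?

So, what I would like to do is this: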

import subprocess
import numpy
import StringIO

def parse_header(fileobject):
    # this function moves the filepointer and returns a dictionary
    d = do_some_parsing(fileobject)
    return d

sio = StringIO.StringIO(subprocess.check_output(cmd))
d = parse_header(sio)
# now the file pointer is at the start of data, parse_header takes care of that.
# ALL of the data is now available in the next line of sio
dt = numpy.dtype([(key, 'f8') for key in d.keys()])

# I don't know how to make this work:
data = numpy.fromxxxx(sio , dt)

# If I did this, I would create another copy besides the StringIO object, wouldn't I?
# So this works, but isn't this 'bad'?
datastring = sio.read()
data = numpy.fromstring(datastring, dtype=dt)

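I tried it with StringIO and cStringIO, but neither is accepted by numpy.frombuffer or numpy.fromfile.

With a StringIO object I first have to read the stream into a string and then use numpy.fromstring, but I would like to avoid creating the intermediate object (several gigabytes).

An alternative for me would be to stream sys.stdin into a numpy array, but that does not work with numpy.fromfile either (seek needs to be implemented).

Are there any workarounds for this? I can't be the first one trying this (unless this is a PEBKAC case?)

Solution: This is the current solution. It is a mix of unutbu's instructions on how to use Popen with PIPE and eryksun's hint to use bytearray, so I don't know whom to accept!? :S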

import subprocess as sp
import numpy as np

proc = sp.Popen(cmd, stdout=sp.PIPE, shell=True)
d = parse_des_header(proc.stdout)  # consumes the header, leaving the stream at the data
rec_dtype = np.dtype([(key, 'f8') for key in d.keys()])
data = bytearray(proc.stdout.read())
ndata = np.frombuffer(data, dtype=rec_dtype)

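I did not check whether this really avoids creating another copy of the data; I don't know how to check that. But I noticed that this is much faster than everything I tried before, so many thanks to the authors of both answers!

Update 2022: I just tried the solution above without the bytearray() step, and it works fine. Thanks to Python 3, I guess?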
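
A note on the bytearray() step, as a minimal sketch rather than a statement about the original setup: on Python 3, np.frombuffer accepts the bytes object returned by read() directly, but the resulting array is read-only, whereas wrapping a bytearray gives a writable array. The sample bytes below are made up for illustration. Also note that bytearray(some_bytes) itself copies the data.

```python
import numpy as np

raw = bytes(range(16))  # stand-in for proc.stdout.read()

# bytes is immutable, so the array that wraps it is read-only.
ro = np.frombuffer(raw, dtype=np.uint8)

# A bytearray is mutable, so the array that wraps it is writable.
rw = np.frombuffer(bytearray(raw), dtype=np.uint8)

print(ro.flags.writeable)  # False
print(rw.flags.writeable)  # True
```

So dropping bytearray() is fine as long as the array is only read afterwards.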

score 6 · Accepted Answer

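You can use Popen with stdout=subprocess.PIPE. Read in the header, then load the rest into a bytearray to use with np.frombuffer.

Additional comments based on your edit:

If you're going to call proc.stdout.read(), it's equivalent to using check_output(). Both create a temporary string. If you preallocate data, you can use proc.stdout.readinto(data). Then, if the number of bytes read into data is less than len(data), free the excess memory; otherwise extend data by whatever is left to be read.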

data = bytearray(2**32) # 4 GiB
n = proc.stdout.readinto(data)
if n < len(data):
    del data[n:]         # free the unused tail (works on Python 2 and 3)
else:
    data += proc.stdout.read()
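
The same idea can be sketched for Python 3 as a loop that keeps calling readinto on the unused tail of one growing bytearray, so no full-size temporary bytes object is needed when the output is larger than the first guess. The helper name read_all_into, the initial size and the child command are assumptions made up for this illustration.

```python
import subprocess
import sys


def read_all_into(stream, initial_size=1 << 20):
    """Read a binary stream to EOF into a single growing bytearray."""
    data = bytearray(max(1, initial_size))
    filled = 0
    while True:
        if filled == len(data):
            # Buffer is full: double its capacity before reading more.
            data.extend(bytearray(len(data)))
        # Write straight into the unused tail; both views are released on
        # exit so the bytearray can be resized again on the next pass.
        with memoryview(data) as full, full[filled:] as tail:
            n = stream.readinto(tail)
        if not n:  # 0 means EOF
            break
        filled += n
    del data[filled:]  # drop the unused tail
    return data


# Hypothetical producer: a child Python process writing 256000 bytes.
child_code = "import sys; sys.stdout.buffer.write(bytes(range(256)) * 1000)"
proc = subprocess.Popen([sys.executable, "-c", child_code], stdout=subprocess.PIPE)
payload = read_all_into(proc.stdout, initial_size=4096)
proc.stdout.close()
proc.wait()
print(len(payload))
```

The returned bytearray can be passed to np.frombuffer as before. Growing a bytearray may still reallocate and move its contents internally, so a good initial size estimate helps.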

You could also start with a preallocated ndarray ndata and use buf = np.getbuffer(ndata). Then readinto(buf) as above.
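
As far as I know, np.getbuffer was only available in Python 2 builds of NumPy. On Python 3, a rough equivalent is to pass a uint8 view of the preallocated array to readinto, since NumPy arrays expose a writable buffer. This is only a sketch: the io.BytesIO stream stands in for proc.stdout, the field names and values are made up, and the record count must be known in advance.

```python
import io

import numpy as np

rec_dtype = np.dtype([("a", "<f8"), ("b", "<f8")])

# Stand-in for proc.stdout: one header line followed by 3 binary records.
body = np.arange(6, dtype="<f8").tobytes()
stream = io.BytesIO(b"a b\n" + body)

header = stream.readline()  # consume the header first

# Preallocate the result, then let readinto fill its memory directly.
ndata = np.empty(3, dtype=rec_dtype)
n = stream.readinto(ndata.view(np.uint8))

print(n == ndata.nbytes)
print(ndata["a"], ndata["b"])
```

If n comes back smaller than ndata.nbytes, the stream ended early and the tail of ndata is uninitialized.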

Here's an example showing that the memory is shared between the bytearray and the np.ndarray:

>>> data = bytearray(b'\x01')
>>> ndata = np.frombuffer(data, np.int8)
>>> ndata
array([1], dtype=int8)
>>> ndata[0] = 2
>>> data
bytearray(b'\x02')

score 2 · Accepted Answer

Since your data can easily fit in RAM, I think the easiest way to load the data into a numpy array is to use a ramfs.

On Linux,

sudo mkdir /mnt/ramfs
sudo mount -t ramfs -o size=5G ramfs /mnt/ramfs
sudo chmod 777 /mnt/ramfs

Then, for example, if this is the producer of the binary data:

writer.py:

from __future__ import print_function
import random
import struct
N = random.randrange(100)
print('a b')
for i in range(2*N):
    print(struct.pack('<d',random.random()), end = '')

Then you could load it into a numpy array like this:

reader.py:

import subprocess
import numpy

def parse_header(f):
    # this function moves the filepointer and returns a dictionary
    header = f.readline()
    d = dict.fromkeys(header.split())
    return d

filename = '/mnt/ramfs/data.out'
with open(filename, 'w') as f:  
    cmd = 'writer.py'
    proc = subprocess.Popen([cmd], stdout = f)
    proc.communicate()
with open(filename, 'r') as f:      
    header = parse_header(f)
    dt = numpy.dtype([(key, 'f8') for key in header.keys()])
    data = numpy.fromfile(f, dt)
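
The scripts above are written for Python 2. A rough Python 3 sketch of the same file-based approach follows. It is an adaptation, not the code above: it uses a temporary directory instead of /mnt/ramfs so that it runs anywhere, opens the file in binary mode, and replaces writer.py with a small inline child script (WRITER) invented for this illustration.

```python
import os
import subprocess
import sys
import tempfile

import numpy as np

# Hypothetical stand-in for writer.py: a text header line, then 6 doubles.
WRITER = (
    "import struct, sys\n"
    "out = sys.stdout.buffer\n"
    "out.write(b'a b\\n')\n"
    "out.write(struct.pack('<6d', *range(6)))\n"
)


def load_records(directory):
    path = os.path.join(directory, "data.out")
    # Let the child write straight into the file; no pipe, no Python string.
    with open(path, "wb") as f:
        subprocess.check_call([sys.executable, "-c", WRITER], stdout=f)
    with open(path, "rb") as f:
        names = f.readline().decode("ascii").split()
        dt = np.dtype([(name, "<f8") for name in names])
        # fromfile continues from the current file position, after the header.
        return np.fromfile(f, dtype=dt)


with tempfile.TemporaryDirectory() as tmp:
    data = load_records(tmp)
print(data)
```

Pointing the directory at a RAM-backed mount such as /mnt/ramfs keeps the intermediate file off the disk, as in the original suggestion.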
