c++ - Improving the speed of passing data from Python to C(++) via ctypes

Question

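I need to optimize a function call inside a loop for a time-critical robotics application. My script is in Python; it interfaces via ctypes with a C++ library I wrote, which in turn calls a microcontroller library.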

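The bottleneck is adding position-velocity-time points to the microcontroller buffer. According to my timing checks, calling the C++ function through ctypes takes about 0.45 seconds, while on the C++ side the called function takes 0.17 seconds. I need to reduce this difference somehow.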

Here is the relevant Python code, where data is a 2D array of points and clibrary is the library loaded via ctypes:

data_np = np.vstack([nodes, positions, velocities, times]).transpose().astype(np.long)

data = ((c_long * 4) * N)()
for i in range(N):
    data[i] = (c_long * 4)(*data_np[i])

timer = time()
clibrary.addPvtAll(N, data)
print("clibrary.addPvtAll() call: %f" % (time() - timer))

And here is the C++ function being called:

void addPvtAll(int N, long data[][4]) {

    clock_t t0, t1;
    t0 = clock();

    for(int i = 0; i < N; i++) {
        unsigned short node = (unsigned short)data[i][0];
        long p = data[i][1];
        long v = data[i][2];
        unsigned char t = (unsigned char)data[i][3];

        VCS_AddPvtValueToIpmBuffer(device(node), node, p, v, t, &errorCode);
    }

    t1 = clock();
    printf("addPvtAll() call: %f \n", (double(t1 - t0) / CLOCKS_PER_SEC));
}

I don't absolutely need to use ctypes, but I don't want to have to compile the Python code every time I run it.

score 1 · Accepted Answer

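Round trips between Python and C++ can be expensive, particularly when using ctypes (which is like an interpreted version of a normal C/Python wrapper).

Your goal should be to minimize the number of trips and do as much work as possible per trip.

It looks to me like your code is too fine-grained (i.e. it makes too many trips and does too little work on each one).

The numpy package can expose its data directly to C/C++. That lets you avoid expensive boxing and unboxing of Python objects (with their attendant memory allocations), and it lets you pass a range of data points rather than one point at a time.

Modify your C++ code to process many points at once rather than one per call (much like the sqlite3 module does with execute vs. executemany).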
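As a rough sketch of the numpy idea, assuming the C++ signature from the question (`long data[][4]`), the per-row copy loop can be replaced by one contiguous array whose memory is handed over as a pointer. The helper names `build_pvt_buffer` and `as_long_pointer` and the sample values are made up for illustration; the actual library call is left as a comment because it needs the asker's compiled library.

```python
import ctypes
import numpy as np


def build_pvt_buffer(nodes, positions, velocities, times):
    """Pack four equal-length sequences into one C-contiguous (N, 4) array of C longs."""
    table = np.vstack([nodes, positions, velocities, times]).T
    # ascontiguousarray forces row-major layout, matching `long data[][4]`.
    return np.ascontiguousarray(table, dtype=np.dtype("l"))  # "l" is C long


def as_long_pointer(array):
    """Return a ctypes pointer to the array's own memory (no copy is made)."""
    return array.ctypes.data_as(ctypes.POINTER(ctypes.c_long))


# Made-up sample values, purely for illustration.
pvt = build_pvt_buffer([1, 2], [100, 200], [10, 20], [5, 5])
ptr = as_long_pointer(pvt)

# With the question's library loaded, the call would look something like:
#     clibrary.addPvtAll(len(pvt), ptr)
```

The array has to stay alive for as long as the C side uses the pointer, since the pointer does not own the memory.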

score 0 · Accepted Answer

You can use data_np.data.tobytes():

data_np = np.vstack([nodes, positions, velocities, times]).transpose().astype(np.long)
timer = time()
clibrary.addPvtAll(N, data_np.data.tobytes())
print("clibrary.addPvtAll() call: %f" % (time() - timer))
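For what it's worth, here is a small sketch of what tobytes() hands over, using made-up sample points. As far as I can tell it produces a row-major copy of the data, so the C side sees N rows of 4 longs, but later changes to the array are not reflected in the bytes object.

```python
import ctypes
import numpy as np

# Made-up sample points, stacked column-wise and then transposed as in the question.
columns = [[1, 2], [100, 200], [10, 20], [5, 5]]
table = np.vstack(columns).T.astype(np.dtype("l"))  # "l" is C long

raw = table.data.tobytes()  # memoryview.tobytes(): a row-major copy of the data

# One C long per value, N rows of 4 values each.
expected_size = table.shape[0] * 4 * ctypes.sizeof(ctypes.c_long)

# Reading the bytes back shows the flat layout the C function would receive.
flat = np.frombuffer(raw, dtype=np.dtype("l"))

# Changing the array afterwards does not touch `raw`, because it is a copy.
table[0, 1] = -1
unchanged = np.frombuffer(raw, dtype=np.dtype("l"))
```

So this avoids the Python-level loop, at the cost of one extra copy of the buffer per call.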

score 0 · Accepted Answer

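Here is my solution, which effectively eliminated the measured time difference between Python and C. Credit goes to kirbyfan64sos for suggesting SWIG and to Raymond Hettinger for C arrays in numpy. I use a numpy array in Python, which gets sent to C purely as a pointer, so both languages access the same block of memory.

The C function stays the same, except that it now uses gettimeofday() instead of clock(), which was giving inaccurate times: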

void addPvtFrame(int pvt[6][4]) {

    timeval start,stop,result;
    gettimeofday(&start, NULL);

    for(int i = 0; i < 6; i++) {
        unsigned short node = (unsigned short)pvt[i][0];
        long p = (long)pvt[i][1];
        long v = (long)pvt[i][2];
        unsigned char t = (unsigned char)pvt[i][3];

        VCS_AddPvtValueToIpmBuffer(device(node), node, p, v, t, &errorCode);
    }

    gettimeofday(&stop, NULL);
    timersub(&stop, &start, &result);  // elapsed = stop - start
    printf("Add PVT time in C code: %fs\n", result.tv_sec + result.tv_usec/1000000.0);
}
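A likely reason clock() looked inaccurate: on POSIX systems it reports processor time, so time spent blocked waiting on a device is not counted, whereas gettimeofday() reports wall-clock time. The same distinction can be sketched in Python with time.process_time() and time.perf_counter(). The `measure` helper is made up for illustration, and the sleep merely stands in for a blocking device call.

```python
import time


def measure(func):
    """Run func once and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


# sleep() blocks without using the CPU, much like waiting on device I/O.
wall, cpu = measure(lambda: time.sleep(0.05))
print("wall: %.3fs, cpu: %.3fs" % (wall, cpu))
```

When comparing the Python and C sides, both measurements should use the same kind of clock, otherwise part of the apparent overhead is just a measurement artifact.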

In addition, I installed SWIG and included the following in my interface file:

%include "numpy.i"
%init %{
    import_array();
%}

%apply ( int INPLACE_ARRAY2[ANY][ANY] ) {(int pvt[6][4])}

Finally, my Python code constructs pvt as a contiguous array via numpy:

pvt = np.vstack([nodes, positions, velocities, times])
pvt = np.ascontiguousarray(pvt.transpose().astype(int))

timer = time()
xjus.addPvtFrame(pvt)
print("Add PVT time to C code: %fs" % (time() - timer))

The measured times now differ by only about 1% on my machine.
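A short sketch of why the ascontiguousarray step matters, with made-up values for six axes: transpose() only returns a view, so the data is not yet in the row-major order that `int pvt[6][4]` expects. This version also uses np.intc (NumPy's name for C int) rather than astype(int), on the assumption that the wrapper wants an exact match with the C type; astype(int) can yield 64-bit integers on some platforms. The `make_frame` helper is hypothetical.

```python
import ctypes
import numpy as np


def make_frame(nodes, positions, velocities, times):
    """Build a C-contiguous (N, 4) array of C ints from four columns."""
    stacked = np.vstack([nodes, positions, velocities, times])
    # transpose() only returns a view; ascontiguousarray copies it into row-major order.
    return np.ascontiguousarray(stacked.transpose(), dtype=np.intc)


# Made-up values for six axes, purely for illustration.
nodes = [1, 2, 3, 4, 5, 6]
frame = make_frame(nodes, [10 * n for n in nodes], [0] * 6, [50] * 6)

# For comparison: the bare transpose, before any contiguous copy is made.
view = np.vstack([nodes, nodes, nodes, nodes]).transpose()
print(view.flags["C_CONTIGUOUS"])
print(frame.flags["C_CONTIGUOUS"], frame.shape, frame.dtype.itemsize)
```

Checking the flags, shape, and item size like this before calling xjus.addPvtFrame(frame) is a cheap way to catch layout mismatches on the Python side.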
