0

我使用 mpi4py 库用 Python 编写 MPI 程序。Python 生成并调用低级 C 代码。

我刚刚遇到了一个段错误,需要追踪它。

如果我直接用 CI 编写代码,我会在 中编译和运行我的程序gdb,找到有问题的代码行,然后从那里开始调试。

如果我用 Python 编写,我仍然会使用gdb(你可以调用gdb python filename.py,gdb 会将你指向 C 的违规行。)

但是唉,我使用的是 MPI,它使用的是使用 C 的 Python。我得到了这样的堆栈跟踪

mrocklin@baconost:~/workspace/ape$ mpiexec -np 3 -hostfile tmp/hostfile -rankfile tmp/rankfile python ape/codegen/run.py tmp/
[baconost:27370] *** Process received signal ***
[baconost:27370] Signal: Segmentation fault (11)
[baconost:27370] Signal code: Address not mapped (1)
[baconost:27370] Failing at address: 0x8
[baconost:27370] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfc60) [0x7fcd529c5c60]
[baconost:27370] [ 1] /home/mrocklin/Software/openmpi/lib/libmpi.so.0(MPI_Comm_get_errhandler+0xa0) [0x7fcd46fae240]
[baconost:27370] [ 2] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/python2.7/site-packages/mpi4py/MPI.so(+0x2ddbc) [0x7fcd47234dbc]
[baconost:27370] [ 3] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/python2.7/site-packages/mpi4py/MPI.so(initMPI+0x1f2e) [0x7fcd4724f95e]
[baconost:27370] [ 4] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(_PyImport_LoadDynamicModule+0xc2) [0x7fcd52cc2e02]
[baconost:27370] [ 5] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xecd90) [0x7fcd52cc0d90]
[baconost:27370] [ 6] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xed031) [0x7fcd52cc1031]
[baconost:27370] [ 7] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyImport_ImportModuleLevel+0x2be) [0x7fcd52cc206e]
[baconost:27370] [ 8] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xd3fed) [0x7fcd52ca7fed]
[baconost:27370] [ 9] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyObject_Call+0x68) [0x7fcd52c19d18]
[baconost:27370] [10] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x56) [0x7fcd52ca8516]
[baconost:27370] [11] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x28b8) [0x7fcd52caba78]
[baconost:27370] [12] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x8d2) [0x7fcd52cb0722]
[baconost:27370] [13] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x32) [0x7fcd52cb0772]
[baconost:27370] [14] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyImport_ExecCodeModuleEx+0xc2) [0x7fcd52cbf6e2]
[baconost:27370] [15] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xebcae) [0x7fcd52cbfcae]
[baconost:27370] [16] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xecd90) [0x7fcd52cc0d90]
[baconost:27370] [17] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xed264) [0x7fcd52cc1264]
[baconost:27370] [18] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyImport_ImportModuleLevel+0x118) [0x7fcd52cc1ec8]
[baconost:27370] [19] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xd3fed) [0x7fcd52ca7fed]
[baconost:27370] [20] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyObject_Call+0x68) [0x7fcd52c19d18]
[baconost:27370] [21] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x56) [0x7fcd52ca8516]
[baconost:27370] [22] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x28b8) [0x7fcd52caba78]
[baconost:27370] [23] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x8d2) [0x7fcd52cb0722]
[baconost:27370] [24] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x32) [0x7fcd52cb0772]
[baconost:27370] [25] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyImport_ExecCodeModuleEx+0xc2) [0x7fcd52cbf6e2]
[baconost:27370] [26] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xebcae) [0x7fcd52cbfcae]
[baconost:27370] [27] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xed58d) [0x7fcd52cc158d]
[baconost:27370] [28] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xecd90) [0x7fcd52cc0d90]
[baconost:27370] [29] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xed264) [0x7fcd52cc1264]
[baconost:27370] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 27370 on node baconost.cs.uchicago.edu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

两个问题:

  1. 在一般情况下,在检测到错误后使用此软件堆栈调试程序的最佳方法是什么(我想避免“您应该通过测试而不是诊断错误来防止错误”评论。我同意这一点,但有时尽管我们尽了最大努力,但仍会发生段错误)

  2. 在这种特定情况下,这是否将我指向 MPI/mpi4py问题?最后几行表明可能是这种情况。

4

2 回答 2

1

这个特定错误的问题是旧版本的 OpenMPI 不能很好地处理 rankfiles。报告了此错误,并且似乎已从1.6.2. 显然这部分代码很少使用且未经测试。

于 2012-10-11T15:53:10.880 回答
0

首先,当您使用不同的模块/产品时,可能会出现版本不匹配的情况。检查您是否使用了正确的 Python/MPI/mpi4y 版本组合。

检查完所有这些后,您几乎可以做的就是:

  1. 假设是您的代码导致了问题 - 通常是一个很好的起点。
  2. 尝试将案例分解为导致 SEGV 的几个简单语句。我知道,这是最难的部分。
  3. 如果您的代码中的错误不明显,请询问有关您的代码的具体问题。
  4. 如果所有其他方法都失败了,那么它可能是其他地方的错误。
于 2012-10-04T14:09:51.047 回答