I am running into a problem when running Boost.Test test cases on a cluster. The error is: *** glibc detected *** ...myprogram.test: corrupted double-linked list: 0x000000000096b4d0 ***

Running it under valgrind gave me:

==9687== Invalid free() / delete / delete[] / realloc()
==9687==    at 0x4A06016: operator delete(void*) (vg_replace_malloc.c:480)
==9687==    by 0x3A81035D2C: __cxa_finalize (in /lib64/libc-2.12.so) 
==9687==    by 0x721CD05: ??? (in /lib/libboost_unit_test_framework-gcc71-mt-d-1_65_1.so.1.65.1)
==9687==    by 0x72ABF9C: ??? (in /lib/libboost_unit_test_framework-gcc71-mt-d-1_65_1.so.1.65.1)
==9687==    by 0x3A81035991: exit (in /lib64/libc-2.12.so)
==9687==    by 0x3A8101ED23: (below main) (in /lib64/libc-2.12.so)   
==9687==  Address 0x9919d80 is 0 bytes inside a block of size 18 free'd
==9687==    at 0x4A06016: operator delete(void*) (vg_replace_malloc.c:480)
==9687==    by 0x3A81035991: exit (in /lib64/libc-2.12.so)
==9687==    by 0x3A8101ED23: (below main) (in /lib64/libc-2.12.so)   

The stack trace from GDB looks like this:

#0  0x0000003a81032495 in raise () from /lib64/libc.so.6
#1  0x0000003a81033c75 in abort () from /lib64/libc.so.6
#2  0x0000003a810703a7 in __libc_message () from /lib64/libc.so.6
#3  0x0000003a81075dee in malloc_printerr () from /lib64/libc.so.6
#4  0x0000003a810761f3 in malloc_consolidate () from /lib64/libc.so.6
#5  0x0000003a81078c18 in _int_free () from /lib64/libc.so.6
#6  0x00000000005feae8 in boost::checked_array_delete<char> (x=0x991a20 "\210\350\070\201:") at /include/boost-1_65_1/boost/core/checked_delete.hpp:41
#7  0x00000000005fbd21 in boost::scoped_array<char>::~scoped_array (this=0x94bd80, __in_chrg=<optimized out>) at /include/boost-1_65_1/boost/smart_ptr/scoped_array.hpp:69
#8  0x00000000005f9d36 in boost::execution_monitor::~execution_monitor (this=0x94bd60, __in_chrg=<optimized out>)
    at /include/boost-1_65_1/boost/test/execution_monitor.hpp:316
#9  0x00000000005fbd3c in boost::unit_test::unit_test_monitor_t::~unit_test_monitor_t (this=0x94bd60, __in_chrg=<optimized out>)
    at /include/boost-1_65_1/boost/test/unit_test_monitor.hpp:33
#10 0x0000003a81035992 in exit () from /lib64/libc.so.6
#11 0x0000003a8101ed24 in __libc_start_main () from /lib64/libc.so.6
#12 0x00000000005f5b59 in _start ()

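This happens whenever an uncaught exception is thrown (including test failures), and also on some other (currently unknown) occasions. The crash on an exception, however, is 100% reproducible.

The program itself seems fine, because it runs locally without any such crash. So I assume this is caused by an incompatibility between some modules on the cluster.

To rule this out I recompiled Boost and OpenBLAS, but I still use several other libraries that I do not want to rebuild (it takes a lot of time) just to test each of them. These are libSSH2, GPI2, and HDF5. They do not show up in ldd, so I assume they are linked statically (I am not the author of the tests) and consider them unlikely to cause the problem: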

    linux-vdso.so.1 =>
    libpthread.so.0 => /lib64/libpthread.so.0
    librt.so.1 => /lib64/librt.so.1
    libboost_filesystem-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_filesystem-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_program_options-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_program_options-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_coroutine-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_coroutine-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_context-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_context-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_iostreams-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_iostreams-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_regex-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_regex-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_thread-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_thread-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_date_time-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_date_time-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_chrono-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_chrono-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_atomic-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_atomic-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_system-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_system-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_serialization-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_serialization-gcc71-mt-d-1_65_1.so.1.65.1
    libdl.so.2 => /lib64/libdl.so.2
    libssl.so.10 => /usr/lib64/libssl.so.10
    libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2
    libkrb5.so.3 => /lib64/libkrb5.so.3
    libcom_err.so.2 => /lib64/libcom_err.so.2
    libk5crypto.so.3 => /lib64/libk5crypto.so.3
    libresolv.so.2 => /lib64/libresolv.so.2
    libcrypto.so.10 => /usr/lib64/libcrypto.so.10
    libz.so.1 => /lib64/libz.so.1
    libstdc++.so.6 => /sw/global/compilers/gcc/7.1.0/lib64/libstdc++.so.6
    libm.so.6 => /lib64/libm.so.6
    libgcc_s.so.1 => /sw/global/compilers/gcc/7.1.0/lib64/libgcc_s.so.1
    libc.so.6 => /lib64/libc.so.6
    /lib64/ld-linux-x86-64.so.2
    libbz2.so.1 => /lib64/libbz2.so.1
    liblzma.so.0 => /usr/lib64/liblzma.so.0
    libicudata.so.42 => /usr/lib64/libicudata.so.42
    libicui18n.so.42 => /usr/lib64/libicui18n.so.42
    libicuuc.so.42 => /usr/lib64/libicuuc.so.42
    libkrb5support.so.0 => /lib64/libkrb5support.so.0
    libkeyutils.so.1 => /lib64/libkeyutils.so.1
    libselinux.so.1 => /lib64/libselinux.so.1

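From what I have found, I think the second free is the "correct" one, since it is the smart pointer releasing its memory. So the first delete is the wrong one, but it comes from inside exit, which does not help me.

How can I find out why and how the pointer gets freed twice? Note that I do not have root on the cluster, so debug symbols for the GCC libraries are not available.

The compiler used is GCC 7.1 with Boost 1.65.1, although I have also tried other Boost versions and GCC 5.3.

I reduced a test case to:

  • Link against the library
  • BOOST_AUTO_TEST_CASE(...)
  • std::runtime_error

So the problem lies in the static initialization/finalization of the library.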
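
For context on what "static finalization" covers here, a minimal sketch that is independent of Boost and of the library in question: objects with static storage duration are destroyed from inside exit(), in reverse order of their construction, which is why both frees in the traces above sit beneath exit. The names Tracked and event_log are hypothetical and exist only for this illustration.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: a log that is intentionally never destroyed, so it
// stays usable while other static objects are torn down inside exit().
std::vector<std::string>& event_log() {
    static std::vector<std::string>* log = new std::vector<std::string>();
    return *log;
}

// Hypothetical type standing in for any global owned by a linked library.
struct Tracked {
    explicit Tracked(const char* n) : name(n) {
        event_log().push_back(std::string("construct ") + name);
    }
    ~Tracked() {
        event_log().push_back(std::string("destroy ") + name);
        // Print directly as well, because static destructors run after main.
        std::fprintf(stderr, "destroy %s\n", name);
    }
    const char* name;
};

// Constructed in order of definition within this translation unit ...
Tracked first("first");
Tracked second("second");
// ... and destroyed in reverse order once main returns or exit() is called.
```

In a program containing these definitions, "destroy second" is printed before "destroy first", and both lines appear only after main has returned. If the same global ended up registered for destruction twice, the second destructor call would run at this stage as well.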

1 Answer

Are you using datasets (data-driven test cases)?

If so, you may be running into https://svn.boost.org/trac10/ticket/13380

I ran into it before and analyzed it here: Boost's data-driven tests' join operator `+` corrupts first column

Answered 2018-04-09T12:45:13.477