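I have some C code embedded in an R function that keeps segfaulting in the same way, but at different points in the program's run (it always seems to come from the same function).
Here's the thing: the error I get is: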
*** glibc detected *** /packages/R/2.15.0/lib64/R/bin/exec/R: munmap_chunk():
invalid pointer: 0x0000000014059b20 ***
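Now this is a pretty standard error (if I remember correctly, munmap_chunk() is called from within free()). The odd thing is that the error comes from a function that frees a set of arrays held in a struct (the program allocates and frees millions of these structs over the course of a normal run).
The function looks like this: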
multifit_work_t *free_multifit(multifit_work_t *work)
{
    if (work == NULL || work->u==NULL || work->w==NULL || work->v==NULL || work->b==NULL || work->rv1==NULL) {
        fprintf(stderr,"ERROR: Internal array in multifit_work_t object was already NULL\n");
        exit(1);
    }
    // each of the work->* arrays is just an array of doubles of length 1 or more.
    // LOGGING FUNCTIONALITY: Here, prints out the address and values of each
    // of the arrays
    // free each array first
    free(work->u);
    free(work->w);
    free(work->v);
    free(work->b);
    free(work->rv1);
    free(work);
    // LOGGING FUNCTIONALITY: Here prints an "Exiting free_multifit()" message
    return NULL;
}
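So I check each pointer for NULL before freeing it. I added logging that prints the address and first value of each array. Grepping the log file from a crashing run for the pointer named in the error above gives a lot of hits (understandably, since the same memory location gets reused after it has been freed):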
$: grep 14059b20 logfile.txt
....
194624) work->b: ADDRESS: [0x14059b20] VALUE: [-5.620804e-02]
194629) work->b: ADDRESS: [0x14059b20] VALUE: [2.759472e+00]
194634) work->b: ADDRESS: [0x14059b20] VALUE: [5.498979e-02]
194684) work->b: ADDRESS: [0x14059b20] VALUE: [9.323869e+07]
194689) work->b: ADDRESS: [0x14059b20] VALUE: [3.016410e+07]
194694) work->b: ADDRESS: [0x14059b20] VALUE: [1.688376e-08]
194699) work->b: ADDRESS: [0x14059b20] VALUE: [1.660441e+00]
.....
Operation 194699 belongs to the last set of values I get before the segfault:
Calling free_multifit...
194696) work->u: ADDRESS: [0x1305f7d0] VALUE: [1.350474e+01]
194697) work->w: ADDRESS: [0x92ec810] VALUE: [1.350474e+01]
194698) work->v: ADDRESS: [0x122cc210] VALUE: [5.798884e-09]
194699) work->b: ADDRESS: [0x14059b20] VALUE: [1.660441e+00]
194700) work->rv1: ADDRESS: [0xea37a50] VALUE: [0.000000e+00]
< If it didn't crash in the function we'd see an "Exiting free_multifit()" message here, so it segfaults while freeing one of the arrays or the work object itself.
[EOF]
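So although the pointer passes the NULL check, and a value (1.66) can actually be read from the location it points to, everything seems to go wrong when I try to free it.
Any ideas why or how this could happen? Is it a hardware problem? I'm running this on a cluster, if that makes any difference.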
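One thing the NULL checks cannot rule out is an earlier out-of-bounds write that damaged the allocator's bookkeeping next to one of the arrays. That is only a guess at this point. Below is a minimal sketch of a guard-byte wrapper I could swap in to test it. The names guarded_alloc_doubles, guarded_check and guarded_free are hypothetical helpers, not part of my program or of any library, and the check only catches writes that land in the 16 bytes directly before or after an array.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical debugging helpers; none of these names exist in the original program. */
enum { GUARD = 16, PREFIX = 32 };  /* a 32-byte prefix keeps the user pointer aligned for double */
#define FILL 0xA5                  /* pattern written into both guard regions */

/* Block layout: [size_t size | padding | front guard][user data][back guard] */
double *guarded_alloc_doubles(size_t count)
{
    if (count > (SIZE_MAX - PREFIX - GUARD) / sizeof(double))
        return NULL;                            /* size computation would wrap around */
    size_t bytes = count * sizeof(double);
    unsigned char *raw = malloc(PREFIX + bytes + GUARD);
    if (raw == NULL)
        return NULL;
    memcpy(raw, &bytes, sizeof bytes);          /* remember the user-visible size */
    memset(raw + PREFIX - GUARD, FILL, GUARD);  /* front guard */
    memset(raw + PREFIX + bytes, FILL, GUARD);  /* back guard */
    return (double *)(raw + PREFIX);
}

/* Returns 0 if both guards are intact; bit 1 = front guard damaged, bit 2 = back guard damaged.
 * If the stored size itself was overwritten, the back-guard result is unreliable. */
int guarded_check(const double *p)
{
    assert(p != NULL);
    const unsigned char *user = (const unsigned char *)p;
    const unsigned char *raw = user - PREFIX;
    size_t bytes;
    int damage = 0;
    memcpy(&bytes, raw, sizeof bytes);
    for (size_t i = 0; i < GUARD; i++) {
        if (raw[PREFIX - GUARD + i] != FILL) damage |= 1;
        if (user[bytes + i] != FILL)         damage |= 2;
    }
    return damage;
}

/* Checks the guards, reports damage on stderr, then releases the whole block. */
int guarded_free(double *p, const char *name)
{
    int damage = guarded_check(p);
    if (damage)
        fprintf(stderr, "guard bytes around %s were overwritten (flags=%d)\n", name, damage);
    free((unsigned char *)p - PREFIX);
    return damage;
}
```

Replacing the malloc() and free() calls for the five arrays with these would let me call guarded_check(work->b) after each numerical step and narrow down which routine writes past the end, instead of only finding out when free() aborts.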
Update 1
multifit_work_t is created as follows:
typedef struct {
    int m,n;
    double *w,*u,*v,*b,*rv1;
} multifit_work_t;
multifit_work_t *alloc_multifit(int m, int n)
{
    multifit_work_t *work=(multifit_work_t *)malloc(sizeof(multifit_work_t));
    if (work==NULL) {
        fprintf(stderr,"failed to allocate multifit_work\n");
        exit(1);
    }
    work->m=m;
    work->n=n;
    work->u=(double *)malloc(n*m*sizeof(double)); /* temporary storage - n x m matrix */
    work->w=(double *)malloc(n*sizeof(double)); /* n vector */
    work->v=(double *)malloc(n*n*sizeof(double)); /* n x n matrix */
    work->b=(double *)malloc(m*sizeof(double)); /* m vector */
    work->rv1=(double *)malloc(n*sizeof(double)); /* temporary storage - n vector */
    if (work->u==NULL || work->w==NULL || work->v==NULL || work->b==NULL || work->rv1==NULL) {
        fprintf(stderr,"failed to allocate multifit_work\n");
        exit(1);
    }
    return work;
}
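One thing I have not ruled out in the allocator is the size arithmetic: in n*m*sizeof(double) the product n*m is evaluated in int before it is widened, and nothing rejects m or n being zero or negative. I have no evidence that this is what bites me, but a checked version is cheap to try. A minimal sketch, assuming a hypothetical helper named alloc_doubles_checked (not something in my code):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocates rows*cols doubles.
 * Returns NULL for non-positive dimensions, for a byte count that
 * would not fit in size_t, or when malloc() itself fails. */
double *alloc_doubles_checked(int rows, int cols)
{
    if (rows <= 0 || cols <= 0)
        return NULL;
    size_t r = (size_t)rows;
    size_t c = (size_t)cols;
    /* Reject r * c * sizeof(double) > SIZE_MAX without ever computing it. */
    if (r > SIZE_MAX / sizeof(double) / c)
        return NULL;
    size_t total = r * c * sizeof(double);
    assert(total / sizeof(double) / c == r);  /* invariant: the product did not wrap */
    return malloc(total);
}
```

In alloc_multifit() this would read work->u = alloc_doubles_checked(n, m); for the matrix and work->b = alloc_doubles_checked(m, 1); for a vector, with the existing NULL test after the five calls left as it is.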
Update 2
When I run it on my local system the same thing happens, but the error looks like this:
*** caught segfault ***
address 0x11e000000, cause 'memory not mapped'
It is always at an apparently even memory address.
Update 3
Here is the valgrind report:
valgrind --leak-check=full --show-reachable=yes ./execute
==23072== Memcheck, a memory error detector
==23072== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==23072== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==23072== Command: ./execute
==23072==
==23072==
==23072== HEAP SUMMARY:
==23072== in use at exit: 0 bytes in 0 blocks
==23072== total heap usage: 445 allocs, 445 frees, 27,900 bytes allocated
==23072==
==23072== All heap blocks were freed -- no leaks are possible
==23072==
==23072== For counts of detected and suppressed errors, rerun with: -v
==23072== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 23 from 8)
This is killing me!