I have a very frustrating problem. My application runs on a few machines flawlessly for a month. However, there is one machine on which my application crashes nearly every day because of segfault. It always crashes at the same instruction address:
segfault at 7fec33ef36a8 ip 000000000041c16d sp 00007fec50a55c80 error 6 in myapp[400000+f8000]
This address points to memcpy
call.
Below, there is an excerpt #1 from my app:
....
uint32_t size = messageSize - sizeof(uint64_t) + 1;
stack->trcData = (char*)Realloc(stack->trcData,(stack->trcSize + size + sizeof(uint32_t)));
char* buffer = stack->trcData + stack->trcSize;
uint32_t n_size = htonl(size);
memcpy(buffer,&n_size,sizeof(uint32_t)); /* ip 000000000041c16d points here*/
buffer += sizeof(uint32_t);
....
stack->trcSize += size + sizeof(uint32_t);
....
where stack
is a structure:
struct Stack{
char* trcData;
uint32_t trcSize;
/* ... some other elements */
};
and Realloc
is a realloc
wrapper:
#define Realloc(x,y) _Realloc((x),(y),__LINE__)
void* _Realloc(void* ptr,size_t size,int line){
void *tmp = realloc(ptr,size);
if(tmp == NULL){
fprintf(stderr,"R%i: Out of memory: trying to allocate: %lu.\n",line,size);
exit(EXIT_FAILURE);
}
return tmp;
}
messageSize
is of uint32_t
type and its value is always greater than 44 bytes. The code #1 runs in a loop. stack->trcData
is just a buffer which collects some data until some condition is fulfilled. stack->trcData
is always initialized to NULL
. The application is compiled with gcc
with optimization -O3
enabled. When I run it in gdb
, of course it did not crash, as I expected;)
I ran out of ideas why myapp crashes during memcpy
call. Realloc
returns with no error, so I guess it allocated enough space and I can write to this area. Valgrind
valgrind --leak-check=full --track-origins=yes --show-reachable=yes myapp
shows absolutely no invalid reads/writes.
Is it possible that on this particular machine the memory itself is corrupted and it causes these often crashes? Or maybe I corrupt memory somewhere else in myapp, but if this is the case, why it does not crash earlier, when the invalid write is made?
Thanks in advance for any help.
Assembly piece:
41c164: 00
41c165: 48 01 d0 add %rdx,%rax
41c168: 44 89 ea mov %r13d,%edx
41c16b: 0f ca bswap %edx
41c16d: 89 10 mov %edx,(%rax)
41c16f: 0f b6 94 24 47 10 00 movzbl 0x1047(%rsp),%edx
41c176: 00
I'm not sure whether this information is relevant but all the machines, my application runs on successfully, have Intel processors whilst the one causing the problem has AMD.