6

I have a very frustrating problem. My application runs on a few machines flawlessly for a month. However, there is one machine on which my application crashes nearly every day because of segfault. It always crashes at the same instruction address:

segfault at 7fec33ef36a8 ip 000000000041c16d sp 00007fec50a55c80 error 6 in myapp[400000+f8000]

This address points to memcpy call.

Below, there is an excerpt #1 from my app:

....
uint32_t size = messageSize - sizeof(uint64_t) + 1;

stack->trcData = (char*)Realloc(stack->trcData,(stack->trcSize + size + sizeof(uint32_t)));
char* buffer = stack->trcData + stack->trcSize;

uint32_t n_size = htonl(size);
memcpy(buffer,&n_size,sizeof(uint32_t)); /* ip 000000000041c16d points here*/
buffer += sizeof(uint32_t);

....
stack->trcSize += size + sizeof(uint32_t);
....

where stack is a structure:

struct Stack{
  char*     trcData;    
  uint32_t  trcSize;    
  /* ... some other elements */
};

and Realloc is a realloc wrapper:

#define Realloc(x,y)    _Realloc((x),(y),__LINE__)

void* _Realloc(void* ptr,size_t size,int line){

  void *tmp = realloc(ptr,size);
  if(tmp == NULL){
    fprintf(stderr,"R%i: Out of memory: trying to allocate: %lu.\n",line,size);
    exit(EXIT_FAILURE);
  }
  return tmp;
}

messageSize is of uint32_t type and its value is always greater than 44 bytes. The code #1 runs in a loop. stack->trcData is just a buffer which collects some data until some condition is fulfilled. stack->trcData is always initialized to NULL. The application is compiled with gcc with optimization -O3 enabled. When I run it in gdb, of course it did not crash, as I expected;)

I ran out of ideas why myapp crashes during memcpy call. Realloc returns with no error, so I guess it allocated enough space and I can write to this area. Valgrind

valgrind --leak-check=full --track-origins=yes --show-reachable=yes myapp

shows absolutely no invalid reads/writes.

Is it possible that on this particular machine the memory itself is corrupted and it causes these often crashes? Or maybe I corrupt memory somewhere else in myapp, but if this is the case, why it does not crash earlier, when the invalid write is made?

Thanks in advance for any help.

Assembly piece:

41c164: 00 
41c165: 48 01 d0                add    %rdx,%rax
41c168: 44 89 ea                mov    %r13d,%edx
41c16b: 0f ca                   bswap  %edx
41c16d: 89 10                   mov    %edx,(%rax)
41c16f: 0f b6 94 24 47 10 00    movzbl 0x1047(%rsp),%edx
41c176: 00

I'm not sure whether this information is relevant but all the machines, my application runs on successfully, have Intel processors whilst the one causing the problem has AMD.

4

1 回答 1

0

Here is the cause of my problem. The point is that at some loop step stack->trcSize + size exceeds UINT32_MAX. That means Realloc in fact shrinks stc->trcData. Next, I define buffer which now is far behind the allocated area. Hence, when I write to buffer I get segfault. I've checked it and it was indeed the cause.

于 2013-09-12T11:52:23.807 回答