1. Problem Background
Recently, a core dump occurred on one of our online search servers. The crash happened in memset()
because of an attempt to write to an invalid address, so the process received SIGSEGV. The following line is from dmesg:
is_searcher_ser[17405]: segfault at 000000002c32a668 rip 0000003da0a7b006 rsp 0000000053abc790 error 6
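As a side note, the "error 6" field in the dmesg line can be decoded bit by bit. The sketch below assumes the standard x86 page-fault error code layout (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode). The helper name describe_fault is hypothetical and used only for illustration.

```cpp
#include <cstdint>
#include <string>

// Decode the low three bits of an x86 page-fault error code, as printed in
// the kernel's "segfault at ... error N" message.
// describe_fault is a hypothetical helper, not part of any library.
std::string describe_fault(std::uint64_t error_code) {
    std::string s;
    s += (error_code & 0x1) ? "protection violation" : "page not present";
    s += (error_code & 0x2) ? ", write" : ", read";
    s += (error_code & 0x4) ? ", user mode" : ", kernel mode";
    return s;
}
```

For error 6 (binary 110) this yields a user-mode write to a page that is not mapped, which agrees with the description above.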
The environment of our online servers is as follows:
- OS: RHEL 5.3
- Kernel: 2.6.18-131.el5.custom, x86_64 (64-bit)
- GCC: 4.1.2 20080704 (Red Hat 4.1.2-44)
- Glibc: glibc-2.5-49.6
The following is the relevant code snippet:
CHashMap<…>::CHashMap(…)
{
    …
    typedef HashEntry *HashEntryPtr;
    m_ppEntry = new HashEntryPtr[m_nHashSize]; // m_nHashSize is 389 at the time of the crash
    assert(m_ppEntry != NULL);
    memset(m_ppEntry, 0x0, m_nHashSize*sizeof(HashEntryPtr)); // The crash occurs inside this memset() call
    …
}
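For offline experiments, the allocate-then-zero pattern from the constructor can be isolated in a small stand-alone sketch. The HashEntry layout and the helper names make_buckets and count_null are assumptions made for illustration. The real class is elided in the snippet above.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical entry layout; the real HashEntry is not shown in the post.
struct HashEntry {
    int key;
    HashEntry *next;
};
typedef HashEntry *HashEntryPtr;

// Allocate a bucket array and zero it the same way the constructor does.
HashEntryPtr *make_buckets(std::size_t n) {
    HashEntryPtr *buckets = new HashEntryPtr[n];
    std::memset(buckets, 0x0, n * sizeof(HashEntryPtr));
    return buckets;
}

// Count how many buckets are null after zeroing.
std::size_t count_null(const HashEntryPtr *buckets, std::size_t n) {
    std::size_t nulls = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (buckets[i] == NULL) {
            ++nulls;
        }
    }
    return nulls;
}
```

Calling make_buckets(389) and checking that count_null returns 389 exercises the same code path. The caller is expected to release the array with delete[].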
The assembly generated for the code above is:
…
0x000000000091fe9e <+110>: callq 0x502638 <_Znam@plt> // new HashEntryPtr[m_nHashSize]
0x000000000091fea3 <+115>: mov 0xc(%rbx),%edx // Get the value of m_nHashSize
0x000000000091fea6 <+118>: mov %rax,%rdi // Put the m_ppEntry pointer into %rdi for the later memset() invocation
0x000000000091fea9 <+121>: mov %rax,0x20(%rbx) // Store the pointer into the m_ppEntry member variable (%rbx holds the this pointer)
0x000000000091fead <+125>: xor %esi,%esi // Generate 0
0x000000000091feaf <+127>: shl $0x3,%rdx // m_nHashSize*sizeof(HashEntryPtr)
0x000000000091feb3 <+131>: callq 0x502b38 <memset@plt> // Call the memset() function
…
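The length argument computed by the two instructions "mov 0xc(%rbx),%edx" and "shl $0x3,%rdx" can be checked with a small sketch. It assumes m_nHashSize is an unsigned 32-bit value, which the zero-extending mov into %edx suggests. The function name memset_length is hypothetical.

```cpp
#include <cstdint>

// Mirrors "mov 0xc(%rbx),%edx; shl $0x3,%rdx": the 32-bit size is
// zero-extended to 64 bits and then multiplied by 8, the size of a
// pointer on x86_64.
std::uint64_t memset_length(std::uint32_t hash_size) {
    std::uint64_t rdx = hash_size;  // writing %edx zero-extends into %rdx
    return rdx << 3;                // multiply by sizeof(HashEntryPtr) == 8
}
```

With m_nHashSize equal to 389, the third argument of memset() is 3112 bytes, so the length itself is small and unremarkable.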
In the core dump, the disassembly of memset@plt is:
(gdb) disassemble 0x502b38
Dump of assembler code for function memset@plt:
0x0000000000502b38 <+0>: jmpq *0x771b92(%rip) # 0xc746d0 <memset@got.plt>
0x0000000000502b3e <+6>: pushq $0x53
0x0000000000502b43 <+11>: jmpq 0x5025f8
End of assembler dump.
(gdb) x/ag 0x0000000000502b3e+0x771b92
0xc746d0 <memset@got.plt>: 0x3da0a7acb0 <memset>
(gdb) disassemble 0x3da0a7acb0
Dump of assembler code for function memset:
0x0000003da0a7acb0 <+0>: cmp $0x1,%rdx
0x0000003da0a7acb4 <+4>: mov %rdi,%rax
…
From the above GDB analysis, we know that the address of memset() has already been resolved: the memset@got.plt entry holds the real address of memset(). In other words, the first instruction, jmpq *0x771b92(%rip), jumps directly to the first instruction of memset(). Besides, the program had been running online for nearly a day, so the address of memset() should have been resolved much earlier.
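The GOT slot address used in the x/ag command follows from RIP-relative addressing: the displacement is added to the address of the next instruction. A minimal sketch of that arithmetic, with the hypothetical helper rip_relative_target:

```cpp
#include <cstdint>

// RIP-relative addressing: the target is the address of the *next*
// instruction plus the signed displacement encoded in the instruction.
constexpr std::uint64_t rip_relative_target(std::uint64_t insn_addr,
                                            std::uint64_t insn_len,
                                            std::int64_t disp) {
    return insn_addr + insn_len + static_cast<std::uint64_t>(disp);
}
```

The jmpq at 0x502b38 is 6 bytes long (the next instruction starts at 0x502b3e), so the slot is 0x502b3e + 0x771b92 = 0xc746d0, which is the memset@got.plt address shown by GDB.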
2. Weird Phenomenon
The crash was triggered by the following instruction in memset():
=> 0x0000003da0a7b006 <+854>: mov %rdx,-0x8(%rdi)
This is the instruction in memset() that writes 0 to the very beginning of the buffer, i.e., the buffer passed as the first parameter of memset().
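At the time of the crash, in frame 0, the value of $rdi is 0x2c32a670 and the value of $rax is 0x2c32a668. The faulting store therefore targets $rdi - 8 = 0x2c32a668, which matches the segfault address reported by dmesg. From the assembly analysis and offline tests, $rax should hold the destination buffer of memset(), i.e., its first parameter.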
So, in our example, $rax should be the same as m_ppEntry, whose value is stored into the this object (the this pointer is held in %rbx) before the buffer is zeroed by memset(). However, the value of m_ppEntry is 0x2ab02c32a668.
Checking with the GDB command info files shows that the address 0x2c32a668 is indeed invalid (not mapped), while 0x2ab02c32a668 is a valid address.
3. Why Is It Weird?
What makes this crash weird is the following: if the real address of memset() had already been resolved (which is very probably the case), then only a few instructions execute between storing the pointer into m_ppEntry and the attempt to memset() it. Moreover, the value of register $rax (which holds the buffer address passed in) is not changed at all by these instructions. So how can m_ppEntry differ from $rax?
Even weirder: at the time of the crash, the value of $rax (0x2c32a668) is exactly the lower 4 bytes of m_ppEntry (0x2ab02c32a668). If the two values are indeed related, was the m_ppEntry argument truncated when it was passed to memset()? However, the instructions involved all use %rax rather than %eax. By the way, I cannot reproduce this issue offline.
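The relationship between the two addresses can be made explicit with a short sketch. It only demonstrates the arithmetic of a 32-bit truncation and does not claim to show how the truncation happened on the server. The helper names low32 and high32 are hypothetical.

```cpp
#include <cstdint>

// Keep only the low 32 bits of a 64-bit value, as would happen if the value
// passed through a 32-bit register or an int-sized variable.
constexpr std::uint64_t low32(std::uint64_t v) {
    return static_cast<std::uint32_t>(v);
}

// Extract the upper 32 bits, i.e., the part that is lost by the truncation.
constexpr std::uint64_t high32(std::uint64_t v) {
    return v >> 32;
}
```

Applied to m_ppEntry (0x2ab02c32a668), low32 gives 0x2c32a668, exactly the value seen in $rax, and high32 gives 0x2ab0, the part that went missing.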
So,
1) Which address is valid? If 0x2c32a668 is valid, was the heap corrupted between those few instructions? How can we then explain that the value of m_ppEntry is 0x2ab02c32a668, and why are the lower 4 bytes of the two values the same?
2) If 0x2ab02c32a668 is valid, why was the address truncated when passed to the 64-bit memset()? Under what conditions can this error occur? I cannot reproduce it offline. Is this a known bug? I could not find it through Google.
3) Or is it a hardware or power issue that zeroed the upper 4 bytes of %rdi passed to memset()? (I am very reluctant to believe this.)
Finally, any comments on this crash are appreciated.
Thanks,
Gary Hu