11

What can cause a segmentation fault when just entering a function?

The function being entered looks like this:

21:  void eesu3(Matrix & iQ)
22:  {

where Matrix is a struct. When run under GDB, the backtrace shows:

(gdb) backtrace 
#0  eesu3 (iQ=...) at /home/.../eesu3.cc:22
#1  ...

GDB does not say what iQ is; it literally prints "...". What could cause this?

GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

The program was built with -O3 -g.

The caller looks like this:

Matrix q;
// do some stuff with q
eesu3(q);

Nothing special here.

I reran the program under valgrind:

valgrind --tool=memcheck --leak-check=yes --show-reachable=yes --num-callers=20 --track-fds=yes <prgname>

Output:

==2240== Warning: client switching stacks?  SP change: 0x7fef7ef68 --> 0x7fe5e3000
==2240==          to suppress, use: --max-stackframe=10076008 or greater
==2240== Invalid write of size 8
==2240==    at 0x14C765B: eesu3( Matrix &) (eesu3.cc:22)
...
==2240==  Address 0x7fe5e3fd8 is on thread 1's stack
==2240== 
==2240== Can't extend stack to 0x7fe5e2420 during signal delivery for thread 1:
==2240==   no stack segment
==2240== 
==2240== Process terminating with default action of signal 11 (SIGSEGV)
==2240==  Access not within mapped region at address 0x7FE5E2420
==2240==    at 0x14C765B: eesu3( Matrix&) (eesu3.cc:22)
==2240==  If you believe this happened as a result of a stack
==2240==  overflow in your program's main thread (unlikely but
==2240==  possible), you can try to increase the size of the
==2240==  main thread stack using the --main-stacksize= flag.
==2240==  The main thread stack size used in this run was 8388608.

It looks like a corrupted stack.

    Dump of assembler code for function eesu3( Matrix & ):
   0x00000000014c7640 <+0>: push   %rbp
   0x00000000014c7641 <+1>: mov    %rsp,%rbp
   0x00000000014c7644 <+4>: push   %r15
   0x00000000014c7646 <+6>: push   %r14
   0x00000000014c7648 <+8>: push   %r13
   0x00000000014c764a <+10>:    push   %r12
   0x00000000014c764c <+12>:    push   %rbx
   0x00000000014c764d <+13>:    and    $0xfffffffffffff000,%rsp
   0x00000000014c7654 <+20>:    sub    $0x99b000,%rsp
=> 0x00000000014c765b <+27>:    mov    %rdi,0xfd8(%rsp)

OK, to be clear: Matrix's data lives on the heap. The struct basically holds a pointer to the data. It is small, only 32 bytes (just checked).

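Now, I rebuilt the program with different optimization options:

-O0: the error does not show.

-O1: the error does show.

-O3: the error does show.

Update:

-O3 -fno-inline -fno-inline-functions: the error does not show.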

That explains it. Too many inlines into the function led to excessive stack usage.

The problem was due to a stack overflow

4 Answers

15

What can cause a segmentation fault when just entering a function?

The most frequent cause is stack exhaustion. Run (gdb) disas at the crash point. If the instruction that crashed is the first read or write to a stack location after %rsp has been decremented, then stack exhaustion is almost certainly the cause.

The solution usually involves creating threads with larger stacks, moving some large variables from the stack to the heap, or both.
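As a minimal sketch of the second option, assume a hypothetical routine that needs a multi-megabyte scratch buffer. The names kScratchDoubles and sum_with_heap_scratch are invented for illustration and do not come from the question. A std::vector keeps only a small handle in the stack frame and places the elements on the heap:

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical scratch size: 2M doubles = 16 MiB, larger than an 8 MiB stack.
constexpr std::size_t kScratchDoubles = 2 * 1024 * 1024;

// Risky alternative (deliberately not used): `double scratch[kScratchDoubles];`
// would reserve all 16 MiB in this function's stack frame on entry.
double sum_with_heap_scratch() {
    // The vector object itself is only a few pointers; the elements live on the heap.
    std::vector<double> scratch(kScratchDoubles, 1.0);
    assert(scratch.size() == kScratchDoubles);
    return std::accumulate(scratch.begin(), scratch.end(), 0.0);
}
```

The function body is otherwise unchanged by such a move; only the storage location of the large object differs.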

Another possible cause: if Matrix contains a very large array, you can't put it on the stack, because the kernel will not extend the stack beyond its current size by more than 128K (or so; I don't remember the exact value). If Matrix is bigger than that limit, you can't put it on the stack.

Update:

   0x00000000014c7654 <+20>:    sub    $0x99b000,%rsp
=> 0x00000000014c765b <+27>:    mov    %rdi,0xfd8(%rsp)

This disassembly confirms the diagnosis.

In addition, you are reserving 0x99b000 bytes on the stack (that's almost 10MB). There must be some humongous objects you are trying to place on the stack in the eesu3 routine. Don't do that.
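The arithmetic can be checked at compile time. This sketch only restates two numbers that appear in the question (the immediate operand of the sub instruction and the main-thread stack size printed by valgrind); the constant and function names are made up for illustration:

```cpp
#include <cstdint>

// Frame size taken from `sub $0x99b000,%rsp` in the disassembly.
constexpr std::uint64_t kFrameBytes = 0x99b000;
// Main-thread stack size reported by valgrind for this run.
constexpr std::uint64_t kStackBytes = 8388608;

static_assert(kFrameBytes == 10072064, "0x99b000 in decimal");
static_assert(kFrameBytes > kStackBytes, "the frame alone exceeds the whole stack");

// Number of bytes by which this single frame overshoots the 8 MiB stack.
constexpr std::uint64_t frame_overshoot_bytes() {
    return kFrameBytes - kStackBytes;
}
```

So even with an otherwise empty stack, this one frame would not fit in the 8 MiB main-thread stack used in that run.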

What do you mean by "the kernel will not extend stack beyond current by more than"

When you extend the stack (decrement %rsp) by e.g. 1MB, and then try to touch that stack location, the memory will not be accessible (the kernel grows the stack on demand). This generates a hardware trap and transfers control to the kernel. When the kernel decides what to do, it looks at

  1. Current %rsp
  2. Memory location that the application tried to access
  3. Stack limit for the current thread

If the faulting address is below the current %rsp, but within 128K (or some other constant of similar magnitude), the kernel simply extends the stack (provided such an extension will not exceed the stack limit).

If the faulting address is more than 128K below current %rsp (as appears to be the case here), you get SIGSEGV.

This all works nicely for most programs: even if they use a lot of stack in a recursive procedure, they usually extend stack in small chunks. But an equivalent program that tried to reserve all that stack in a single routine would have crashed.
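The stack limit mentioned in the list above can be inspected from inside the program. The following is a sketch using the POSIX getrlimit call; for the main thread the relevant value is the RLIMIT_STACK soft limit, while other threads get their stack size at creation time. The two helper names are hypothetical:

```cpp
#include <cassert>
#include <sys/resource.h>

// Returns the soft RLIMIT_STACK value in bytes (RLIM_INFINITY if unlimited).
rlim_t stack_soft_limit_bytes() {
    struct rlimit lim{};
    const int rc = getrlimit(RLIMIT_STACK, &lim);
    assert(rc == 0);  // not expected to fail with valid arguments
    (void)rc;         // silence unused-variable warnings when NDEBUG is defined
    return lim.rlim_cur;
}

// True if a frame of `frame_bytes` can never fit under the current soft limit.
bool frame_exceeds_stack_limit(rlim_t frame_bytes) {
    const rlim_t limit = stack_soft_limit_bytes();
    return limit != RLIM_INFINITY && frame_bytes > limit;
}
```

Calling frame_exceeds_stack_limit(0x99b000) on a system with the 8 MiB limit seen in the valgrind output would return true; the result depends on the limit configured on the machine where it runs.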

Anyway, run (gdb) info locals at the crash point, and see which locals might be requiring 10MB of stack. Then move them to the heap.

Update 2:

No locals

Ah, the program has probably not made it far enough into eesu3 for there to be locals.

when building with -O0 the error disappears. GCC bug?

It could be a GCC bug, but more likely it's just that GCC is inlining a lot of other routines into eesu3, and each of the inlined routines needs its own N KBs of stack. Does the problem disappear if you build the source containing eesu3 with -fno-inline?

Unfortunately, triage of such behavior and figuring out appropriate workarounds, or fixing GCC, requires compiler expertise. You could start by compiling with -fdump-tree-all and looking at generated <source>.*t.* files. These contain textual dumps of GCC internal representation at various stages of the compilation process. You may be able to understand enough of it to make further progress.

Answered 2012-05-08T14:13:16.710
5

It's a stack overflow.

eesu3 tries to allocate something very large on the stack, which can be seen in its assembly code:

sub    $0x99b000,%rsp

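This means about 10 MB of stack space (0x99b000 = 10,072,064 bytes) is reserved at once, which is more than the 8 MB main-thread stack reported by valgrind.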

The problem can be in eesu3 itself or in a function it calls that the compiler chooses to inline.

My guess is that the problem is in a function that eesu3 calls, but not in the case you tested (a debugging function?).
I guess this because it doesn't happen without optimization: with optimization, the function is inlined into eesu3, so eesu3 uses lots of stack. Without it, the function is not inlined, so you get a problem only when it is actually called.
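If that guess is right, one possible workaround is to keep the large-frame helper out of line. The sketch below uses the GCC/Clang noinline attribute; debug_dump and process are hypothetical stand-ins, not functions from the question, and the 1 MiB buffer is an arbitrary illustrative size:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical rarely-used helper with a large local buffer (1 MiB).
// The attribute keeps it out of line, so the buffer is reserved only while
// the helper is running, not on every entry to the caller.
__attribute__((noinline)) std::size_t debug_dump(const char* msg) {
    assert(msg != nullptr);
    char buffer[1 << 20];
    std::strncpy(buffer, msg, sizeof buffer - 1);
    buffer[sizeof buffer - 1] = '\0';
    return std::strlen(buffer);
}

// Hypothetical caller: pays for the big frame only when `verbose` is true.
std::size_t process(bool verbose) {
    return verbose ? debug_dump("state dump") : 0;
}
```

This only moves the stack cost to the moment the helper is called; if the helper's frame is itself too large for the stack, its buffer still has to move to the heap.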

Answered 2012-05-08T15:17:07.527
0

If it's a Matrix, check the indices you are trying to access. Maybe you are accessing elements that go beyond the dimensions of the Matrix object?

Answered 2012-05-08T13:42:14.783
0

You probably have some variables initialized in the function

void eesu3(Matrix & iQ)

Although the debugger might step through variable declarations, the storage for them is probably reserved at the start of the scope (that is, your function). If you declare a very large buffer like so:

char buffer[268435456];

you could get a stack overflow. It might be better to allocate the memory on the heap, like

void * pvBuffer = malloc(268435456);

Have you declared a large buffer, one too large to put on the stack? Maximum buffer sizes may differ between architectures (64-bit versus 32-bit OSes) or between kernels, which would fit with your remark that the program runs fine on one machine but not on the other.

Answered 2012-10-11T10:30:04.987