2

I occasionally have Apache processes that go to 100% of CPU usage and never quit. When this happens 8 times (8 CPUs), the server becomes unusable. According to server-status, the "hung" process is a fairly complex custom Perl program, but when I print out warnings to the error log within the Perl, it shows that the process always goes to completion and returns, but apparently after returning, it goes into a loop or something. When I run strace on the process, it just shows tons of mmap2/munmap lines, e.g.:

mmap2(NULL, 4329472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d7000
munmap(0xb42d7000, 4329472)             = 0
mmap2(NULL, 4329472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d7000
munmap(0xb42d7000, 4329472)             = 0
mmap2(NULL, 4329472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d7000
munmap(0xb42d7000, 4329472)             = 0
mmap2(NULL, 4329472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d7000
munmap(0xb42d7000, 4329472)             = 0
mmap2(NULL, 4329472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d7000
munmap(0xb42d7000, 4329472)             = 0
mmap2(NULL, 4329472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d7000
munmap(0xb42d7000, 4329472)             = 0
mmap2(NULL, 4329472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d7000
munmap(0xb42d7000, 4329472)             = 0
mmap2(NULL, 4329472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d7000
munmap(0xb42d7000, 4329472)             = 0
mmap2(NULL, 4329472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d7000
munmap(0xb42d7000, 4329472)             = 0
mmap2(NULL, 4329472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d7000
munmap(0xb42d7000, 4329472)             = 0
mmap2(NULL, 4329472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d7000
munmap(0xb42d7000, 4329472)             = 0
mmap2(NULL, 4329472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d7000
munmap(0xb42d7000, 4329472)             = 0
mmap2(NULL, 4329472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d7000
munmap(0xb42d7000, 4329472)             = 0
mmap2(NULL, 4329472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d7000
mremap(0xb3d7c000, 4329472, 4333568, MREMAP_MAYMOVE) = 0xb3d7c000
munmap(0xb42d7000, 4329472)             = 0
mmap2(NULL, 4333568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d6000
munmap(0xb42d6000, 4333568)             = 0
mmap2(NULL, 4333568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d6000
munmap(0xb42d6000, 4333568)             = 0
mmap2(NULL, 4333568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d6000
munmap(0xb42d6000, 4333568)             = 0
mmap2(NULL, 4333568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d6000
munmap(0xb42d6000, 4333568)             = 0
mmap2(NULL, 4333568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d6000
munmap(0xb42d6000, 4333568)             = 0
mmap2(NULL, 4333568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d6000
munmap(0xb42d6000, 4333568)             = 0
mmap2(NULL, 4333568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d6000
munmap(0xb42d6000, 4333568)             = 0
mmap2(NULL, 4333568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d6000
munmap(0xb42d6000, 4333568)             = 0
mmap2(NULL, 4333568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d6000
munmap(0xb42d6000, 4333568)             = 0
mmap2(NULL, 4333568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d6000
munmap(0xb42d6000, 4333568)             = 0
mmap2(NULL, 4333568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d6000
munmap(0xb42d6000, 4333568)             = 0
mmap2(NULL, 4333568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d6000
munmap(0xb42d6000, 4333568)             = 0
mmap2(NULL, 4333568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb42d6000

I have no idea what that means. It does this indefinitely. Is there any way to get a higher-level view of what it's doing? Has anyone seen anything similar to this?

Also, while it is usually a fairly random happenstance, if I issue an rflush() before exiting the Perl program, this almost always happens.

I'm using mod_perl/2.0.7, perl/5.12.4, apache/2.2.24. This was happening in a few minor versions down as well; I upgraded and it didn't improve anything. I'm also using DBI, DBD:ODBC.

My best guess is some kind of contention/race condition, but tracing the code with "warn" output indicates no such problem within the Perl itself. The Perl code also uses evals with timeout alarm signals that never get tripped, so it doesn't seem possible to be an issue with the Perl code.

Any ideas would be much appreciated.

4

1 回答 1

0

I think I've solved it. Still don’t know why rflush() would have made it worse, nor did I find any higher level view of the strace results, but this is why it was doing it:

At some point, I added the following piece of code to the “error” subroutine, which is called any time the program detects an error internal to itself:

my $caller = "";
my $x = 1;
while (caller($x) && $x < 10)
{
        my $subroutine = (caller($x))[3];
        $subroutine =~ s/^.*::([^:]+?)$/$1/gis;
        my $line = (caller($x))[2];
        $caller = qq~->$subroutine(line $line)~.$caller if $subroutine;
}
$caller = "main".$caller;

All this does is print out which subroutine the error occurred in. So, for instance if in the main body of the program, the subroutine “sub1” is called from line 1234, and from there “sub2” is called from line 2345, and an error occurs in “sub2” on line 3456, then the “error” subroutine will note that the error occurred in “main(line 1234)->sub1(line 2345)->sub2(line 3456)”. This is important for debugging, obviously. Unfortunately, it doesn’t help debugging the “error” subroutine itself!

In the “while” loop, it checks to make sure that $x < 10 (there shouldn’t be 10 levels of subroutine calls). This is meant to prevent run-away iteration. Unfortunately, the line to actually increment the variable $x is missing. What this means is that it’s going to keep checking for $x < 10 indefinitely because $x always = 1. The tricky part about this is that the “error” subroutine is called from a signal trap handler, which apparently forks from the main program itself, so the “error” subroutine can run in this infinite loop while the body of the program appears to finish. That made it impossible to “print” or “warn” a linear view of what was happening, leading to great confusion. That also made it so that this infinite loop ran outside of my main “eval” which has a timeout on it, so the timeout never stopped the loop like it was expected to. The program completed to the very last line and output, but the process which was serving it just kept going, eating up 100% of the CPU due to this infinite loop in the signal handler.

The solution is just four characters: “$x++”. Adding that inside the while loop prevents the infinite iteration and actually allows the debugging info from the signal handler to print out.

于 2013-04-22T21:41:57.687 回答