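I have a Rails application that runs in a home-grown web server on an EC2 instance (m1.small, Amazon Linux) with an AWS RDS MySQL database (t1.micro). It runs fine for days at a time (as of this morning, uptime over the last 30 days was about 99.9%).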
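But sometimes the application stalls for about 14 minutes (the application is monitored by Pingdom). When it happens, it usually happens in batches. Today I have already had this problem 4 times. When I am fast enough, I can log in to the server, install gdb, and attach the debugger to the web server. The top of the stack looks like this: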
thread 1.
(gdb) bt
#0 0x00007fafa28b154d in read () from /lib64/libpthread.so.0
#1 0x00007faf98736332 in ?? () from /usr/lib64/mysql/libmysqlclient.so.18
#2 0x00007faf9872841f in ?? () from /usr/lib64/mysql/libmysqlclient.so.18
#3 0x00007faf98728ffa in ?? () from /usr/lib64/mysql/libmysqlclient.so.18
#4 0x00007faf98722615 in ?? () from /usr/lib64/mysql/libmysqlclient.so.18
#5 0x00007faf98726254 in ?? () from /usr/lib64/mysql/libmysqlclient.so.18
#6 0x00007faf9871e30d in mysql_ping () from /usr/lib64/mysql/libmysqlclient.so.18
#7 0x00007faf98be1aed in nogvl_ping (ptr=0x47a1ec0) at client.c:627
#8 0x00007fafa2c59c29 in rb_thread_blocking_region () from /home/ec2-user/.rvm/rubies/ruby-1.9.3-p327/lib/libruby.so.1.9
#9 0x00007faf98be1b5d in rb_mysql_client_ping (self=70801240) at client.c:636
#10 0x00007fafa2c3f108 in call_cfunc () from /home/ec2-user/.rvm/rubies/ruby-1.9.3-p327/lib/libruby.so.1.9
#11 0x00007fafa2c3fa0d in vm_call_cfunc () from /home/ec2-user/.rvm/rubies/ruby-1.9.3-p327/lib/libruby.so.1.9
#12 0x00007fafa2c400d3 in vm_call_method () from /home/ec2-user/.rvm/rubies/ruby-1.9.3-p327/lib/libruby.so.1.9
#13 0x00007fafa2c45987 in vm_exec_core () from /home/ec2-user/.rvm/rubies/ruby-1.9.3-p327/lib/libruby.so.1.9
#14 0x00007fafa2c52d2a in vm_exec () from /home/ec2-user/.rvm/rubies/ruby-1.9.3-p327/lib/libruby.so.1.9
#15 0x00007fafa2c516af in invoke_block_from_c () from /home/ec2-user/.rvm/rubies/ruby-1.9.3-p327/lib/libruby.so.1.9
#16 0x00007fafa2c517c5 in vm_yield () from /home/ec2-user/.rvm/rubies/ruby-1.9.3-p327/lib/libruby.so.1.9
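The MySQL version is 5.5. There are no entries in the database logs provided by AWS. The Rails log shows nothing but the 14-minute gap (xxx/auto_test is the URL the AWS load balancer checks every 10 seconds to test the instance):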
Started GET "/xxx/auto_test" for 10.224.95.251 at 2013-02-06 17:59:32 +0000
Processing by HealthCheckController#status as */*
Rendered health_check/status.html.erb within layouts/application (0.1ms)
Rendered layouts/_render_flash.html.erb (0.1ms)
Rendered layouts/_debug_info.html.erb (0.0ms)
Completed 200 OK in 8ms (Views: 6.0ms | ActiveRecord: 1.3ms)
Started GET "/xxx/auto_test" for 10.224.95.251 at 2013-02-06 18:13:38 +0000
Processing by HealthCheckController#status as */*
Rendered health_check/status.html.erb within layouts/application (0.1ms)
Rendered layouts/_render_flash.html.erb (0.1ms)
Rendered layouts/_debug_info.html.erb (0.0ms)
Completed 200 OK in 7ms (Views: 5.5ms | ActiveRecord: 1.2ms)
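For reference, the length of the gap can be worked out from the two "Started GET" timestamps in the log above. This is only a small sketch; the variable names are illustrative and the 10-second interval is the load balancer setting mentioned earlier:

```ruby
require "time"

# Timestamps copied from the two "Started GET" lines in the log excerpt.
before_stall = Time.parse("2013-02-06 17:59:32 +0000")
after_stall  = Time.parse("2013-02-06 18:13:38 +0000")

# Length of the silent period in whole seconds, then split into minutes/seconds.
gap_seconds = (after_stall - before_stall).to_i
minutes, seconds = gap_seconds.divmod(60)

# The load balancer polls every 10 seconds, so roughly this many checks
# fell into the gap.
health_check_interval = 10
delayed_checks = gap_seconds / health_check_interval

puts "gap: #{gap_seconds}s (#{minutes}m #{seconds}s), about #{delayed_checks} health checks delayed"
```

This gives 846 seconds, i.e. 14 minutes and 6 seconds, or about 84 health checks that were not logged during the stall.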
During the outage, requests from the load balancer pile up and are answered as soon as the database stops blocking.
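What could cause the database to block? What information should I look for to troubleshoot this? Any suggestions for a workaround? Any pointers are welcome and highly appreciated!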
Update:
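I saw the problem again today. The outage lasted exactly 14 minutes; I attached a debugger and got exactly the same backtrace. So using the native MySQL timeouts did not mitigate the problem.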
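As an illustration of what client-side timeouts can look like, here is a minimal sketch. It assumes the mysql2 gem (suggested by the client.c frames in the backtrace) and its `connect_timeout`, `read_timeout` and `write_timeout` options. The numeric values, the `MYSQL_TIMEOUTS` constant, the `mysql_client_options` helper and the host name are hypothetical, not the configuration actually used here:

```ruby
# Hypothetical timeout values in seconds; illustrative only.
MYSQL_TIMEOUTS = {
  connect_timeout: 5,   # give up if the TCP/MySQL handshake takes longer
  read_timeout: 10,     # give up if a reply takes longer to arrive
  write_timeout: 10     # give up if sending a request takes longer
}.freeze

# Hypothetical helper: merges the timeouts into an existing connection
# options hash (for example the one loaded from config/database.yml).
def mysql_client_options(base_options)
  base_options.merge(MYSQL_TIMEOUTS)
end

# "db.example.internal" is a placeholder host name.
options = mysql_client_options(host: "db.example.internal", username: "app")
puts options.inspect

# With the mysql2 gem installed, the hash could then be passed on as:
#   client = Mysql2::Client.new(options)
```

The same three keys can also be set per environment in database.yml, since Rails hands them through to the adapter.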
iptables -L did not show anything interesting either.
14 minutes. What could take 14 minutes? 41 would at least be close to 42, but 14, hmm...