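The ZTC nginx monitoring script intermittently fails to load the nginx test page. It happens mostly at nginx's peak load, about 2000 rps, with nginx acting as a proxy. Zabbix then reports errors such as "nginx is down", and a second later everything appears to be fine again.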
[NginxStatus] 2015-12-16 20:24:55,289 - ERROR: failed to load test page
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ztc/nginx/__init__.py", line 56, in _read_status
    u = urllib2.urlopen(url, None, 1)
  File "/usr/lib64/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 1190, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.6/urllib2.py", line 1165, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
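The traceback shows the check calling urllib2.urlopen(url, None, 1), that is, with a 1-second timeout. To tell a slow response from a real outage, the same page can be timed separately. Below is a minimal sketch in Python 3; it is not ZTC's code, and the names `probe` and `sample` as well as the injectable `opener` argument are hypothetical, added only for illustration.

```python
import time
import urllib.request


def probe(url, timeout=1.0, opener=urllib.request.urlopen):
    """Fetch url once; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        # Same call shape as in the traceback: urlopen(url, data, timeout).
        with opener(url, None, timeout) as response:
            response.read()
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return False, time.monotonic() - start, str(exc)
    return True, time.monotonic() - start, ""


def sample(url, attempts=5, timeout=1.0, opener=urllib.request.urlopen):
    """Probe several times; return (failure_count, slowest_seconds)."""
    results = [probe(url, timeout, opener) for _ in range(attempts)]
    failures = sum(1 for ok, _, _ in results if not ok)
    slowest = max(elapsed for _, elapsed, _ in results)
    return failures, slowest
```

Calling `sample(status_url, attempts=10, timeout=5.0)` during peak load, where `status_url` stands for whatever URL the check is configured with, would show whether the page is merely slower than 1 second or does not answer at all.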
Since it only happens at peak load, about 2000 rps, I suspect that some kernel parameters are causing it.
Here is the nginx config:
user nginx;
worker_processes 4;
timer_resolution 100ms;
worker_priority -15;
worker_rlimit_nofile 200000;
error_log /var/log/nginx/error.log;
pid /var/run/nginx.pid;
events {
worker_connections 65536;
use epoll;
multi_accept on;
}
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
server_tokens off;
access_log /var/log/nginx/access.log;
sendfile on;
tcp_nopush on;
tcp_nodelay on;
# keepalive_requests 120;
# keepalive_timeout 65;
gzip on;
gzip_http_version 1.0;
gzip_comp_level 2;
gzip_proxied any;
gzip_vary off;
gzip_types text/plain text/css application/x-javascript text/xml application/xml application/rss+xml application/atom+xml text/javascript application/javascript application/json text/mathml;
gzip_min_length 1000;
gzip_disable "MSIE [1-6]\.";
variables_hash_max_size 1024;
variables_hash_bucket_size 64;
server_names_hash_bucket_size 64;
types_hash_max_size 2048;
types_hash_bucket_size 64;
include /etc/nginx/conf.d/*.conf;
include /etc/nginx/sites-enabled/*;
}
Here is sysctl.conf:
net.ipv4.conf.all.secure_redirects=0
net.ipv4.conf.all.send_redirects=0
net.ipv4.tcp_max_syn_backlog=20480
net.ipv4.tcp_synack_retries=2
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
net.netfilter.nf_conntrack_max=1048576
net.nf_conntrack_max=1048576
net.ipv4.tcp_no_metrics_save=1
net.ipv4.tcp_tw_reuse=1
net.core.somaxconn=15000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_intvl=15
net.ipv4.tcp_keepalive_probes=5
net.ipv4.tcp_max_tw_buckets=720000
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_timestamps=1
net.ipv4.tcp_fin_timeout=30
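Before judging individual values, it may be worth confirming that the settings listed above are actually in effect on the running kernel. The following Python sketch parses sysctl.conf-style text and compares it with the live values under /proc/sys. The function names are hypothetical, and mapping a key to a path by replacing dots with slashes is an assumption that holds for the keys shown here but not for interface names that themselves contain dots.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines, skipping blanks and '#' or ';' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        # Collapse whitespace so "4096  87380" and "4096\t87380" compare equal.
        settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key):
    """Map a sysctl key to its /proc/sys path (assumes dots only separate levels)."""
    return "/proc/sys/" + key.replace(".", "/")


def read_live(key):
    """Read the running kernel's value for key, whitespace-normalized."""
    with open(proc_path(key)) as handle:
        return " ".join(handle.read().split())


def diff_live(settings, read=read_live):
    """Return {key: (wanted, live)} for values that differ or cannot be read."""
    mismatches = {}
    for key, wanted in settings.items():
        try:
            live = read(key)
        except OSError:
            live = None  # key missing, e.g. the conntrack module is not loaded
        if live != wanted:
            mismatches[key] = (wanted, live)
    return mismatches
```

On the server itself, `diff_live(parse_sysctl(open("/etc/sysctl.conf").read()))` returning an empty dict would mean every listed value is applied; the file path is the conventional location and is assumed here.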
And the netstat output:
netstat -an | grep -e :80 -e :443 |awk '/^tcp/ {A[$(NF)]++} END {for (I in A) {printf "%5d %s\n", A[I], I}}'
18525 TIME_WAIT
1 CLOSE_WAIT
499 FIN_WAIT1
1544 FIN_WAIT2
33311 ESTABLISHED
563 SYN_RECV
7 CLOSING
294 LAST_ACK
3 LISTEN
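For repeated sampling, the same per-state tally can be computed in Python from captured `netstat -an` text. This is a sketch of an equivalent, not the command used above: the function name is hypothetical, it assumes the usual Linux column layout (protocol, Recv-Q, Send-Q, local address, foreign address, state), and it matches the port exactly at the end of an address, whereas the grep above would also match ports such as :8080.

```python
from collections import Counter


def count_tcp_states(netstat_output, ports=(80, 443)):
    """Count TCP connection states for sockets on the given ports."""
    suffixes = tuple(":%d" % port for port in ports)
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expect: proto, Recv-Q, Send-Q, local address, foreign address, state.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if local.endswith(suffixes) or remote.endswith(suffixes):
            counts[state] += 1
    return counts
```

Feeding it the output of `netstat -an` every few seconds during peak load would show how states such as SYN_RECV and TIME_WAIT move around the moments when the check times out.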
What could be the root cause? Are these netstat figures abnormal for 2000 rps? Is there a mistake in my sysctl.conf that is causing the problem?