python - Django WSGI multithreading and connection issues

Question

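I have integrated Apache with Django using mod_wsgi, following Graham's instructions: http://blog.dscpl.com.au/2010/03/improved-wsgi-script-for-use-with.html, but I am still having problems with connections and response times. Because it happens randomly and there are no errors in the Apache log files, it is hard to understand what is going on.

My Apache is built with the prefork MPM and is configured as follows: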

<IfModule prefork.c>
StartServers       8
MinSpareServers    5
MaxSpareServers   20
ServerLimit      256
MaxClients       256
MaxRequestsPerChild 0
</IfModule>

WSGI-related configuration:

LogLevel info
LoadModule wsgi_module modules/mod_wsgi.so
WSGISocketPrefix run/wsgix
WSGIDaemonProcess somename user=apache group=apache threads=25 processes=1
WSGIScriptAlias / /wsgi-dir/script.wsgi
WSGIImportScript /wsgi-dir/script.wsgi process-group=somename application-group=%{GLOBAL}

<Directory /wsgi-dir/script.wsgi>
   Order deny,allow
   Allow from all
   WSGIProcessGroup  somename
</Directory>

In the request handler, I use the following to monitor active threads:

logger.info("Active threads: {0}".format(threading.active_count()))

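I noticed that although the configuration allows up to 25 threads, the number of active threads never goes above 4. At the same time, some clients can wait more than 1 minute for a new connection, while request processing time is about 2 seconds.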
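`threading.active_count()` reports only the threads that Python's `threading` module currently knows about, which is not necessarily the same as the number of requests being served. A more direct measure, sketched below under the assumption that every request passes through one wrapper, is to count in-flight requests explicitly. `InFlightCounter`, `in_flight`, and `handle` are hypothetical names, not part of Django or mod_wsgi.

```python
import threading
from contextlib import contextmanager


class InFlightCounter(object):
    """Thread-safe count of requests currently being handled (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = 0
        self._peak = 0

    @contextmanager
    def track(self):
        # Increment on entry and decrement on exit, even if the handler raises.
        with self._lock:
            self._current += 1
            self._peak = max(self._peak, self._current)
        try:
            yield
        finally:
            with self._lock:
                self._current -= 1

    def snapshot(self):
        # Return (current, peak) as one consistent pair.
        with self._lock:
            return self._current, self._peak


in_flight = InFlightCounter()


def handle(work):
    # Stand-in for a request handler; `work` is any callable.
    with in_flight.track():
        return work()
```

Logging `in_flight.snapshot()` at the start of each request would show how many requests really overlap, independent of how the server creates its threads.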

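If a request reaches the server, it is processed quickly, but in some cases (roughly 1 in every 100 requests) the client just waits for a connection and sometimes even times out because of Apache's limit: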

Timeout 60

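I think this kind of behavior easily goes unnoticed in the web application world, where 1 request in 100 does not matter much (the user just reloads the page), but in the services world it is a real problem.

I can't understand this: if all threads are busy serving other clients, why doesn't Django spawn another thread? If it is not about threads, what could it be? The application reloading issue that Graham wrote about?

Here are my versions: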

python26-2.6.8-3.30.amzn1.x86_64
Django-1.4.3
mod_wsgi-3.2-1.6.amzn1.x86_64
Server version: Apache/2.2.23 (Unix)
Server loaded:  APR 1.4.6, APR-Util 1.4.1
Compiled using: APR 1.4.6, APR-Util 1.4.1
Architecture:   64-bit
Server MPM:     Prefork
  threaded:     no
    forked:     yes (variable process count)

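================== First update: implementing Graham's suggestions ===============================

Graham et al.,

Thank you for your comments and suggestions. I checked the mod_wsgi version and it is 3.2 (see above). My WSGI configuration now looks like this: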

LogLevel info
LoadModule wsgi_module modules/mod_wsgi.so
WSGISocketPrefix run/wsgix
WSGIDaemonProcess somename user=apache group=apache threads=25
WSGIScriptAlias /  /wsgi-dir/script.wsgi process-group=somename application-group=%{GLOBAL}

<Directory /wsgi-dir>
   Order deny,allow
   Allow from all
</Directory>

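Launching 50 EC2 clients, each sending a couple of messages to the service at startup, was enough: one client saw a delay of 49 seconds, while all the other clients averaged 2.2 seconds with a maximum of 7 seconds.

I checked the application log file and found that, for the delayed request, the delta between "request received" and "response sent" was 0.16 seconds, while from the client's perspective the delay was 49 seconds.

That leaves us with two possibilities:

1. The client could not establish a connection for almost 49 seconds.
2. The connection was established, but the server (really the Django/WSGI internals) could not read the request quickly.

It is hard to tell whether it is #1 or #2, because on the client side I use Python's "requests" module to connect to the service. I think it is #2, because if the delay goes slightly above 64-65 seconds, Apache's send/receive timeout kicks in and I can see it in Apache's log file.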
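One way to tell the two cases apart from the client side is to time the TCP connect separately from the request/response exchange. The following is only a sketch using the Python 3 standard library instead of "requests"; `time_phases` and `http_phases` are hypothetical helper names, and the host, port, and path passed in are whatever the service actually uses.

```python
import http.client
import time


def time_phases(connect, exchange):
    """Time two phases separately: connect(), then exchange(conn)."""
    t0 = time.monotonic()
    conn = connect()
    t1 = time.monotonic()
    exchange(conn)
    t2 = time.monotonic()
    return t1 - t0, t2 - t1


def http_phases(host, port, path="/", timeout=120.0):
    """Return (connect_seconds, request_seconds) for one HTTP GET."""

    def connect():
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.connect()  # open the TCP connection only; nothing is sent yet
        return conn

    def exchange(conn):
        try:
            conn.request("GET", path)
            conn.getresponse().read()
        finally:
            conn.close()

    return time_phases(connect, exchange)
```

A large first number would point at #1 (the connection itself is slow to establish), while a large second number would point at #2.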

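Here is what I will try in order to clarify this further:

Create a simple controller like the following:

import threading
import time

from django.http import HttpResponse

# `logger` is the module's already-configured logger.
def listener(request):
    logger.info("Started, active threads: {0}".format(threading.active_count()))
    time.sleep(3)
    logger.info("Finished, active threads: {0}".format(threading.active_count()))
    return HttpResponse('OK')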

Note: the logger also logs the time.

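Create a simple statistics interface (I don't want to analyze logs on all the client EC2 instances):

def log(request):
    # .get() avoids a KeyError (HTTP 500) when a parameter is missing.
    id = request.REQUEST.get('id')
    time = request.REQUEST.get('time')
    res = request.REQUEST.get('res', '')

    if id and time:
        logger.info("STAT Instance: {0}, Processing time: {1}, Response: {2}".format(id, time, res))

    return HttpResponse('OK')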
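Once the STAT lines are in the server log, the slow clients can be pulled out with a few lines of Python. This is a rough sketch that assumes the processing time is reported in whole seconds and that each log line ends with the STAT message in exactly the format logged above; `slow_requests` is a hypothetical helper name.

```python
import re

# Matches the message produced by the STAT logger.info() call.
STAT_RE = re.compile(
    r"STAT Instance: (?P<id>\S+), "
    r"Processing time: (?P<time>\d+), "
    r"Response: (?P<res>.*)$"
)


def slow_requests(lines, threshold_seconds):
    """Yield (instance_id, seconds, response) for STAT lines at or above the threshold."""
    for line in lines:
        match = STAT_RE.search(line)
        if match is None:
            continue  # not a STAT line
        seconds = int(match.group("time"))
        if seconds >= threshold_seconds:
            yield match.group("id"), seconds, match.group("res")
```

For example, `list(slow_requests(open(path), 10))` would list every client that waited 10 seconds or more, where `path` is the application log file.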

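The client will work like this:

Send a couple of requests to the "listener" URL and calculate the processing time on the client side
Send the processing time, together with the EC2 instance ID, to the "log" URL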
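The two client steps above could be sketched in Python roughly as follows. The helper names `timed` and `stat_url` are hypothetical, the `/stat/` path and the `id`, `time`, and `res` parameter names are assumptions about how the "log" URL is mapped, and the code targets Python 3.

```python
import time
from urllib.parse import urlencode


def timed(fetch):
    """Call fetch() and return (whole_seconds_elapsed, result)."""
    start = time.monotonic()
    result = fetch()
    return int(time.monotonic() - start), result


def stat_url(url_root, instance_id, seconds, res):
    """Build the reporting URL with properly escaped query parameters."""
    query = urlencode([("id", instance_id), ("time", seconds), ("res", res)])
    return "{0}/stat/?{1}".format(url_root.rstrip("/"), query)
```

Escaping the parameters with `urlencode` matters here because the response text may contain spaces or other characters that are not valid in a query string.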

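If I can reproduce the problem with this simple approach, others will be able to reproduce it too, and I hope the Django team can take it from there.

Any other suggestions are appreciated as well. Many thanks to everyone who responded.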

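================== Second update: the suggested test ================================

I have implemented the suggested listener and can reproduce the problem, and I hope others can too. All you need is an AWS account to launch a large number of EC2 clients; 50 is usually enough, but sometimes I need to go up to 100 to see a delay.

Interestingly, in this test the number of active threads gradually increased from 1 to 8, probably because the average processing time on the server increased. So threading does work, but it is still not enough to prevent the delays.

I put the client script into the EC2 user data, as shown below. If you need quick instructions on how to create an auto-scaling group with all these clients, please let me know.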

#!/bin/bash

do_send() {

        d1=`date +%s`
        res=`python ~ec2-user/client/fetch.py ${URL_ROOT}/init/`
        res=`echo $res | tr '\n' ' ' | tr ' ' +`
        d2=`date +%s`
        delta=`expr $d2 - $d1`
        echo $ami $ins $res $delta >>$LOG
        curl -s  "${URL_ROOT}/stat/?id=$ami&time=$delta&res=$ins:$res" >/dev/null 2>&1
}

URL_ROOT=<SERVICE-ROOT_URL>
LOG=~ec2-user/log.txt

ins=`curl -s http://169.254.169.254/latest/meta-data/instance-id 2>/dev/null`
ami=`curl -s http://169.254.169.254/latest/meta-data/ami-id 2>/dev/null`
echo "Instance=[$ins]" >$LOG

# First request
do_send

# Second request
do_send

The fetch.py client looks like this:

'''
@author: ogryb
'''
import requests
import datetime
import socket

from optparse import OptionParser
usage = "usage: %prog [options] init_url\n   init_url - http://<host>/init/ server's address"
parser = OptionParser(usage=usage)
parser.add_option("-i", "--id", dest="id",
                  help="instance ID", metavar="STRING")
parser.add_option("-p", "--phost", dest="phost",
                  help="public hostname", metavar="STRING")
parser.add_option("-l", "--lhost", dest="lhost",
                  help="local hostname", metavar="STRING")
parser.add_option("-t", "--type", dest="type",
                  help="instance type", metavar="STRING")
parser.add_option("-q", "--quiet",
                  action="store_true", dest="quiet", default=False,
                  help="Quiet mode")
(opt, args) = parser.parse_args()
ip = socket.gethostbyname(socket.gethostname())
if (not opt.quiet):
    print ("=== Getting metadata:\t{0} {1}".format(datetime.datetime.utcnow(), ip))
if not opt.id:
    r = requests.get(url='http://169.254.169.254/latest/meta-data/instance-id')
    opt.id = r.text
if not opt.phost:
    r = requests.get(url='http://169.254.169.254/latest/meta-data/public-hostname')
    opt.phost = r.text
if not opt.lhost:
    r = requests.get(url='http://169.254.169.254/latest/meta-data/local-hostname')
    opt.lhost = r.text
if not opt.type:
    r = requests.get(url='http://169.254.169.254/latest/meta-data/instance-type')
    opt.type = r.text
body = "id={0}&phost={1}&lhost={2}&type={3}".format(opt.id, opt.phost, opt.lhost, opt.type)
if (not opt.quiet):
    print ("=== Start sending:\t{0} {1} {2}".format(datetime.datetime.utcnow(), ip, opt.id))
r = requests.post(url=args[0], data=body, verify=False)
if (not opt.quiet):
    print ("=== End sending:\t{0} {1} {2}".format(datetime.datetime.utcnow(), ip, opt.id))
print r.text
if (not opt.quiet):
    print "Request Body={0} url={1}".format(body,args[0])
    print "Response: {0}\n{1}".format(r.status_code, r.text)

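============ 03/19/13 - 23:45 Additional information from the error log ===

I changed the Apache log level to debug and found the following in the Apache error_log. Please let me know whether this could be the cause of the delays and what can be done about it. I read somewhere that the "KeyError" is harmless, but you never know.

One client was delayed by 41 seconds at 6:37:28. The closest event in the error log occurred at 06:37:15: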

[Wed Mar 20 06:37:15 2013] [info] mod_wsgi (pid=27005): Initializing Python.
[Wed Mar 20 06:37:15 2013] [info] mod_wsgi (pid=27005): Attach interpreter ''.

The full error log follows:

[Wed Mar 20 06:29:45 2013] [info] Server built: Oct 21 2012 20:35:32
[Wed Mar 20 06:29:45 2013] [debug] prefork.c(1023): AcceptMutex: sysvsem (default: sysvsem)
[Wed Mar 20 06:29:45 2013] [info] mod_wsgi (pid=26891): Attach interpreter ''.
[Wed Mar 20 06:29:45 2013] [info] mod_wsgi (pid=26892): Attach interpreter ''.
[Wed Mar 20 06:29:45 2013] [info] mod_wsgi (pid=26893): Attach interpreter ''.
[Wed Mar 20 06:29:45 2013] [info] mod_wsgi (pid=26895): Attach interpreter ''.
[Wed Mar 20 06:29:45 2013] [info] mod_wsgi (pid=26894): Attach interpreter ''.
[Wed Mar 20 06:37:15 2013] [debug] proxy_util.c(1820): proxy: grabbed scoreboard slot 0 in child 27005 for worker proxy:reverse
[Wed Mar 20 06:37:15 2013] [debug] proxy_util.c(1839): proxy: worker proxy:reverse already initialized
[Wed Mar 20 06:37:15 2013] [debug] proxy_util.c(1936): proxy: initialized single connection worker 0 in child 27005 for (*)
[Wed Mar 20 06:37:15 2013] [info] mod_wsgi (pid=27005): Initializing Python.
[Wed Mar 20 06:37:15 2013] [info] mod_wsgi (pid=27005): Attach interpreter ''.
[Wed Mar 20 06:38:10 2013] [debug] proxy_util.c(1820): proxy: grabbed scoreboard slot 0 in child 27006 for worker proxy:reverse
[Wed Mar 20 06:38:10 2013] [debug] proxy_util.c(1839): proxy: worker proxy:reverse already initialized
[Wed Mar 20 06:38:10 2013] [debug] proxy_util.c(1936): proxy: initialized single connection worker 0 in child 27006 for (*)
[Wed Mar 20 06:38:10 2013] [info] mod_wsgi (pid=27006): Initializing Python.
[Wed Mar 20 06:38:10 2013] [info] mod_wsgi (pid=27006): Attach interpreter ''.
[Wed Mar 20 06:38:11 2013] [info] mod_wsgi (pid=26874): Destroying interpreters.
[Wed Mar 20 06:38:11 2013] [info] mod_wsgi (pid=26874): Cleanup interpreter ''.
[Wed Mar 20 06:38:11 2013] [info] mod_wsgi (pid=26874): Terminating Python.
[Wed Mar 20 06:38:11 2013] [error] Exception KeyError: KeyError(140627014572000,) in <module 'threading' from '/usr/lib64/python2.6/threading.pyc'> ignored
[Wed Mar 20 06:38:11 2013] [info] mod_wsgi (pid=26874): Python has shutdown.
[Wed Mar 20 06:38:44 2013] [debug] proxy_util.c(1820): proxy: grabbed scoreboard slot 0 in child 27007 for worker proxy:reverse
[Wed Mar 20 06:38:44 2013] [debug] proxy_util.c(1839): proxy: worker proxy:reverse already initialized
[Wed Mar 20 06:38:44 2013] [debug] proxy_util.c(1936): proxy: initialized single connection worker 0 in child 27007 for (*)
[Wed Mar 20 06:38:44 2013] [info] mod_wsgi (pid=27007): Initializing Python.
[Wed Mar 20 06:38:44 2013] [info] mod_wsgi (pid=27007): Attach interpreter ''.
[Wed Mar 20 06:38:45 2013] [info] mod_wsgi (pid=26880): Destroying interpreters.
[Wed Mar 20 06:38:45 2013] [info] mod_wsgi (pid=26880): Cleanup interpreter ''.
[Wed Mar 20 06:38:45 2013] [info] mod_wsgi (pid=26880): Terminating Python.
[Wed Mar 20 06:38:45 2013] [error] Exception KeyError: KeyError(140627014572000,) in <module 'threading' from '/usr/lib64/python2.6/threading.pyc'> ignored
[Wed Mar 20 06:38:45 2013] [info] mod_wsgi (pid=26880): Python has shutdown.
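Matching a delayed request against the error log by hand gets tedious, so here is a small sketch that lists the log lines within a given number of seconds of a point in time. It assumes the Apache 2.2 error_log timestamp format shown above and an English/C locale for the day and month names; `parse_stamp` and `nearest_events` are hypothetical helper names.

```python
import re
from datetime import datetime

# Leading "[" is optional so that lines with a clipped bracket still parse.
STAMP_RE = re.compile(r"^\[?(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4})\]")
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_stamp(line):
    """Return the datetime at the start of an error_log line, or None."""
    match = STAMP_RE.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), STAMP_FORMAT)


def nearest_events(lines, when, window_seconds):
    """Return the lines whose timestamp is within window_seconds of `when`."""
    hits = []
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is not None and abs((stamp - when).total_seconds()) <= window_seconds:
            hits.append(line)
    return hits
```

Calling `nearest_events(lines, datetime(2013, 3, 20, 6, 37, 28), 30)` on the log above would return only the lines stamped 06:37:15, provided the client and server clocks agree.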

score 0 · Accepted Answer

The configuration:

<Directory /wsgi-dir/script.wsgi>
   Order deny,allow
   Allow from all
   WSGIProcessGroup  somename
</Directory>

should be:

<Directory /wsgi-dir>
   Order deny,allow
   Allow from all
   WSGIProcessGroup  somename
   WSGIApplicationGroup %{GLOBAL}
</Directory>

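as a start.

With what you had, your WSGI application would be running in embedded mode, not daemon mode, when handling requests. That is because the path given to the Directory directive was wrong: it was a file path, not the directory it should be. You cannot use a file path as you did. You should verify where it was running using: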

http://code.google.com/p/modwsgi/wiki/CheckingYourInstallation#Embedded_Or_Daemon_Mode
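As a rough sketch of that kind of check, a throwaway WSGI script can report the `mod_wsgi.process_group` and `mod_wsgi.application_group` values that mod_wsgi places in the WSGI environ. An empty process group means embedded mode, and an empty application group corresponds to %{GLOBAL}. The function name `describe_mode` is a hypothetical helper, and the script is meant to temporarily stand in for the real one, not to be the wiki's exact code.

```python
def describe_mode(environ):
    """Return a one-line summary of where mod_wsgi is running this request."""
    process_group = environ.get("mod_wsgi.process_group", "")
    application_group = environ.get("mod_wsgi.application_group", "")
    # An empty process group means the request is handled in embedded mode.
    mode = "daemon" if process_group else "embedded"
    return "mode={0} process_group={1!r} application_group={2!r}".format(
        mode, process_group, application_group
    )


def application(environ, start_response):
    # Minimal WSGI app: report the mode as plain text.
    body = describe_mode(environ).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

With the corrected configuration, a request should report `mode=daemon` with the process group `somename`.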

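At the same time, WSGIImportScript would have been loading a redundant copy into the daemon mode process, and that copy would never be used.

Even after fixing the path for the directive, you would still be loading that redundant copy in a different sub interpreter from the one that handles requests. You need the WSGIApplicationGroup given above to make sure they are in the same sub interpreter (application group).

If using mod_wsgi 3.0+, you would be better off dropping the WSGIImportScript directive and using instead: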

WSGIScriptAlias / /wsgi-dir/script.wsgi process-group=somename application-group=%{GLOBAL}

<Directory /wsgi-dir>
   Order deny,allow
   Allow from all
</Directory>

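Specifying both the process group and the application group with WSGIScriptAlias eliminates the need for separate WSGIProcessGroup and WSGIApplicationGroup directives. Specifying both also has the side effect of preloading the script, which replaces what WSGIImportScript was doing, and that is why WSGIImportScript can be removed.

As to the performance problem of why requests are taking so long, install and try New Relic to drill down into where the problem is.

For general Apache configuration, I would also suggest you read through the slides from the PyCon US talk I just gave:

http://www.slideshare.net/GrahamDumpleton/pycon-us-2013-making-apache-suck-less-for-hosting-python-web-applications

The video should be up by the end of this week.

My PyCon talk from last year should also be interesting.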

http://lanyrd.com/2012/pycon/spcdg/

score 0 · Accepted Answer

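Sorry for the late answer and for not providing many details, but this is not a Django or Apache problem. It is a problem with the environment it runs in. I know exactly what was wrong with the "environment" but cannot disclose more details because of an NDA. I hope you understand what I mean by "environment".

Another hint: when I assumed in the original post that it was #2, I was wrong; it was actually #1.

I think knowing that this is not a Django or Apache problem is a huge help for everyone still researching this issue. I don't care how many downvotes or upvotes this answer gets; I just want to help, because I know how much time it takes to research all of this.

Thanks.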
