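I am trying to build an API that uses a PyTorch model. However, as soon as I increase WEB_CONCURRENCY above 1, it creates more threads than expected and becomes much slower, even when I send only a single request.
Example code: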
api.sh
export WEB_CONCURRENCY=2
python api.py
api.py
import os

import uvicorn
from starlette.applications import Starlette
from starlette.responses import UJSONResponse
from starlette.middleware.gzip import GZipMiddleware
from mymodel import Model

model = Model()  # constructed at import time

app = Starlette(debug=False)
app.add_middleware(GZipMiddleware, minimum_size=1000)


@app.route('/process', methods=['GET', 'POST', 'HEAD'])
async def add_styles(request):
    if request.method == 'GET':
        params = request.query_params
    elif request.method == 'POST':
        params = await request.json()
    elif request.method == 'HEAD':
        # response_header is not defined in this simplified snippet
        return UJSONResponse([], headers=response_header)
    print('===Request body===')
    print(params)
    # This is very simplified. Inside the model call many things happen,
    # including file reading/writing and spawning processes with `popen`
    # that do even more processing. But I don't think that should be an
    # issue here.
    model_output = model(params.get('data', []))
    return model_output


if __name__ == '__main__':
    uvicorn.run('api:app', host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))
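One way to check how many processes actually construct the model would be to log the process ID each time it is loaded. The following is only a minimal sketch: `load_counted` and its `factory` argument are hypothetical names introduced here for illustration, not part of the real code.

```python
import os
import sys


def load_counted(factory, log=sys.stderr):
    """Call factory() and record which process did so.

    `factory` is any zero-argument callable that builds the model
    (hypothetical helper, not part of the original api.py).
    """
    # One log line per load; distinct PIDs mean distinct processes,
    # each holding its own copy of the model.
    print(
        f"[pid {os.getpid()}, parent {os.getppid()}] loading model",
        file=log,
        flush=True,
    )
    return factory()
```

In api.py this would replace the direct construction, for example `model = load_counted(Model)`. Counting the distinct PIDs in the log then shows how many copies of the model exist.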
With WEB_CONCURRENCY=1 in api.sh, nvidia-smi shows only 1 python process while it is running, and the model uses 1.2GB of VRAM. A request takes about 0.7 seconds.
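With WEB_CONCURRENCY=2 in api.sh, nvidia-smi shows more than 8 python processes, which together use more than ~8GB of VRAM. If I am lucky and do not hit an out-of-memory error, a single request can take up to 3 seconds.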
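The process count and total VRAM reported above can also be collected programmatically rather than read off the screen. This is a sketch under assumptions: it relies on the `--query-compute-apps` CSV output of nvidia-smi with the `nounits` format option (memory in MiB), and the helper names `query_nvidia_smi`, `parse_compute_apps`, and `summarize` are made up for this example.

```python
import subprocess

# Ask nvidia-smi for one CSV row per compute process: pid, name, memory (MiB).
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def query_nvidia_smi():
    """Run nvidia-smi and return its raw CSV output (requires an NVIDIA driver)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


def parse_compute_apps(csv_text):
    """Turn the CSV text into a list of (pid, name, used_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        used = used.strip()
        if not used.isdigit():
            # Skip rows where the driver does not report a numeric value.
            continue
        rows.append((int(pid), name.strip(), int(used)))
    return rows


def summarize(rows):
    """Return (number of distinct PIDs, total memory in MiB)."""
    pids = {pid for pid, _, _ in rows}
    total_mib = sum(used for _, _, used in rows)
    return len(pids), total_mib
```

Calling `summarize(parse_compute_apps(query_nvidia_smi()))` once with WEB_CONCURRENCY=1 and once with WEB_CONCURRENCY=2 gives two pairs that can be compared directly against the expected 2 processes and 2.4GB.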
I am using Python 3.8.
Why doesn't PyTorch use the expected 2.4GB of VRAM with WEB_CONCURRENCY=2? And why does it slow down so much?