python - Suggestions for a daemon that accepts zip files for processing

Question

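I want to write a daemon that:

reads messages from a queue (SQS, RabbitMQ, etc.) containing the path to a zip file
updates a record in the database, something like "this job is being processed"
reads the archive's contents and inserts a row into the database with information culled from the file metadata of each file found
copies each file to S3
deletes the zip file
marks the job as "complete"
reads the next message in the queue and repeats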

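This should run as a service and be kicked off by a message that is queued when someone uploads a file via the web frontend. The uploader doesn't need to see the results immediately, but the upload should be processed in the background fairly promptly.

I'm fluent in Python, so the first thing that came to mind was writing a simple server with Twisted to handle each request and carry out the process above. However, I've never written anything like this that would run in a multi-user context. It won't be servicing hundreds of uploads per minute or hour, but it would be nice if it could handle a few at a time, within reason. I'm also not very familiar with writing multithreaded applications and dealing with issues like blocking.

How have people solved this in the past? What other approaches could I take?

Thanks in advance for any help and discussion!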

score 1 · Accepted Answer

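I've used Beanstalkd as a queueing daemon to very good effect (some near-time processing and image resizing, over 2 million jobs so far in the last few weeks). Throw a message onto the queue with the zip file name (perhaps from a specific directory) [I serialize a command and its parameters in JSON]. When you reserve the message in your worker client, no one else can get it unless you let it time out, at which point it goes back onto the queue to be picked up again.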
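To illustrate the JSON message idea, here is a minimal sketch using only the standard library. The function names, the "process_zip" command and the example path are all made up for illustration; the resulting bytes would be handed to whatever queue client you use.

```python
import json


def encode_job(command, **args):
    """Serialize a command name and its arguments into a queue message body."""
    return json.dumps({"command": command, "args": args}).encode("utf-8")


def decode_job(body):
    """Parse a message body back into a (command, args) pair."""
    message = json.loads(body.decode("utf-8"))
    return message["command"], message["args"]


# Hypothetical command name and path, for illustration only.
body = encode_job("process_zip", path="/uploads/incoming/example.zip")
command, args = decode_job(body)
```

Keeping the command name in the message means the same worker can later handle other job types without changing the queue setup.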

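The rest is the unzipping and uploading to S3, for which there are other libraries.

If you want to handle several zip files at once, run as many worker processes as you want.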
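For the unzipping part, the standard library's zipfile module is probably enough. A rough sketch of pulling per-file metadata out of an archive, assuming you only need name, size, CRC and modification time (the dict keys and the archive_metadata name are my own choice; the S3 upload is left out because it depends on the client library you pick):

```python
import io
import zipfile


def archive_metadata(zip_source):
    """Return one dict per regular file in the archive (directories are skipped)."""
    rows = []
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            rows.append({
                "name": info.filename,
                "size": info.file_size,      # uncompressed size in bytes
                "crc": info.CRC,
                "modified": info.date_time,  # (year, month, day, hour, min, sec)
            })
    return rows


# Build a tiny archive in memory so the example runs without any files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "hello")
    archive.writestr("docs/b.txt", "hi")
buffer.seek(0)

rows = archive_metadata(buffer)
```

Each dict in rows maps naturally onto one database row per file.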

score 1 · Accepted Answer

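I would avoid doing anything multithreaded and instead use the queue and the database to synchronize as many worker processes as you care to start up.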
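One way to let the database do the synchronizing is a conditional UPDATE, so that only one worker can flip a job from pending to processing. This is a sketch with sqlite3 and an assumed jobs table (the table layout, column names and status values are hypothetical); the same pattern should carry over to other databases.

```python
import sqlite3


def claim_next_job(conn, worker_id):
    """Try to mark one pending job as processing; return its id, or None if nothing was claimed."""
    with conn:  # commits on success, rolls back on an exception
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        updated = conn.execute(
            "UPDATE jobs SET status = 'processing', worker = ? "
            "WHERE id = ? AND status = 'pending'",
            (worker_id, row[0]),
        )
        # rowcount is 0 when another worker claimed the job in between.
        return row[0] if updated.rowcount == 1 else None


# Hypothetical schema, in memory so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, path TEXT, status TEXT, worker TEXT)"
)
conn.execute("INSERT INTO jobs (path, status) VALUES ('a.zip', 'pending')")
conn.commit()

first = claim_next_job(conn, "worker-1")
second = claim_next_job(conn, "worker-2")
```

The WHERE clause on status is what makes the claim safe: a worker that loses the race simply gets None back and can poll again.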

For this application I think twisted or any framework for creating server applications is going to be overkill.

Keep it simple. Python script starts up, checks the queue, does some work, checks the queue again. If you want a proper background daemon you might want to just make sure you detach from the terminal as described here: How do you create a daemon in Python?

Add some logging, maybe a try/except block to email out failures to you.
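Something like the following is what I have in mind, as a rough sketch. The standard library's queue.Queue stands in for the real message queue here, and run_worker, handle and the max_idle_polls parameter are names made up for the example (the idle limit is only there so the demo terminates; a real daemon would loop forever).

```python
import logging
import queue

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("zip-worker")


def run_worker(jobs, handle, poll_timeout=1.0, max_idle_polls=None):
    """Pull jobs until the queue stays empty for max_idle_polls polls (None means forever)."""
    idle = 0
    done = failed = 0
    while max_idle_polls is None or idle < max_idle_polls:
        try:
            job = jobs.get(timeout=poll_timeout)
        except queue.Empty:
            idle += 1
            continue
        idle = 0
        try:
            handle(job)
            done += 1
            log.info("finished %s", job)
        except Exception:
            # One bad job should not take the whole worker down.
            failed += 1
            log.exception("job %s failed", job)
    return done, failed


def handle(path):
    """Placeholder for the real work (unzip, insert rows, upload, delete)."""
    if path.startswith("bad"):
        raise ValueError("corrupt archive")


jobs = queue.Queue()
for path in ["a.zip", "bad.zip", "c.zip"]:
    jobs.put(path)

result = run_worker(jobs, handle, poll_timeout=0.01, max_idle_polls=1)
```

For the email part, attaching a logging.handlers.SMTPHandler to the logger at ERROR level would send out the failures without any extra code in the loop.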

score 1 · Accepted Answer

I opted to use a combination of Celery (http://ask.github.com/celery/introduction.html), RabbitMQ, and a simple Django view to handle uploads. The workflow looks like this:

Django view accepts and stores the upload
a Celery Task is dispatched to process the upload. All work is done inside the Task.
