
Suppose I have over 10,000 feeds that need to be fetched/parsed periodically. If the period is 1 hour, that is 24 x 10,000 = 240,000 fetches per day.

The current 10k limit of the labs Task Queue API would prevent one from setting up one task per fetch. So how can this be done?

Update: Re: fetching n URLs per task - given the 30-second timeout per request, at some point this will hit a ceiling. Is there any way to parallelize it, so that each task queue task initiates a bunch of async parallel fetches, each of which takes less than 30 seconds to finish but the lot together may take more than that?


3 Answers


Here's the asynchronous urlfetch API:

http://code.google.com/appengine/docs/python/urlfetch/asynchronousrequests.html

Set off a bunch of requests with a reasonable deadline (give yourself some headroom under your timeout, so that if one request times out you still have time to process the others). Then wait on each one in turn and process them as they complete.

I haven't used this technique myself in GAE, so you're on your own finding any non-obvious gotchas. Sadly there doesn't seem to be a select() style call in the API to wait for the first of several requests to complete.
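For what it's worth, here is a minimal sketch of that pattern with the urlfetch API from the linked docs; the list of feed URLs and the 20-second per-request deadline are illustrative choices, not part of the original answer:

```python
from google.appengine.api import urlfetch

def fetch_feeds(urls):
    """Kick off all fetches in parallel, then collect the results one by one."""
    rpcs = []
    for url in urls:
        # Leave headroom under the 30s request limit (20s is an assumed value).
        rpc = urlfetch.create_rpc(deadline=20)
        urlfetch.make_fetch_call(rpc, url)
        rpcs.append((url, rpc))

    results = {}
    # Wait on each RPC in turn; the fetches themselves run concurrently,
    # so total wall time is roughly that of the slowest fetch, not the sum.
    for url, rpc in rpcs:
        try:
            result = rpc.get_result()
            if result.status_code == 200:
                results[url] = result.content
        except urlfetch.DownloadError:
            # This fetch timed out or failed; the others are unaffected.
            pass
    return results
```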

Answered 2009-07-18T23:28:57.370

2 fetches per task? 3?

Answered 2009-07-18T22:10:25.627

Group the fetches, so instead of queuing up 1 fetch you queue up, say, a work unit that does 10 fetches.
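A rough sketch of that grouping, assuming a worker handler at /fetch_worker and a batch size of 10 (both made up for illustration; the import path below is the 2009-era "labs" location of the Task Queue API):

```python
from google.appengine.api.labs import taskqueue

BATCH_SIZE = 10  # assumed batch size; tune against the 30s request limit

def enqueue_fetch_batches(feed_urls):
    """Enqueue one task per batch of feeds instead of one task per feed."""
    for i in range(0, len(feed_urls), BATCH_SIZE):
        batch = feed_urls[i:i + BATCH_SIZE]
        taskqueue.add(
            url='/fetch_worker',                # hypothetical worker handler
            params={'urls': '\n'.join(batch)},  # the worker splits this back into URLs
        )
```

With 10,000 feeds this queues 1,000 tasks per run instead of 10,000, which stays under the limit mentioned in the question.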

Answered 2009-07-18T22:16:01.560