
I have a database of 10,000 adam_ids. For each adam_id I need to pull down information via an API.

My table looks like this:

`title`
- adam_id
- success (boolean)
- number_of_tries (number of times a pull-down attempt has ended with success=0)

Here is the function I want to create:

def pull_down(cursor):
    work_remains = True
    while work_remains:
        cursor.execute("""SELECT adam_id FROM title WHERE success=0 
                          AND number_of_tries < 5 ORDER BY adam_id LIMIT 1""")
        row = cursor.fetchone()
        if row:
            do_api_call(cursor, row[0])
        else:
            work_remains = False

def do_api_call(cursor, adam_id):
    # do api call
    if success:
        cursor.execute("UPDATE title SET success=1 WHERE adam_id = %s", (adam_id,))
    else:
        cursor.execute("UPDATE title SET number_of_tries = number_of_tries + 1 "
                       "WHERE adam_id = %s", (adam_id,))

How would I do the above using Python's multiprocessing instead of a single synchronous process working through the rows? I have started looking at the multiprocessing module (http://docs.python.org/library/multiprocessing.html), but so far it seems hard for me to digest.


1 Answer


If the main part of the work is the API call, since it involves an external resource, then that is the only part you really want to run in parallel. The database calls are likely very fast. So you could try this:

  1. Fetch the adam_id values in one batch query
  2. Hand the ids to a process pool that performs the API calls
  3. Collect the results and commit them to the database

Here is a rough pseudocode example to show the logical flow:

from multiprocessing import Pool

def pull_down(cursor):
    # get all the data in one query
    count = cursor.execute("""SELECT adam_id FROM title WHERE success=0 
                      AND number_of_tries < 5 ORDER BY adam_id""")
    if count:
        # Step #1
        adam_id_list = [row[0] for row in cursor.fetchall()]

        # Step #2
        pool = Pool(4)
        results = pool.map(do_api_call, adam_id_list)
        pool.close()
        pool.join()

        # Step #3
        update_db(results)

def do_api_call(adam_id):
    # do api call
    success = call_api_with_id(adam_id)
    return (adam_id, success)

def update_db(results):
    # loop over results and build batch queries for the success
    # or failed items

    # (obviously this split up could be optimized)
    succeeded = [result[0] for result in results if result[1]]
    failed = [result[0] for result in results if not result[1]]

    submit_success(succeeded)
    submit_failed(failed)
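
submit_success and submit_failed are left undefined in the pseudocode above. A minimal sketch of what they might look like, assuming a DB-API driver with the %s paramstyle (e.g. MySQLdb) and a module-level connection named conn (both assumptions, not part of the original answer), could use executemany for the batch updates:

def submit_success(succeeded):
    # mark every id that succeeded in one batch UPDATE
    cursor = conn.cursor()
    cursor.executemany("UPDATE title SET success=1 WHERE adam_id = %s",
                       [(adam_id,) for adam_id in succeeded])
    conn.commit()

def submit_failed(failed):
    # bump the retry counter for every id that failed
    cursor = conn.cursor()
    cursor.executemany("UPDATE title SET number_of_tries = number_of_tries + 1 "
                       "WHERE adam_id = %s",
                       [(adam_id,) for adam_id in failed])
    conn.commit()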

Trying to make the database calls parallel as well would only complicate the code, because you would have to correctly give each worker process its own connection, when in reality it is not the database that is slowing you down.
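
For completeness, if you did decide to let each worker write to the database itself, the usual pattern is a Pool initializer that opens one connection per worker process. This is only a rough sketch under that assumption; open_db_connection and call_api_with_id are placeholders for your driver's connect() and your API helper, and adam_id_list is the list fetched in pull_down:

from multiprocessing import Pool

_conn = None

def init_worker():
    # runs once in every worker process, giving it its own connection
    global _conn
    _conn = open_db_connection()  # placeholder for your DB driver's connect()

def do_api_call_and_update(adam_id):
    # each worker calls the API and writes the result with its own connection
    success = call_api_with_id(adam_id)
    cursor = _conn.cursor()
    if success:
        cursor.execute("UPDATE title SET success=1 WHERE adam_id = %s", (adam_id,))
    else:
        cursor.execute("UPDATE title SET number_of_tries = number_of_tries + 1 "
                       "WHERE adam_id = %s", (adam_id,))
    _conn.commit()

pool = Pool(4, initializer=init_worker)
pool.map(do_api_call_and_update, adam_id_list)
pool.close()
pool.join()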

Answered 2012-08-24T00:20:39.783