mysql - How should I select the next item to process when running processors in parallel?

Question

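I'm asking this question without database specifics because it feels like the answer may lie in a general design pattern, and I don't necessarily need a system-specific solution (my particular system setup is described at the end of the question).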

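I have a database of companies with id, url, and processing fields; the processing field indicates whether that company is currently being processed by one of my crawlers. I run many crawlers in parallel. Each one needs to select a company to process and mark that company as processing before it starts, so that each company is being processed by only a single crawler at any given time.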

How should I structure my system to keep track of which companies are being processed?

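The challenge here is that I can't search my database for a company that isn't being processed and then update that company to mark it as being processed, because another crawler may have selected it in the meantime. This seems like a common problem when processing data in parallel, so I'm looking for a theoretical best practice.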

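I used to use MySQL for this, with the following code to keep the processors consistent with one another. However, I'm redesigning the system, and ElasticSearch will now be my main database and search server. The MySQL solution below always felt like a hack to me rather than a proper solution to this parallelization problem.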

public function select_next()
{

    // set a temp variable that allows us to retrieve id of the row that is updated during next query
    $sql = 'SET @update_id := 0';
    $Result = $this->Mysqli->query( $sql );
    if( ! $Result )
        die( "\n\n    " . $this->Mysqli->error . "\n" . $sql );

    // selects next company to be crawled, marks as crawling in the db
    $sql = "UPDATE companies
            SET
                crawling = 1,
                id = ( SELECT @update_id := id )
            WHERE crawling = 0
            ORDER BY last_crawled ASC, id ASC
            LIMIT 1";
    $Result = $this->Mysqli->query( $sql );
    if( ! $Result )
        die( "\n\n    " . $this->Mysqli->error . "\n" . $sql );

    // this query returned at least one result and there are companies to be crawled
    if( $this->Mysqli->affected_rows > 0 )
    {

        // gets the id of the row that was just updated in the previous query
        $sql = 'SELECT @update_id AS id';
        $Result = $this->Mysqli->query( $sql );
        if( ! $Result )
            die( "\n\n    " . $this->Mysqli->error . "\n" . $sql );

        // set company id
        $this->id = $Result->fetch_object()->id;

    }

}

score 1 · Accepted Answer

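One approach often used for this kind of problem is sharding. You define a deterministic function that assigns each row in the database to a crawler. In your case, such a function can simply be the company id modulo the number of crawlers. Each crawler can then sequentially process the companies that belong to its own shard, which guarantees that no company is ever processed by two crawlers at the same time.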
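As a rough illustration of the id-modulo idea, here is a minimal PHP sketch. The function names `shard_for` and `ids_for_crawler`, the sample ids, and the crawler count of 3 are all made up for this example, and it assumes non-negative integer ids and a fixed, known number of crawlers.

```php
<?php
// Hypothetical helper: maps a company id to the index of the crawler that owns it.
// Assumes non-negative integer ids.
function shard_for(int $companyId, int $crawlerCount): int
{
    if ($crawlerCount < 1) {
        throw new InvalidArgumentException('crawlerCount must be at least 1');
    }
    return $companyId % $crawlerCount;
}

// Hypothetical helper: keeps only the ids that belong to one crawler's shard.
function ids_for_crawler(array $companyIds, int $crawlerIndex, int $crawlerCount): array
{
    return array_values(array_filter(
        $companyIds,
        fn (int $id): bool => shard_for($id, $crawlerCount) === $crawlerIndex
    ));
}

// Made-up sample data: seven company ids split across three crawlers.
$companyIds = [1, 2, 3, 4, 5, 6, 7];
$crawlerCount = 3;
for ($i = 0; $i < $crawlerCount; $i++) {
    echo "crawler $i: " . implode(', ', ids_for_crawler($companyIds, $i, $crawlerCount)) . "\n";
}
```

Because every id maps to exactly one crawler index, the shards never overlap. Against MySQL, the same filter could probably be pushed into the query itself, for example with a condition such as `WHERE id % 3 = 0` for crawler 0 of 3.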

This approach is used, for example, by the Reduce part of MapReduce.

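One advantage is that it needs no transactions or locking, which are hard to implement and are often a bottleneck, especially in distributed environments. The disadvantage is that work can be distributed unevenly among the crawlers, in which case some crawlers sit idle while others are still processing.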
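To make the uneven-distribution caveat concrete, the following PHP sketch counts how many ids fall into each shard under an id-modulo assignment. The helper name `shard_sizes` and the data set (only even ids, two crawlers) are invented for illustration. Real imbalance can also come from some companies simply taking longer to crawl, which this sketch does not model.

```php
<?php
// Hypothetical helper: counts how many ids land in each shard under id % crawlerCount.
// Assumes non-negative integer ids.
function shard_sizes(array $companyIds, int $crawlerCount): array
{
    $sizes = array_fill(0, $crawlerCount, 0);
    foreach ($companyIds as $id) {
        $sizes[$id % $crawlerCount]++;
    }
    return $sizes;
}

// Made-up data: only even ids exist (2, 4, ..., 20), for instance after deletions.
$evenIds = range(2, 20, 2);

// With two crawlers, crawler 0 owns all ten ids and crawler 1 owns none.
print_r(shard_sizes($evenIds, 2));
```

In a case like this, one crawler would do all the work while the other stays idle, which is the trade-off for avoiding locks.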
