php - Scraping web pages with PHP and Goutte by launching a pthread for each link to follow

Question

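I have been trying to implement a scraper with PHP and Goutte. Everything works fine if I use a single thread and scrape everything sequentially. To speed up the process, I would like to:

Load the initial page, which contains several links to the pages that need to be scraped, plus the links used for pagination.
Launch a separate pthread for each link that needs to be scraped, so that the scraping can happen in parallel.
Once all the pthreads for that page have been launched from the main thread, have the main thread navigate to the next page and repeat the process for all the links on that page.

Of course, I am planning a mechanism to make sure that no more than a certain number of threads run at any one time.

In any case, my current problem is about giving each thread the Goutte client and its link.

Apparently the Goutte client is not serializable, so it cannot be passed as-is to the thread constructor and then cloned so that each thread has its own Goutte client instance.

When I try to assign a clone of the original client inside the thread, I get the following error:

Fatal error: Uncaught exception 'Exception' with message 'Serialization of 'Closure' is not allowed' in D:\users\Oriol\workspace\TravellScrapper\pagescrapers\baseScraper.php on line 15

This is the code of the Thread class, which tries to clone the Goutte client in its constructor.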

class baseScrapper extends Thread{

  public function __construct($client,  $link){
    $this->client = new Client();
    $this->client = clone $client;
    $this->link = $link;
    threadThrottle::addThread();
  }

  public function run(){
    $this->crowler = $this->client->click($this->link);
  }

  public function __destruct(){
    threadThrottle::removeThread();
  }
}

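Any suggestions on how to achieve this? I need a copy of the Goutte client in each thread, so that it carries all the session information and I can click on the links.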

score 2 · Accepted Answer

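It sounds like the reason for using pthreads is to perform the data communication asynchronously...

Goutte essentially adapts Guzzle to Symfony\Component\BrowserKit. In doing so, it abstracts away the library's asynchronous capabilities.

You can use GuzzleHttp\Pool to issue multiple concurrent requests.

(Example copied and pasted from the documentation)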

use GuzzleHttp\Pool;
use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;

$client = new Client();

$requests = function ($total) {
    $uri = 'http://127.0.0.1:8126/guzzle-server/perf';
    for ($i = 0; $i < $total; $i++) {
        yield new Request('GET', $uri);
    }
};

$pool = new Pool($client, $requests(100), [
    'concurrency' => 5,
    'fulfilled' => function ($response, $index) {
        // this is delivered each successful response
    },
    'rejected' => function ($reason, $index) {
        // this is delivered each failed request
    },
]);

// Initiate the transfers and create a promise
$promise = $pool->promise();

// Force the pool of requests to complete.
$promise->wait();

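Goutte is a great piece of software. However, for your particular use case, asynchronous data communication, you would be better off using the underlying libraries (Guzzle, DomCrawler, or even \DomDocument) on their own.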
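
For the parsing half of that suggestion, here is a minimal sketch using only \DOMDocument. It assumes all you need from each page is the href of every anchor; the function name extractLinks and the sample markup are hypothetical and not part of Goutte or Guzzle.

```php
<?php
// Hypothetical helper: return the href of every <a> element in an HTML string.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }

    return $links;
}

// Illustrative markup only; in a scraper this would be the body of a fetched response.
$html = '<html><body><a href="/item/1">One</a><a href="/item/2">Two</a><a>no href</a></body></html>';
$links = extractLinks($html);
```

Inside a pool's 'fulfilled' callback you could pass (string) $response->getBody() to a helper like this to collect the links to follow next. The hrefs come back exactly as written in the page, so relative ones still have to be resolved against the page URL.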

score 1 · Accepted Answer

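Using a more recent version of pthreads will make the error go away; closure support is included in the latest versions.

A thread-per-request model is not ideal, and it sounds like threading itself is not ideal here; if you just want to download a bunch of stuff at the same time, use non-blocking I/O (nbio).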
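
The answer does not say which non-blocking API it has in mind. As one possible illustration, here is a rough sketch built on PHP's curl_multi functions, assuming the curl extension is available. The function name fetchAll, the 30-second timeout, and the default concurrency of 5 are arbitrary choices made for this example.

```php
<?php
// Hypothetical helper: download every URL with at most $concurrency transfers in flight.
// Returns [url => body], with null for transfers that failed. Duplicate URLs share one entry.
function fetchAll(array $urls, int $concurrency = 5): array
{
    $concurrency = max(1, $concurrency);
    $queue = array_values($urls);
    $results = [];
    $inFlight = 0;
    $multi = curl_multi_init();

    while ($queue || $inFlight > 0) {
        // Top up the window so no more than $concurrency handles are attached.
        while ($queue && $inFlight < $concurrency) {
            $url = array_shift($queue);
            $handle = curl_init($url);
            curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($handle, CURLOPT_TIMEOUT, 30);
            curl_setopt($handle, CURLOPT_PRIVATE, $url); // remember which URL this handle belongs to
            curl_multi_add_handle($multi, $handle);
            $inFlight++;
        }

        // Drive all transfers without blocking, then wait until a socket is ready.
        curl_multi_exec($multi, $stillRunning);
        if ($stillRunning > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(10000); // select may fail when there is nothing to wait on yet; avoid a busy loop
        }

        // Collect finished transfers and free their slots.
        while ($info = curl_multi_info_read($multi)) {
            $handle = $info['handle'];
            $url = curl_getinfo($handle, CURLINFO_PRIVATE);
            $results[$url] = $info['result'] === CURLE_OK
                ? curl_multi_getcontent($handle)
                : null;
            curl_multi_remove_handle($multi, $handle);
            $inFlight--;
        }
    }

    curl_multi_close($multi);
    return $results;
}
```

Calling fetchAll($links, 5) with a list of absolute URLs keeps everything in a single process, so nothing has to be serialized or cloned across threads, and the concurrency cap takes the place of a separate thread throttle.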

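As others have mentioned, a bot that scans in the way you are suggesting should rather quickly get banned from whatever site you intend to scrape.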
