php - Using Goutte with Symfony2 in a controller

Question

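I'm trying to scrape a page. I'm not very familiar with PHP frameworks, so I've been trying to learn Symfony2. I have it up and running, and now I'm trying to use Goutte. It's installed in the vendor folder, and I have a bundle for my scraping project.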

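The question is: is it good practice to do the scraping from a Controller, and if so, how? I've been searching and can't figure out how to use Goutte from the bundle, since it's buried so deep in the file structure.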

<?php

namespace ontf\scraperBundle\Controller;

use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Goutte\Client;

class ThingController extends Controller
{
  public function somethingAction($something)
  {

    $client = new Client();
    $crawler = $client->request('GET', 'http://www.symfony.com/blog/');
    echo $crawler->text();


    return $this->render('scraperBundle:Thing:index.html.twig');

    // return $this->render('scraperBundle:Thing:index.html.twig', array(
    //     'something' => $something
    //     ));
  }

}

score 4 · Accepted Answer

I'm not sure I've ever heard of "good practices" for scraping, but you can find some in the book PHP Architect's Guide to Web Scraping with PHP.

These are some guidelines I've used in my own projects:

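Scraping is a slow process, so consider delegating the task to a background process.
A background process usually runs either as a cron job that executes a CLI application or as a worker that runs constantly.
Use a process control system to manage your workers. Take a look at supervisord.
Save every scraped file (the "raw" version) and log every error. This will let you detect problems. Use Rackspace Cloud Files or AWS S3 to archive these files.
Use the Symfony2 Console tool to create the commands that run your scraper. You can keep the commands in your bundle, under the Command directory.
Run your Symfony2 commands with the following flags to avoid running out of memory: php app/console scraper:run example.com --env=prod --no-debug where app/console is where the Symfony2 console application lives, scraper:run is the name of your command, example.com is an argument indicating the page you want to scrape, and --env=prod --no-debug are the flags you should use when running in production. See the code below for an example.
Inject the Goutte Client into your command like this: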

Ontf/ScraperBundle/Resources/services.yml

services:
    goutte_client:
        class: Goutte\Client

    scraperCommand:
        class:  Ontf\ScraperBundle\Command\ScraperCommand
        arguments: ["@goutte_client"]
        tags:
            - { name: console.command }

Your command should look something like this:

<?php
// Ontf/ScraperBundle/Command/ScraperCommand.php
namespace Ontf\ScraperBundle\Command;

use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputArgument;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Input\InputOption;
use Symfony\Component\Console\Output\OutputInterface;
use Goutte\Client;

class ScraperCommand extends Command
{
    private $client;

    public function __construct(Client $client)
    {
        $this->client = $client;
        parent::__construct();
    }

    protected function configure()
    {
        $this
            ->setName('scraper:run')
            ->setDescription('Run Goutte Scraper.')
            ->addArgument(
                'url',
                InputArgument::REQUIRED,
                'URL you want to scrape.'
            );
    }

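    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $url = $input->getArgument('url');
        $crawler = $this->client->request('GET', $url);
        // Write through the console output rather than echo, so the
        // command respects the console's verbosity and formatting.
        $output->writeln($crawler->text());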
    }
}
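The guideline about saving the raw version of every scraped file and logging every error could be sketched roughly as follows. This is only an illustration under assumptions: the function name archiveRawPage and the file naming scheme are hypothetical, and it writes to a local directory rather than to Rackspace Cloud Files or AWS S3.

```php
<?php
// Hypothetical helper: archiveRawPage() and its file layout are
// illustrative only; they are not part of Goutte or Symfony2.

/**
 * Save the raw body of a scraped page so it can be inspected later.
 * Returns the path written to, or null on failure (the error is logged).
 */
function archiveRawPage($archiveDir, $url, $rawBody)
{
    // Create the archive directory on first use.
    if (!is_dir($archiveDir) && !mkdir($archiveDir, 0775, true)) {
        error_log(sprintf('Could not create archive directory "%s"', $archiveDir));
        return null;
    }

    // Name the file after a hash of the URL plus a timestamp, so scrapes
    // of the same page taken at different times are kept side by side.
    $path = sprintf(
        '%s/%s-%s.html',
        rtrim($archiveDir, '/'),
        sha1($url),
        date('Ymd-His')
    );

    if (file_put_contents($path, $rawBody) === false) {
        error_log(sprintf('Could not archive "%s" to "%s"', $url, $path));
        return null;
    }

    return $path;
}
```

Inside the command's execute() method, the raw body would probably come from $this->client->getResponse()->getContent() after the request has been made. A script could then upload the archived files to remote storage separately.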

score 0 · Accepted Answer

If you want to return a response, such as HTML output, you should use a Symfony controller.

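If you only need functionality that computes something or stores it in a database, you should create a service class that represents your crawler's functionality, for example: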

use Goutte\Client;

class CrawlerService
{
    public function getText($url)
    {
        // Fetch the page and return its text content.
        $client = new Client();
        $crawler = $client->request('GET', $url);
        return $crawler->text();
    }
}

To execute it, I would use a console command.

If you want to return a response, use a controller.
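As a possible variation on the service idea, the client could be passed in through the constructor instead of being created inside getText(), which makes the service easier to exercise without hitting the network. The sketch below is an assumption, not the answerer's code: the constructor parameter is expected to be a Goutte\Client but is left untyped so the example runs with a stand-in object.

```php
<?php
// Sketch only: a variant of CrawlerService that receives its client
// from outside instead of constructing it itself.

class CrawlerService
{
    private $client;

    /**
     * @param object $client Expected to be a Goutte\Client. Left untyped
     *                       here (an assumption of this sketch) so that a
     *                       stand-in object can be used in its place.
     */
    public function __construct($client)
    {
        $this->client = $client;
    }

    public function getText($url)
    {
        // request() returns a crawler for the fetched page;
        // text() returns the text content of its first node.
        $crawler = $this->client->request('GET', $url);

        return trim($crawler->text());
    }
}
```

With a service defined this way, a controller or a console command would only need to call getText() with a URL. The client itself could be supplied by the service container, for example through an arguments entry in services.yml.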
