I am trying to use PHPCrawl with Symfony2. I first installed the PHPCrawl library with Composer, then created a "DependencyInjection" folder in my bundle, in which I put a class "MyCrawler" that extends PHPCrawler. I configured it as a service. Now, when I start the crawling process, Symfony gives me the following error:

Attempted to call method "getURIContent" on class "PHPCrawlerUtils"

I don't know why, because the class exists and the method exists.
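For reference, the service definition in config.yml is essentially the following (a minimal sketch; the exact file layout may differ, but the service id "my_crawler" and the class are the ones used in the code below):

services:
    my_crawler:
        class: AppBundle\DependencyInjection\MyCrawler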
Here is my controller action:
/**
 * Crawls the site in question
 *
 * @Route("/crawl", name="blog_crawl")
 * @Template()
 */
public function crawlAction($url = 'http://urlexample.net')
{
    // Instead of creating an instance of the MyCrawler class directly, I fetch it as a service (config.yml)
    $crawl = $this->get('my_crawler');
    $crawl->setURL($url);

    // Checks the document's content-type header; only pages of type text/html are allowed
    $crawl->addContentTypeReceiveRule("#text/html#");

    // Filters the URLs found in the page - here we keep HTML pages only
    $crawl->addURLFilterRule("#(jpg|gif|png|pdf|jpeg|svg|css|js)$# i");

    $crawl->enableCookieHandling(TRUE);

    // Sets a limit to the number of bytes the crawler should receive altogether during the crawling process.
    $crawl->setTrafficLimit(0);

    // Sets a limit to the total number of requests the crawler should execute.
    $crawl->setRequestLimit(20);

    // Sets the content-size limit for content the crawler should receive from documents.
    $crawl->setContentSizeLimit(0);

    // Sets the timeout in seconds for waiting for data on an established server connection.
    $crawl->setStreamTimeout(20);

    // Sets the timeout in seconds for connection tries to hosting webservers.
    $crawl->setConnectionTimeout(20);

    $crawl->obeyRobotsTxt(TRUE);
    $crawl->setUserAgentString("Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0");
    $crawl->go();

    // At the end, after the process is finished, we print a short
    // report (see method getProcessReport() for more information)
    $report = $crawl->getProcessReport();

    echo "Summary:".'<br/>';
    echo "Links followed: ".$report->links_followed.'<br/>';
    echo "Documents received: ".$report->files_received.'<br/>';
    echo "Bytes received: ".$report->bytes_received." bytes".'<br/>';
    echo "Process runtime: ".$report->process_runtime." sec".'<br/>';
    echo "Abort reason: ".$report->abort_reason.'<br/>';

    return array(
        'varstuff' => 'something'
    );
}
Here is my service class MyCrawler in the DependencyInjection folder:
<?php

namespace AppBundle\DependencyInjection;

use PHPCrawler;
use PHPCrawlerDocumentInfo;

/**
 * Description of MyCrawler
 *
 * @author Norman
 */
class MyCrawler extends PHPCrawler
{
    /**
     * Retrieves the information of a URL
     *
     * @param PHPCrawlerDocumentInfo $pageInfo
     */
    public function handleDocumentInfo(PHPCrawlerDocumentInfo $pageInfo)
    {
        $page_url = $pageInfo->url;
        $source = $pageInfo->source;
        $status = $pageInfo->http_status_code;

        // If the page is "OK" (no error code) and not empty, display its URL
        if ($status == 200 && $source != '') {
            echo $page_url.'<br/>';
            flush();
        }
    }
}
I also searched the PHPCrawl forum on SourceForge for help, but without success so far... I should add that I am using PHPCrawl 0.83 from here:
https://github.com/mmerian/phpcrawl/
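For completeness, the Composer setup is roughly the following (a sketch only; I am assuming the fork is published on Packagist as "mmerian/phpcrawl" and the version constraint is just illustrative; if the package is not on Packagist, a "repositories" entry of type "vcs" pointing at the GitHub URL above is needed as well):

{
    "require": {
        "mmerian/phpcrawl": "dev-master"
    }
}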
Here is the class where the problem seems to occur:
<?php

/**
 * Class for parsing robots.txt-files.
 *
 * @package phpcrawl
 * @internal
 */
class PHPCrawlerRobotsTxtParser
{
    public function __construct()
    {
        // Init PageRequest-class
        if (!class_exists("PHPCrawlerHTTPRequest")) include_once($classpath."/PHPCrawlerHTTPRequest.class.php");

        $this->PageRequest = new PHPCrawlerHTTPRequest();
    }

    /**
     * Parses a robots.txt-file and returns regular-expression-rules corresponding to the containing "disallow"-rules
     * that are adressed to the given user-agent.
     *
     * @param PHPCrawlerURLDescriptor $BaseUrl The root-URL all rules from the robots-txt-file should relate to
     * @param string $user_agent_string The useragent all rules from the robots-txt-file should relate to
     * @param string $robots_txt_uri Optional. The location of the robots.txt-file as URI.
     *                               If not set, the default robots.txt-file for the given BaseUrl gets parsed.
     *
     * @return array Numeric array containing regular-expressions for each "disallow"-rule defined in the robots.txt-file
     *               that's adressed to the given user-agent.
     */
    public function parseRobotsTxt(PHPCrawlerURLDescriptor $BaseUrl, $user_agent_string, $robots_txt_uri = null)
    {
        PHPCrawlerBenchmark::start("processing_robotstxt");

        // If robots_txt_uri not given, use the default one for the given BaseUrl
        if ($robots_txt_uri === null)
            $robots_txt_uri = self::getRobotsTxtURL($BaseUrl->url_rebuild);

        // Get robots.txt-content
        $robots_txt_content = PHPCrawlerUtils::getURIContent($robots_txt_uri, $user_agent_string);

        $non_follow_reg_exps = array();

        // If content was found
        if ($robots_txt_content != null)
        {
            // Get all lines in the robots.txt-content that are adressed to our user-agent.
            $applying_lines = $this->getUserAgentLines($robots_txt_content, $user_agent_string);

            // Get valid reg-expressions for the given disallow-pathes.
            $non_follow_reg_exps = $this->buildRegExpressions($applying_lines, PHPCrawlerUtils::getRootUrl($BaseUrl->url_rebuild));
        }

        PHPCrawlerBenchmark::stop("processing_robots.txt");

        return $non_follow_reg_exps;
    }