61

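Which built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?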

10 Answers

50

Scraping generally encompasses 3 steps:

  • first, you send a GET or POST request to a specified URL
  • next, you receive the HTML that is returned as the response
  • finally, you parse out of that HTML the text you'd like to scrape.

To accomplish steps 1 and 2, below is a simple PHP class that uses cURL to fetch web pages using either GET or POST. Once you have the HTML back, you use regular expressions to accomplish step 3 by parsing out the text you'd like to scrape.

For regular expressions, my favorite tutorial site is the following: Regular Expressions Tutorial

My favorite program for working with regexes is RegexBuddy. I would advise you to try the demo of that product even if you have no intention of buying it. It is an invaluable tool and will even generate code for the regexes you write in your language of choice (including PHP).

Usage:

$curl = new Curl();
$html = $curl->get("http://www.google.com");

// now, do your regex work against $html
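To make step 3 concrete, here is a minimal sketch of the regex work using preg_match_all(). The sample markup and the link pattern are assumptions for illustration only; a real page will need a pattern written for its own HTML.

```php
<?php
// Hypothetical sample markup standing in for the $html returned by $curl->get().
$html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>';

// Group 1 captures the href value, group 2 the link text.
// PREG_SET_ORDER returns one array per match instead of one array per group.
preg_match_all('/<a\s+href="([^"]*)"[^>]*>(.*?)<\/a>/is', $html, $matches, PREG_SET_ORDER);

$links = array();
foreach ($matches as $m) {
    $links[] = array(
        'href' => $m[1],
        'text' => trim(strip_tags($m[2])), // drop any nested tags inside the link
    );
}

print_r($links);
```

The non-greedy `(.*?)` matters here: a greedy `(.*)` would run from the first opening tag to the last closing tag and return one oversized match.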

PHP Class:



<?php

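class Curl
{
    // Path of the file used to store and reload cookies between requests.
    public $cookieJar = "";

    // cURL handle for the current request; set by get() and postForm().
    public $curl = null;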

    public function __construct($cookieJarFile = 'cookies.txt') {
        $this->cookieJar = $cookieJarFile;
    }

    function setup()
    {
        // Send browser-like request headers.
        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] =  "Cache-Control: max-age=0";
        $header[] =  "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank.


        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
        curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
        curl_setopt($this->curl,CURLOPT_COOKIEJAR, $this->cookieJar); 
        curl_setopt($this->curl,CURLOPT_COOKIEFILE, $this->cookieJar);
        curl_setopt($this->curl,CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl,CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($this->curl,CURLOPT_RETURNTRANSFER, true);  
    }


    function get($url)
    { 
        $this->curl = curl_init($url);
        $this->setup();

        return $this->request();
    }

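    // Returns every match of the first capture group of $reg in $str.
    function getAll($reg, $str)
    {
        preg_match_all($reg, $str, $matches);
        return isset($matches[1]) ? $matches[1] : array();
    }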

    function postForm($url, $fields, $referer='')
    {
        $this->curl = curl_init($url);
        $this->setup();
        curl_setopt($this->curl, CURLOPT_URL, $url);
        curl_setopt($this->curl, CURLOPT_POST, 1);
        curl_setopt($this->curl, CURLOPT_REFERER, $referer);
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
        return $this->request();
    }

    function getInfo($info)
    {
        $info = ($info == 'lasturl') ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL) : curl_getinfo($this->curl, $info);
        return $info;
    }

    // Executes the prepared request; returns the response body, or false on failure.
    function request()
    {
        return curl_exec($this->curl);
    }
}

?>
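One thing the class above leaves to the caller: curl_exec() returns false when the transfer fails, so it is worth checking before running any regex on the result. The following is only a sketch of that check; the helper name fetchOrFail() and the 10-second timeout are my own assumptions, not part of the class.

```php
<?php
// Hypothetical helper (not part of the Curl class above): fetches a URL and
// reports failure explicitly instead of returning a bare false.
function fetchOrFail($url, $timeoutSeconds = 10)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);

    $body = curl_exec($ch);

    return array(
        'ok'     => $body !== false,
        'body'   => $body === false ? null : $body,
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE), // 0 if no HTTP response was received
        'error'  => $body === false ? curl_error($ch) : null,
    );
}

// Usage: only parse the body when the transfer succeeded.
$result = fetchOrFail('http://www.google.com');
if ($result['ok']) {
    echo strlen($result['body']), " bytes fetched\n";
} else {
    echo "Request failed: ", $result['error'], "\n";
}
```

Note that a 404 or 500 response still counts as a successful transfer here, so check 'status' as well if the HTTP code matters to you.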

answered 2008-09-19T16:40:07.803
15

I recommend Goutte, a simple PHP web scraper.

Example usage:

Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\Client):

use Goutte\Client;

$client = new Client();

Make requests with the request() method:

$crawler = $client->request('GET', 'http://www.symfony-project.org/');

The request() method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).

Click on links:

$link = $crawler->selectLink('Plugins')->link();
$crawler = $client->click($link);

Submit forms:

$form = $crawler->selectButton('sign in')->form();
$crawler = $client->submit($form, array('signin[username]' => 'fabien', 'signin[password]' => 'xxxxxx'));

Extract data:

$nodes = $crawler->filter('.error_list');

if ($nodes->count())
{
  die(sprintf("Authentication error: %s\n", $nodes->text()));
}

printf("Nb tasks: %d\n", $crawler->filter('#nb_tasks')->text());
answered 2012-05-26T04:08:12.887
11

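ScraperWiki is a very interesting project. It helps you build scrapers online in Python, Ruby, or PHP. I was able to get a simple attempt up and running in a few minutes.

answered 2010-09-24T04:50:43.230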
2

If you need something that is easy to maintain rather than fast to execute, it could help to use a scriptable browser, such as SimpleTest's.

answered 2008-09-19T21:49:25.080
2

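Scraping can be quite complex, depending on what you want to do. Have a read of this tutorial series on the basics of writing a scraper in PHP and see if you can get to grips with it.

You can use similar methods to automate form sign-ups, logins, even fake clicking on ads! The main limitation of using cURL, though, is that it doesn't execute JavaScript, so if you are trying to scrape a site that uses AJAX for pagination, for example, it can become a little tricky... but again, there are ways around that!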
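One common workaround (not necessarily the one meant above) is to skip the page itself and request the endpoint its JavaScript calls, which often returns JSON you can read with json_decode(). The sketch below assumes a made-up response shape; the field names page, has_more and items are hypothetical and will differ on any real site.

```php
<?php
// Hypothetical response body from an AJAX pagination endpoint. In practice
// this string would come from cURL or file_get_contents().
$json = '{"page":2,"has_more":true,"items":[{"title":"A"},{"title":"B"}]}';

// Decode into associative arrays rather than stdClass objects.
$data = json_decode($json, true);
if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
    die('Invalid JSON: ' . json_last_error_msg());
}

$titles = array();
foreach ($data['items'] as $item) {
    $titles[] = $item['title'];
}

// Work out which page to request next, if any.
$nextPage = $data['has_more'] ? $data['page'] + 1 : null;

print_r($titles);
```

You can usually find the endpoint by watching the network tab of your browser's developer tools while clicking through the pagination.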

answered 2015-01-22T17:41:33.627
1

Here is another one: a simple PHP scraper without regex.
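The linked article isn't reproduced here, so the following is not its code. It is only a sketch of one regex-free approach using PHP's built-in DOM extension (DOMDocument and DOMXPath). The sample markup is an assumption for illustration.

```php
<?php
// Hypothetical sample markup; in practice this would be the fetched page source.
$html = '<html><body><h1>Title</h1><a href="/one">One</a><a href="/two">Two</a></body></html>';

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Query the parsed tree with XPath: every <a> element that has an href attribute.
$xpath = new DOMXPath($doc);
$hrefs = array();
foreach ($xpath->query('//a[@href]') as $a) {
    $hrefs[$a->textContent] = $a->getAttribute('href');
}

print_r($hrefs);
```

Because the document is parsed into a tree first, attribute order, extra whitespace and nested tags don't break the query the way they can break a regex.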

answered 2010-06-19T13:41:44.367
0

I'd use either libcurl or Perl's LWP (libwww for Perl). Is there a libwww for PHP?

answered 2008-08-25T21:39:43.730
0

file_get_contents() can take a remote URL and give you the page source. You can then use regular expressions (the Perl-compatible preg_* functions) to grab what you need.
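As a rough sketch of that combination: the helper name pageTitle(), the example URL and the user agent string below are all assumptions, and the remote fetch is left commented out because it needs allow_url_fopen enabled and network access.

```php
<?php
// Hypothetical helper: pulls the <title> text out of a page source with a
// Perl-compatible regex. Returns null when no title is found.
function pageTitle($source)
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $source, $m)) {
        return trim(html_entity_decode($m[1], ENT_QUOTES, 'UTF-8'));
    }
    return null;
}

// Optional stream context so the request has a timeout and a user agent.
$context = stream_context_create(array(
    'http' => array('timeout' => 10, 'user_agent' => 'MyScraper/1.0'),
));

// Remote fetch (requires allow_url_fopen=On); the URL is a placeholder:
// $source = file_get_contents('http://www.example.com/', false, $context);

// Inline sample so the sketch runs without network access.
$source = '<html><head><title> Example &amp; Co </title></head></html>';

echo pageTitle($source), "\n";
```

file_get_contents() returns false on failure, so check for that before handing the result to the regex.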

Out of curiosity, what are you trying to scrape?

answered 2008-08-25T21:31:03.067
0

The scraper class from my framework:

<?php

/*
    Example:

    $site = $this->load->cls('scraper', 'http://www.anysite.com');
    $excss = $site->getExternalCSS();
    $incss = $site->getInternalCSS();
    $ids = $site->getIds();
    $classes = $site->getClasses();
    $spans = $site->getSpans(); 

    print '<pre>';
    print_r($excss);
    print_r($incss);
    print_r($ids);
    print_r($classes);
    print_r($spans);        

*/

class scraper
{
    private $url = '';

    public function __construct($url)
    {
        $this->url = file_get_contents("$url");
    }

    public function getInternalCSS()
    {
        $tmp = preg_match_all('/(style=")(.*?)(")/is', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    public function getExternalCSS()
    {
        // [^"]* keeps each match inside a single href attribute value.
        $tmp = preg_match_all('/(href=")(\w[^"]*\.css)"/i', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    public function getIds()
    {
        $tmp = preg_match_all('/(id="(\w*)")/is', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    public function getClasses()
    {
        $tmp = preg_match_all('/(class="(\w*)")/is', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    public function getSpans(){
        // Non-greedy (.*?) so several spans on one line are matched separately.
        $tmp = preg_match_all('/(<span>)(.*?)(<\/span>)/', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

}
?>
answered 2009-12-26T06:19:02.660
-2

The cURL library lets you download web pages. You should look into regular expressions for doing the scraping.

answered 2008-08-25T21:30:01.040