
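I'm trying to scrape some content from a website. I eventually figured out that it requires cookies, so I solved that with the Guzzle cookie plugin. What's strange is that I can't see the content with var_dump, but if I echo the response it displays the page, which makes me think some dynamic call is fetching the data. I'm used to working with APIs through Guzzle, but I'm not sure how I should handle this. Thanks.

If I use DomCrawler, I get an error.

Code: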

    use Symfony\Bundle\FrameworkBundle\Controller\Controller;
    use Symfony\Component\DomCrawler\Crawler;
    use Guzzle\Http\Client;
    use Guzzle\Plugin\Cookie\CookiePlugin;
    use Guzzle\Plugin\Cookie\CookieJar\ArrayCookieJar;

    $cookiePlugin = new CookiePlugin(new ArrayCookieJar());

    $url = 'http://www.myurl.com';

    // Add the cookie plugin to a client
    $client = new Client();
    $client->get();
    $client->addSubscriber($cookiePlugin);

    // Send the request with no cookies and parse the returned cookies
    $client->get($url)->send();

    // Send the request again, noticing that cookies are being sent
    $request = $client->get($url);
    $response = $request->send();

    var_dump($response);
    $crawler = new Crawler($response);

    foreach ($crawler as $domElement) {
        print $domElement->filter('a')->links();
    }

Error:

    Expecting a DOMNodeList or DOMNode instance, an array, a string, or null, but got "Guzzle\Http\Message\Response"

2 Answers


Try this:

For Guzzle 5:

$crawler = new Crawler($response->getBody()->getContents());

http://docs.guzzlephp.org/en/latest/http-messages.html#id2 http://docs.guzzlephp.org/en/latest/streams.html#creating-streams

For Guzzle 3 (pass true so that getBody returns a string rather than an EntityBody object):

$crawler = new Crawler($response->getBody(true));

http://guzzle3.readthedocs.org/http-client/response.html#response-body
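
The fix works because Crawler needs a string (or a DOM node), and the response is an object that is neither, even though echo can print it. The sketch below illustrates that difference with plain PHP only. FakeResponse and acceptsCrawlerInput are hypothetical stand-ins written for this example, not Guzzle or DomCrawler code; the check simply mirrors the types listed in the error message.

```php
<?php
// Hypothetical stand-in for an HTTP response object; not a Guzzle class.
final class FakeResponse
{
    private $body;

    public function __construct($body)
    {
        $this->body = $body;
    }

    // echo and print call this implicitly, which is why echo shows the page.
    public function __toString()
    {
        return $this->body;
    }
}

// Hypothetical check mirroring the types named in the Crawler error message.
function acceptsCrawlerInput($node)
{
    return $node instanceof DOMNodeList
        || $node instanceof DOMNode
        || is_array($node)
        || is_string($node)
        || $node === null;
}

$response = new FakeResponse('<html><body><a href="/about">About</a></body></html>');

var_dump(acceptsCrawlerInput($response));          // the object itself is rejected
var_dump(acceptsCrawlerInput((string) $response)); // the string cast is accepted
```

An object with __toString is printable but is still not a string as far as is_string is concerned, so the body has to be extracted or cast explicitly before it is handed to the Crawler.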

Update

Basic usage of Guzzle 5 with the getContents method:

include 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client();
echo $client->get('http://stackoverflow.com')->getBody()->getContents();

The rest is in the docs (including cookies).

Answered 2015-04-27T15:15:35.153

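If you instantiate your crawler as $crawler = new Crawler($response);, you will get various URI-based errors when you try to use any of the Crawler object's form-based or link-based features.

I suggest instantiating your Crawler object like this: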

$crawler = new Symfony\Component\DomCrawler\Crawler(null, $response->getEffectiveUrl());

$crawler->addContent(
    $response->getBody()->__toString(),
    $response->getHeader('Content-Type')
);

This is also how Symfony\Component\BrowserKit\Client does it in its createCrawlerFromContent method. Symfony\Component\BrowserKit\Client is used internally by Goutte.
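
To see why the base URI matters for link handling, here is a rough sketch that uses only PHP's built-in DOM extension instead of DomCrawler. resolveHref is a hypothetical helper written for this example: it assumes every href is either already absolute or root-relative, which is far less than a real resolver handles. The HTML and URLs are made up for illustration.

```php
<?php
// Hypothetical helper: leaves absolute hrefs alone and treats everything
// else as root-relative. Real resolvers handle many more cases.
function resolveHref($href, $baseUrl)
{
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href; // already absolute
    }

    $parts = parse_url($baseUrl);
    if (!isset($parts['scheme'], $parts['host'])) {
        // Without an absolute base URL a relative href cannot be resolved.
        throw new InvalidArgumentException('Base URL must be absolute.');
    }

    return $parts['scheme'] . '://' . $parts['host'] . '/' . ltrim($href, '/');
}

$html = '<html><body><a href="/questions">Questions</a>'
      . '<a href="http://example.com/tags">Tags</a></body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);

$links = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $links[] = resolveHref($anchor->getAttribute('href'), 'http://stackoverflow.com/');
}

print_r($links);
```

With an empty base URL the helper throws, which is the same kind of failure you run into when the Crawler is built without the effective URL and then asked for links or forms.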

Answered 2015-04-29T14:31:12.913