php - How to scrape website content in PHP from a website that requires a cookie login?

Question

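My problem is that the site doesn't just require a basic cookie; it asks for a session cookie with a randomly generated ID. I think this means I need to use a web browser emulator with a cookie jar?

I have tried Snoopy, Goutte, and a couple of other web browser emulators, but so far I have not been able to find a tutorial on how to receive cookies. I am getting a little desperate!

Can anyone give me an example of how to accept cookies in Snoopy or Goutte?

Thanks in advance!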

score 26 · Accepted Answer

You can do this in cURL, without needing an external "emulator".

The code below retrieves a page into a PHP variable, ready to be parsed.

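Scenario

There is a page (let's call it HOME) that opens the session. On the server side, if the site is written in PHP, this is the page (any page, actually) that calls session_start() for the first time. In other languages you need a specific page that does all the session setup. From the client side, it is the page that supplies the session ID cookie. In PHP, every sessioned page does this; in other languages the landing page does it, and all the other pages check whether the cookie is there and, if it is not, drop you back to HOME instead of creating a session.

There is a page (LOGIN) that generates the login form and adds a critical piece of information to the session: "this user is logged in". In the code below, this is the page asking for the session ID.

Finally, there are N pages where the goodies to be scraped reside.

So we want to hit HOME, then LOGIN, then the GOODIES, one after another. In PHP (and in other languages too, actually), HOME and LOGIN might well be the same page. Or all pages might share the same address, for example in a Single Page Application.

Code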

    $url            = "the url generating the session ID";
    $next_url       = "the url asking for session";

    $ch             = curl_init();
    curl_setopt($ch, CURLOPT_URL,    $url);
    // We do not authenticate, only access page to get a session going.
    // Change to False if it is not enough (you'll see that cookiefile
    // remains empty).
    curl_setopt($ch, CURLOPT_NOBODY, True);

    // You may want to change User-Agent here, too
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookiefile");
    curl_setopt($ch, CURLOPT_COOKIEJAR,  "cookiefile");

    // Just in case
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    $ret    = curl_exec($ch);

    // This page we retrieve, and scrape, with GET method
    foreach(array(
            CURLOPT_POST            => False,       // We GET...
            CURLOPT_NOBODY          => False,       // ...the body...
            CURLOPT_URL             => $next_url,   // ...of $next_url...
            CURLOPT_BINARYTRANSFER  => True,        // ...as binary...
            CURLOPT_RETURNTRANSFER  => True,        // ...into $ret...
            CURLOPT_FOLLOWLOCATION  => True,        // ...following redirections...
            CURLOPT_MAXREDIRS       => 5,           // ...reasonably...
            CURLOPT_REFERER         => $url,        // ...as if we came from $url...
            //CURLOPT_COOKIEFILE      => 'cookiefile', // Save these cookies
            //CURLOPT_COOKIEJAR       => 'cookiefile', // (already set above)
            CURLOPT_CONNECTTIMEOUT  => 30,          // Seconds
            CURLOPT_TIMEOUT         => 300,         // Seconds
            CURLOPT_LOW_SPEED_LIMIT => 16384,       // Abort if slower than 16 KiB/s...
            CURLOPT_LOW_SPEED_TIME  => 15,          // ...for 15 seconds
            ) as $option => $value)
            if (!curl_setopt($ch, $option, $value))
                    die("could not set $option to " . serialize($value));

    $ret = curl_exec($ch);
    // Done; cleanup.
    curl_close($ch);

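Implementation

First of all we have to get the login page.

We use a special User-Agent to introduce ourselves, both to be recognizable (we don't want to antagonize the webmaster) and to fool the server into sending us the browser-tailored version of the site. Ideally, we use the same User-Agent as the browser we are going to use to debug the page, plus a suffix that makes it clear to whoever checks that they are looking at an automated tool (see the comment by Halfer).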

    $ua = 'Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0 (ROBOT)';
    $cookiefile = "cookiefile";
    $url1 = "the login url generating the session ID";

    $ch             = curl_init();

    curl_setopt($ch, CURLOPT_URL,            $url1);
    curl_setopt($ch, CURLOPT_USERAGENT,      $ua);
    curl_setopt($ch, CURLOPT_COOKIEFILE,     $cookiefile);
    curl_setopt($ch, CURLOPT_COOKIEJAR,      $cookiefile);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, True);
    curl_setopt($ch, CURLOPT_NOBODY,         False);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, True);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, True);
    $ret    = curl_exec($ch);

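This retrieves the page asking for user and password. By inspecting the page, we find the fields we need (including hidden ones) and can populate them. The FORM tag tells us whether we need to go on with POST or GET.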
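Inspecting the form can also be done programmatically. The following is only a sketch, assuming the login page contains a single ordinary HTML form; `extractForm()` is a hypothetical helper name, not part of the original answer, and it only looks at `<input>` elements (no `<select>` or `<textarea>`).

```php
<?php
// Hypothetical helper: collects the method, the action and the <input>
// fields of the first <form> found in $html.
function extractForm(string $html): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $form = $dom->getElementsByTagName('form')->item(0);
    if (null === $form) {
        return ['method' => null, 'action' => null, 'fields' => []];
    }

    $fields = [];
    foreach ($form->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ('' !== $name) {
            // Hidden fields arrive with their value already filled in.
            $fields[$name] = $input->getAttribute('value');
        }
    }

    return [
        // A form without a method attribute is submitted with GET.
        'method' => strtoupper($form->getAttribute('method') ?: 'GET'),
        'action' => $form->getAttribute('action'),
        'fields' => $fields,
    ];
}
```

Calling `extractForm($ret)` on the page retrieved above would give the field names to fill in, the hidden values to send back unchanged, and the URL and method to submit with.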

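We might want to inspect the form code to adjust the following operations, so we ask cURL to return the page content as-is into $ret, body included. Sometimes setting CURLOPT_NOBODY to True is still enough to trigger session creation and cookie submission, and if so, it is faster. But CURLOPT_NOBODY ("no body") works by issuing a HEAD request instead of a GET, and sometimes the HEAD request doesn't work because the server will only react to a full GET.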
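One way to tell whether the HEAD request was enough is to look at the cookie file. This is a rough sketch, assuming the jar is in the Netscape cookie file format that libcurl writes (seven tab-separated fields per cookie, with HttpOnly cookies prefixed by `#HttpOnly_`); `cookieJarHasCookies()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns true if the cookie file written through
// CURLOPT_COOKIEJAR holds at least one cookie line.
function cookieJarHasCookies(string $path): bool
{
    if (!is_readable($path)) {
        return false;
    }
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    foreach ($lines as $line) {
        // HttpOnly cookies carry a "#HttpOnly_" prefix; any other line
        // starting with "#" is a plain comment.
        if (0 === strpos($line, '#HttpOnly_')) {
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ('#' === $line[0]) {
            continue;
        }
        // domain, subdomain flag, path, secure flag, expiry, name, value
        if (7 === count(explode("\t", $line))) {
            return true;
        }
    }
    return false;
}
```

Keep in mind that libcurl only writes the jar when the handle is closed or destroyed, so the check is meaningful after that point. If it returns false, retrying with CURLOPT_NOBODY set to False is the next thing to try.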

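Instead of retrieving the body this way, it is also possible to log in using a real Firefox and sniff the form content being posted with Firebug (or Chrome with Chrome Tools); some sites will try to populate or modify hidden fields with Javascript, so that the form actually submitted is not the one you see in the HTML code.

A webmaster who does not want their site scraped might send a hidden field containing a timestamp. A human being (not aided by a too-clever browser; there are ways to tell browsers not to be clever, at worst by changing the names of the user and password fields every time) takes at least three seconds to fill in a form. A cURL script takes zero. Of course, a delay can be simulated. It's all shadowboxing...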
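Simulating the delay can be as simple as sleeping for a random, human-looking interval before submitting. A minimal sketch; `humanDelay()` is a hypothetical name, the lower bound is the three seconds mentioned above, and the upper bound is an arbitrary assumption.

```php
<?php
// Hypothetical helper: picks a pause, in microseconds, somewhere between
// $minSeconds and $maxSeconds.
function humanDelay(float $minSeconds = 3.0, float $maxSeconds = 8.0): int
{
    return random_int(
        (int) round($minSeconds * 1000000),
        (int) round($maxSeconds * 1000000)
    );
}

// Before posting the login form one would call:
// usleep(humanDelay());
```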

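We may also want to be on the lookout for the form's appearance. A webmaster could, for example, build a form asking for name, email, and password, and then use CSS to move the "email" field to where you would expect to find the name, and vice versa. So the real form being submitted will have a "@" in the field called username, and none in the field called email. The server, which expects this, merely swaps the two fields back. A "scraper" built by hand (or a spambot) would do what seems natural and send an email address in the email field. By doing so, it betrays itself. By working through the form once with a real CSS- and JS-aware browser, sending meaningful data, and sniffing what actually gets sent, we might be able to overcome this particular obstacle. Might, because there are ways of making life difficult. As I said, shadowboxing.

Back to the case at hand: here the form contains three fields and has no Javascript overlay. We have cPASS, cUSR, and checkLOGIN with a value of 'Check login'.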

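So we prepare the form with the proper fields. Note that the form is to be sent as application/x-www-form-urlencoded, which in PHP cURL means two things:

- we must use CURLOPT_POST
- the option CURLOPT_POSTFIELDS must be a string (an array would tell cURL to submit as multipart/form-data, which might work... or might not).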

The form fields are, as the content type says, urlencoded; there is a function for that.
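The function in question is presumably `urlencode()`. As a small illustration using the three fields of this example, the sketch below encodes them one at a time and then with the built-in `http_build_query()`, which produces the same string in one call.

```php
<?php
$fields = [
    'checkLOGIN' => 'Check Login',
    'cUSR'       => 'jb007',
    'cPASS'      => 'astonmartin',
];

// Manual encoding, one field at a time.
$coded = [];
foreach ($fields as $field => $value) {
    $coded[] = urlencode($field) . '=' . urlencode($value);
}
$manual = implode('&', $coded);

// Built-in equivalent. The separator is passed explicitly so that the
// result does not depend on the arg_separator.output ini setting.
$built = http_build_query($fields, '', '&');
```

Either string can be handed to CURLOPT_POSTFIELDS.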

We read the action attribute of the form; that is the URL we must use to submit our authentication (which we must have).

So, everything being ready...

    $fields = array(
        'checkLOGIN' => 'Check Login',
        'cUSR'       => 'jb007',
        'cPASS'      => 'astonmartin',
    );
    $coded = array();
    foreach($fields as $field => $value)
        $coded[] = $field . '=' . urlencode($value);
    $string = implode('&', $coded);

    curl_setopt($ch, CURLOPT_URL,         $url1); //same URL as before, the login url generating the session ID
    curl_setopt($ch, CURLOPT_POST,        True);
    curl_setopt($ch, CURLOPT_POSTFIELDS,  $string);
    $ret    = curl_exec($ch);

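We now expect a "Hello, James - how about a nice game of chess?" page. But more than that, we expect that the session associated with the cookie saved in $cookiefile has been supplied with the critical information: "user is authenticated".

So all following page requests made with $ch and the same cookie jar will be granted access, allowing us to "scrape" pages quite easily. Just remember to set the request mode back to GET: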

    curl_setopt($ch, CURLOPT_POST,        False);

    // Start spidering
    foreach($urls as $url)
    {
        curl_setopt($ch, CURLOPT_URL, $url);
        $HTML = curl_exec($ch);
        if (False === $HTML)
        {
            // Something went wrong, check curl_error() and curl_errno().
        }
    }
    curl_close($ch);

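Inside the loop, $HTML holds the HTML code of each page.

Great, the temptation of using regexps is. Resist it you must. To cope better with ever-changing HTML, and to be sure not to turn up false positives or false negatives when the layout stays the same but the content changes (e.g. you discover that you have the weather forecasts for Nice, Tourrette-Levens and Castagniers, but never Asprémont or Gattières, and isn't that curious?), the best option is to use DOM:

Grabbing the href attribute of an A element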
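As a rough illustration of the DOM approach, here is a sketch that pulls every href out of a page with PHP's bundled DOM extension. `extractLinks()` is a hypothetical helper name, and the XPath expression is just one reasonable choice.

```php
<?php
// Hypothetical helper: returns the href of every <a> element in $html
// that has one, in document order.
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    // Tolerate sloppy markup without emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}
```

In the spidering loop this would be called as `extractLinks($HTML)`; the same pattern, with a different XPath expression, works for any other element you need.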

score 1 · Accepted Answer

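Object-oriented answer

We implement as much as possible of the previous answer in one class, called Browser, that should supply the normal navigation features.

Then we should be able to put the site-specific code, in a very simple form, in a new derived class that we call, say, FooBrowser, which performs the scraping of site Foo.

The class deriving from Browser must supply some site-specific functions, such as a path() function that tells where to store site-specific information, for example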

function path($basename) {
    return '/var/tmp/www.foo.bar/' . $basename;
}

abstract class Browser
{
    private $options = [];
    private $state   = [];
    protected $cookies;

    abstract protected function path($basename);

    public function __construct($site, $options = []) {
        $this->cookies   = $this->path('cookies');
        $this->options  = array_merge(
            [
                'site'      => $site,
                'userAgent' => 'Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0 - LeoScraper',
                'waitTime'  => 250000,
            ],
            $options
        );
        $this->state = [
            'referer' => '/',
            'url'     => '',
            'curl'    => '',
        ];
        $this->__wakeup();
    }

    /**
     * Reactivates after sleep (e.g. in session) or creation
     */
    public function __wakeup() {
        $this->state['curl'] = curl_init();
        $this->config([
            CURLOPT_USERAGENT       => $this->options['userAgent'],
            CURLOPT_ENCODING        => '',
            CURLOPT_NOBODY          => false,
            // ...retrieving the body...
            CURLOPT_BINARYTRANSFER  => true,
            // ...as binary...
            CURLOPT_RETURNTRANSFER  => true,
            // ...into $ret...
            CURLOPT_FOLLOWLOCATION  => true,
            // ...following redirections...
            CURLOPT_MAXREDIRS       => 5,
            // ...reasonably...
            CURLOPT_COOKIEFILE      => $this->cookies,
            // Save these cookies
            CURLOPT_COOKIEJAR       => $this->cookies,
            // (already set above)
            CURLOPT_CONNECTTIMEOUT  => 30,
            // Seconds
            CURLOPT_TIMEOUT         => 300,
            // Seconds
            CURLOPT_LOW_SPEED_LIMIT => 16384,
            // 16 Kb/s
            CURLOPT_LOW_SPEED_TIME  => 15,
        ]);
    }

    /**
     * Imports an options array.
     *
     * @param array $opts
     * @throws \Exception
     */
    private function config(array $opts = []) {
        foreach ($opts as $key => $value) {
            if (true !== curl_setopt($this->state['curl'], $key, $value)) {
                throw new \Exception('Could not set cURL option');
            }
        }
    }

    private function perform($url) {
        $this->state['referer'] = $this->state['url'];
        $this->state['url'] = $url;
        $this->config([
            CURLOPT_URL     => $this->options['site'] . $this->state['url'],
            CURLOPT_REFERER => $this->options['site'] . $this->state['referer'],
        ]);
        $response = curl_exec($this->state['curl']);
        // Should we ever want to randomize waitTime, do so here.
        usleep($this->options['waitTime']);

        return $response;
    }

    /**
     * Returns a configuration option.
     * @param string $key       configuration key name
     * @param string $value     value to set
     * @return mixed
     */
    protected function option($key, $value = '__DEFAULT__') {
        $curr   = $this->options[$key];
        if ('__DEFAULT__' !== $value) {
            $this->options[$key]    = $value;
        }
        return $curr;
    }

    /**
     * Performs a POST.
     *
     * @param $url
     * @param $fields
     * @return mixed
     */
    public function post($url, array $fields) {
        $this->config([
            CURLOPT_POST       => true,
            CURLOPT_POSTFIELDS => http_build_query($fields),
        ]);
        return $this->perform($url);
    }

    /**
     * Performs a GET.
     *
     * @param       $url
     * @param array $fields
     * @return mixed
     */
    public function get($url, array $fields = []) {
        $this->config([ CURLOPT_POST => false ]);
        if (empty($fields)) {
            $query = '';
        } else {
            $query = '?' . http_build_query($fields);
        }
        return $this->perform($url . $query);
    }
}

Now to scrape FooSite:

/* WWW_FOO_COM requires username and password to construct */

class WWW_FOO_COM_Browser extends Browser
{
    private $loggedIn   = false;

    public function __construct($username, $password) {
        parent::__construct('http://www.foo.bar.baz', [
            'username'  => $username,
            'password'  => $password,
            'waitTime'  => 250000,
            'userAgent' => 'FooScraper',
            'cache'     => true
        ]);
        // Open the session
        $this->get('/');
        // Navigate to the login page
        $this->get('/login.do');
    }

    /**
     * Perform login.
     */
    public function login() {
        $response = $this->post(
            '/ajax/loginPerform',
            [
                'j_un'    => $this->option('username'),
                'j_pw'    => $this->option('password'),
            ]
        );
        // TODO: verify that response is OK.
        // if (!strstr($response, "Welcome " . $this->option('username')))
        //     throw new \Exception("Bad username or password");
        $this->loggedIn = true;
        return true;
    }

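    public function scrape($entry) {
        // We could implement caching to avoid scraping the same entry
        // too often: save $data into path("entry-" . md5($entry)) and
        // check the filemtime of that file. If it is newer than time()
        // minus, say, 86400 seconds, return file_get_contents() of it
        // and leave the remote site alone.
        $data = $this->get(
            '/foobars/baz.do',
            [
                'ticker' => $entry
            ]
        );
        return $data;
    }

    // Browser declares path() abstract, so the derived class must define
    // it; this reuses the example directory shown earlier.
    protected function path($basename) {
        return '/var/tmp/www.foo.bar/' . $basename;
    }
}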
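The caching idea described in the comment inside scrape() could look roughly like this. It is only a sketch: `cachedFetch()` is a hypothetical helper, and the cache file name and the 86400-second lifetime are the ones suggested in that comment.

```php
<?php
// Hypothetical helper: returns the cached copy in $file if it is younger
// than $maxAge seconds; otherwise calls $fetch, stores its result in
// $file and returns it.
function cachedFetch(string $file, int $maxAge, callable $fetch)
{
    // PHP caches stat() results; make sure we see the current mtime.
    clearstatcache(true, $file);
    if (is_file($file) && filemtime($file) > time() - $maxAge) {
        return file_get_contents($file);
    }
    $data = $fetch();
    // curl_exec() returns false on failure; do not cache that.
    if (false !== $data) {
        file_put_contents($file, $data, LOCK_EX);
    }
    return $data;
}
```

Inside scrape(), the call would then be something like `cachedFetch($this->path('entry-' . md5($entry)), 86400, function () use ($entry) { return $this->get('/foobars/baz.do', ['ticker' => $entry]); })`, assuming the directory returned by path() exists and is writable.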

Now the actual scraping code would be:

    $scraper = new WWW_FOO_COM_Browser('lserni', 'mypassword');
    if (!$scraper->login()) {
        throw new \Exception("bad user or pass");
    }
    // www.foo.com is a ticker site, we need little info for each
    // Other examples might be much more complex.
    $entries = [
        'APPL', 'MSFT', 'XKCD'
    ];
    foreach ($entries as $entry) {
        $html = $scraper->scrape($entry);
        // Parse HTML
    }

Mandatory notice: use a suitable parser to get data from the raw HTML.
