php - How do I implement a screen scraper in PHP?

Question

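I have a user ID and a password that let my program log in to a website. After logging in, the URL changes from http://localhost/Test/loginpage.html to http://www.4wtech.com/csp/web/Employee/Login.csp.

How can I "screen scrape" the data from the second URL using PHP?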

score 4 · Accepted Answer

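You would use cURL. cURL can log in to the page, then access the newly referenced page and download the entire page.

Check out the PHP manual for curl, as well as this tutorial: How to screen-scrape with PHP and Curl.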
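To make the log-in-then-fetch idea concrete, here is a minimal sketch. It assumes the site uses an ordinary form POST and a session cookie. The helper names (`loginOptions`, `fetchAfterLogin`) and the form field names (`userid`, `password`) are made up for illustration; the real field names have to be read from the login form's HTML.

```php
<?php
// Hypothetical helper: cURL options for posting the login form.
function loginOptions(string $loginUrl, array $fields): array
{
    return [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_COOKIEFILE     => '',   // empty string enables the in-memory cookie engine
        CURLOPT_FOLLOWLOCATION => true, // follow the redirect issued after login
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
    ];
}

// Hypothetical helper: log in, then fetch a page with the same handle.
function fetchAfterLogin(string $loginUrl, array $fields, string $pageUrl): string
{
    $ch = curl_init();
    curl_setopt_array($ch, loginOptions($loginUrl, $fields));
    if (curl_exec($ch) === false) {
        throw new RuntimeException('Login request failed: ' . curl_error($ch));
    }

    // Reuse the handle so the session cookie set at login is sent again.
    curl_setopt_array($ch, [CURLOPT_URL => $pageUrl, CURLOPT_HTTPGET => true]);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('Page request failed: ' . curl_error($ch));
    }
    return $html;
}

// Usage (field names are assumptions):
// $html = fetchAfterLogin($loginUrl, ['userid' => 'me', 'password' => 'secret'], $pageUrl);
```

Keeping both requests on one handle means the cookies never have to be written to disk.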

score 3 · Accepted Answer

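I'm not quite sure I understand your question. But if you really intend to do screen scraping in PHP, I recommend the simple_html_dom parser. It is a small library that lets you use CSS selectors in PHP. For me, screen scraping has never been so easy in PHP. Here is an example: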

// Create DOM from URL or file
$html = file_get_html('http://stackoverflow.com/');

// Find all links
foreach($html->find('a') as $element) {
       echo $element->href . '<br>';
}

score 0 · Accepted Answer

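Apologies for the plug, but I have written JS_Extractor for screen scraping. It is really just a very simple extension of the DOM extension, with a few helper methods to make things a little easier, but it works very well.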

score 0 · Accepted Answer

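The SimpleTest unit testing framework has a Scriptable Browser component that can be used on its own. I usually use it for screen scraping and bots because of its ability to emulate a browser.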

score 0 · Accepted Answer

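Important!

Note that scraping is not always allowed. If you decide to scrape a page, make sure the owner of that page permits it, or you may end up doing something illegal.

Assuming you are allowed to scrape the page, apply the following steps.

HTTP request

First, you make an HTTP request to get the content of the page. There are several ways to do that.

fopen

The most basic way to send an HTTP request is to use fopen. A main advantage is that you can set how many characters are read at a time, which is useful when reading very large files. It is not the easiest thing to get right, however, and it is not recommended unless you are reading very large files and are worried about running into memory issues.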

$fp = fopen("http://www.4wtech.com/csp/web/Employee/Login.csp", "rb");
if (FALSE === $fp) {
    exit("Failed to open stream to URL");
}

$result = '';

while (!feof($fp)) {
    $result .= fread($fp, 8192);
}
fclose($fp);
echo $result;
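If memory is the concern, the loop above can also be capped so that it never collects more than a fixed number of bytes. A minimal sketch, where `readInChunks` and its default limits are hypothetical names and values chosen for illustration:

```php
<?php
// Hypothetical helper: read a stream in chunks, stopping at $maxBytes.
function readInChunks($stream, int $chunkSize = 8192, int $maxBytes = 1048576): string
{
    $data = '';
    while (!feof($stream) && strlen($data) < $maxBytes) {
        // Never ask for more than the remaining budget.
        $chunk = fread($stream, min($chunkSize, $maxBytes - strlen($data)));
        if ($chunk === false || $chunk === '') {
            break; // read error or nothing left to read
        }
        $data .= $chunk;
    }
    return $data;
}

// Usage: read at most the first 64 KiB of a page.
// $fp = fopen($url, 'rb');
// $head = readInChunks($fp, 8192, 65536);
// fclose($fp);
```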

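file_get_contents

The easiest way is to use file_get_contents. It does much the same as fopen, but gives you fewer options to choose from. A main advantage here is that it takes only one line of code.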

$result = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp');
echo $result;
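One thing the one-liner hides is that file_get_contents returns false (and raises a warning) when the request fails. A small sketch of a wrapper that turns failure into null; `fetchPage` is a hypothetical name:

```php
<?php
// Hypothetical helper: return the content, or null if it could not be read.
function fetchPage(string $url): ?string
{
    // @ silences the warning; the false return value is checked explicitly.
    $content = @file_get_contents($url);
    return $content === false ? null : $content;
}

// Usage:
// $html = fetchPage($url);
// if ($html === null) { exit('Request failed'); }
```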

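Sockets

If you need more control over the headers that are sent to the server, you can open a socket with fsockopen and write the request yourself.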

// fsockopen takes a host name only; the path goes in the request line
$fp = fsockopen("www.4wtech.com", 80, $errno, $errstr, 30);
if (!$fp) {
    $result = "$errstr ($errno)<br />\n";
} else {
    $result = '';
    $out = "GET /csp/web/Employee/Login.csp HTTP/1.1\r\n";
    $out .= "Host: www.4wtech.com\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    while (!feof($fp)) {
        $result .= fgets($fp, 128);
    }
    fclose($fp);
}
echo $result;
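Unlike the other approaches, the raw socket response contains the status line and headers as well as the body. A minimal sketch of separating them; `splitHttpResponse` is a hypothetical helper, and it does not decode chunked transfer encoding, which an HTTP/1.1 server may use:

```php
<?php
// Hypothetical helper: split a raw HTTP response into status, headers and body.
function splitHttpResponse(string $raw): array
{
    // Headers and body are separated by the first blank line.
    $parts = explode("\r\n\r\n", $raw, 2);
    $headerLines = explode("\r\n", $parts[0]);
    $statusLine = array_shift($headerLines);

    $headers = [];
    foreach ($headerLines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip malformed lines
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }

    return [
        'status'  => $statusLine,
        'headers' => $headers,
        'body'    => $parts[1] ?? '',
    ];
}

// Usage: $response = splitHttpResponse($result); echo $response['body'];
```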

Streams

Alternatively, you can use streams. Streams are similar to sockets and can be used in combination with both fopen and file_get_contents.

$opts = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"Accept-language: en\r\n" .
              "Cookie: foo=bar\r\n"
  )
);

$context = stream_context_create($opts);

$result = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp', false, $context);
echo $result;
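The same mechanism can send a POST, which is what a login form usually expects. A sketch, assuming a URL-encoded form; `postContext` is a hypothetical helper and the field names are placeholders:

```php
<?php
// Hypothetical helper: build a stream context that POSTs form fields.
function postContext(array $fields)
{
    return stream_context_create([
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => http_build_query($fields),
            'timeout' => 10, // seconds
        ],
    ]);
}

// Usage (field names are assumptions):
// $context = postContext(['userid' => 'me', 'password' => 'secret']);
// $result = file_get_contents($loginUrl, false, $context);
```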

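cURL

If your server supports cURL (it usually does), using cURL is recommended. A key advantage of cURL is that it relies on a popular C library that is commonly used from other programming languages as well. It also provides a convenient way to create request headers, parses response headers automatically, and offers a simple interface for handling errors.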

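$options = array(); // per-request overrides, if any

$defaults = array(
    CURLOPT_URL => "http://www.4wtech.com/csp/web/Employee/Login.csp",
    CURLOPT_HEADER => 0,
    CURLOPT_RETURNTRANSFER => true // return the body instead of printing it
);

$ch = curl_init();
curl_setopt_array($ch, ($options + $defaults));
$result = curl_exec($ch);
if ($result === false) {
    trigger_error(curl_error($ch));
}
curl_close($ch);
echo $result;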

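Libraries

Alternatively, you can use one of the many PHP libraries. I would not recommend using a library, though, as it is likely to be overkill. In most cases you are better off writing your own HTTP class that uses cURL under the hood.

HTML parsing

PHP has a convenient way to load any HTML into a DOMDocument.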

$pagecontent = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp');
$doc = new DOMDocument();
$doc->loadHTML($pagecontent);
echo $doc->saveHTML();
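Once the HTML is in a DOMDocument, DOMXPath can pull out the parts you need. A minimal sketch that collects every link target; `extractLinks` is a hypothetical helper. libxml warnings are collected internally because loadHTML may complain about markup it does not recognize.

```php
<?php
// Hypothetical helper: return the href of every <a> element in the HTML.
function extractLinks(string $html): array
{
    // Collect parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Usage: print_r(extractLinks($pagecontent));
```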

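Unfortunately, PHP's support for HTML5 is limited. If you run into errors while trying to parse the page content, consider using a third-party library. For that, I can recommend Masterminds/html5-php. Parsing an HTML file with this library is much like parsing it with DOMDocument.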

use Masterminds\HTML5;

$pagecontent = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp');
$html5 = new HTML5();
$dom = $html5->loadHTML($pagecontent);
echo $html5->saveHTML($dom);

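Alternatively, you can use, for example, my library PHPPowertools/DOM-Query. Under the hood it uses Masterminds/html5-php to parse an HTML5 string into a DomDocument, and symfony/DomCrawler to convert CSS selectors to XPath selectors. It always uses the same DomDocument, even when one object is passed to another, to ensure decent performance.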

namespace PowerTools;

// Get file content
$pagecontent = file_get_contents( 'http://www.4wtech.com/csp/web/Employee/Login.csp' );

// Define your DOMCrawler based on file string
$H = new DOM_Query( $pagecontent );

// Define your DOMCrawler based on an existing DOM_Query instance
$H = new DOM_Query( $H->select('body') );

// Passing a string (CSS selector)
$s = $H->select( 'div.foo' );

// Passing an element object (DOM Element)
$s = $H->select( $documentBody );

// Passing a DOM Query object
$s = $H->select( $H->select('p + p') );

// Select the body tag
$body = $H->select('body');

// Combine different classes as one selector to get all site blocks
$siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');

// Nest your methods just like you would with jQuery
$siteblocks->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the text of all site blocks
$siteblocks->text(function( $i, $val) {
    return $i . " - " . $val->attr('class');
});

// Append the following HTML to all site blocks
$siteblocks->append('<div class="site-center"></div>');

// Use a descendant selector to select the site's footer
$sitefooter = $body->select('.site-footer > .site-center');

// Set some attributes for the site's footer
$sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));

// Use a lambda function to set the attributes of all site blocks
$siteblocks->attr('data-val', function( $i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});

// Select the parent of the site's footer
$sitefooterparent = $sitefooter->parent();

// Remove the class of all i-tags within the site's footer's parent
$sitefooterparent->select('i')->removeAttr('class');

// Wrap the site's footer within two new elements
$sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');
