
I built a simple web crawler that echoes the content within the <body> of a page. It does that well; the only problem is that it doesn't follow the links on a page to other pages in order to crawl them recursively.

In the output I only see the content of the page on which I started the crawler manually; there is no sign of it following any links.

How can I make it follow the links, crawl those pages, and echo their content as well?

Here is the code:

<?php
error_reporting( E_ERROR );

define( "CRAWL_LIMIT_PER_DOMAIN", 50 );

$domains = array();

$urls = array();

function crawl( $url )
{
    global $domains, $urls;
    $parse = parse_url( $url );
    $domains[ $parse['host'] ]++;
    $urls[] = $url;

    $content = file_get_contents( $url );
    if ( $content === FALSE ){
        return;
    }

    $content = stristr($content, "<body>");
    preg_match_all( '/http:\/\/[^ "\']+/', $content, $matches );

    // do something with content.
    echo strip_tags($content);

    foreach( $matches[0] as $crawled_url ) {
        $parse = parse_url( $crawled_url );
        if ( count( $domains[ $parse['host'] ] ) < CRAWL_LIMIT_PER_DOMAIN && !in_array( $crawled_url, $urls ) ) {
            sleep( 1 );
            crawl( $crawled_url );
        }
    }
}

crawl('http://the-irf.com/hello/hello6.html');
?>

Update: I assume the regular expression (/http:\/\/[^ "\']+/) is the problem. How can I write a regular expression that catches every href of every anchor, no matter whether it starts with

http://
http://www.
www.
https://
https://www.

or anything else (e.g. a relative file path like index.html)? Or is there a better way to build this regular expression?


1 Answer


You should (as always) first make up your mind about what you are actually doing.

As you outline in your question, you are doing a text search for URL patterns of the HTTP protocol. A common regular expression would normally also cover the https: URI scheme:

~https?://\S*~

That is everything up to the first whitespace character, which usually detects the broader range of HTTP URLs in a string. If you need something more advanced, see the Stack Overflow Q&A on making links within text clickable.
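
For illustration, a minimal sketch of that pattern in use; the sample input string is made up:

<?php
// Minimal sketch: apply the broadened pattern to a plain-text string.
// The sample input below is made up for illustration.
$text = 'Visit http://example.org/start or https://www.example.com/page?id=1 today.';

preg_match_all( '~https?://\S*~', $text, $matches );

print_r( $matches[0] );
// [0] => http://example.org/start
// [1] => https://www.example.com/page?id=1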

This still won't solve all of your crawling problems, for two reasons:

  1. Character encoding: if you want to do this properly, you need to know the correct character encoding of the string and make the regular expression fit it (a small sketch follows after this list).
  2. That is text. Websites consist not only of text but also of HTML markup, which carries its own semantics.
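
For the first point, a minimal sketch of such a normalization step; the candidate encoding list and the URL are assumptions for illustration:

<?php
// Minimal sketch: heuristically normalize a fetched document to UTF-8
// before any text-level matching. The candidate encodings below are an
// assumption, not an exhaustive list.
$content = file_get_contents( 'http://example.com/' ); // placeholder URL

$encoding = mb_detect_encoding( $content, array( 'UTF-8', 'ISO-8859-1', 'Windows-1252' ), true );
if ( $encoding !== false && $encoding !== 'UTF-8' ) {
    $content = mb_convert_encoding( $content, 'UTF-8', $encoding );
}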

So text analysis alone is not enough; you also need to parse the HTML. That means you need to take the base URI of the document and resolve every other URI inside it against that base, to obtain the list of all absolute links in that document.

You find this outlined in the following whitepaper:
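
As a rough sketch of the HTML-parsing half, PHP's DOMDocument can collect the raw href values; the URL below is a placeholder, and the collected hrefs may still be relative at this point:

<?php
// Rough sketch: collect all <a href="..."> values with DOMDocument
// instead of a regular expression. The URL is a placeholder.
$html = file_get_contents( 'http://example.com/' );

$doc = new DOMDocument();
libxml_use_internal_errors( true ); // real-world HTML is rarely valid
$doc->loadHTML( $html );
libxml_clear_errors();

$hrefs = array();
foreach ( $doc->getElementsByTagName( 'a' ) as $anchor ) {
    $href = $anchor->getAttribute( 'href' );
    if ( $href !== '' ) {
        $hrefs[] = $href; // may still be relative; needs resolving
    }
}

print_r( $hrefs );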

For PHP the two most stable components to work with for this are:

  1. DOMDocument - A PHP extension to parse XML and HTML documents. Here you are naturally interested in parsing HTML documents.
  2. Net_Url2 - A PEAR package for dealing with URLs, including RFC 3986 conformant reference resolution (the differences from the previous RFC version you can safely ignore; the standard is as stable as the PHP library is, and the two minor bugs still open concern very narrow and specific cases and already have patches). A combined sketch using both follows below.
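
Putting the two together, the resolution step might look roughly like this; a sketch that assumes the PEAR package Net_URL2 is installed (pear install Net_URL2), with a made-up $hrefs list standing in for the output of the DOMDocument sketch above:

<?php
// Sketch: resolve collected hrefs against the page's base URI to
// obtain absolute links. Assumes the PEAR package Net_URL2 is installed.
require_once 'Net/URL2.php';

$base = new Net_URL2( 'http://the-irf.com/hello/hello6.html' );

// e.g. the href values collected by the DOMDocument sketch above
$hrefs = array( 'index.html', '../up/', 'http://other.example/page' );

$links = array();
foreach ( $hrefs as $href ) {
    // RFC 3986 reference resolution: "index.html", "../up/" and
    // "//host/path" style references all become absolute URLs here.
    $links[] = (string) $base->resolve( $href );
}

print_r( array_unique( $links ) );
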
answered 2013-04-07 at 22:51