php - Trying to scrape all Facebook links from a web page

Question

I'm trying to scrape the Facebook links from a page. However, I get a blank page with no error message.

My code is as follows:

<?php
error_reporting(E_ALL);

function getFacebook($html) {

    $matches = array();
    if (preg_match('~^https?://(?:www\.)?facebook.com/(.+)/?$~', $html, $matches)) {
        print_r($matches);

    }
}

$html = file_get_contents('http://curvywriter.info/contact-me/');

getFacebook($html);

What's wrong with it?

score 1 · Accepted Answer

A better (and more robust) option is to use DOMDocument and DOMXPath:

<?php
error_reporting(E_ALL);

function getFacebook($html) {

    $dom = new DOMDocument;
    @$dom->loadHTML($html);

    $query = new DOMXPath($dom);

    $result = $query->evaluate("(//a|//A)[contains(@href, 'facebook.com')]");

    $return = array();

    foreach ($result as $element) {
        /** @var $element DOMElement */
        $return[] = $element->getAttribute('href');
    }

    return $return;

}

$html = file_get_contents('http://curvywriter.info/contact-me/');

var_dump(getFacebook($html));
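
For trying this without a network request, here is a minimal sketch of the same DOM approach run against an inline string. The sample markup, the URLs in it, and the function name getFacebookLinks are made up for illustration. It also swaps the @ operator for libxml_use_internal_errors(), which is one common way to keep parser warnings for sloppy HTML out of the output.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$sampleHtml = <<<HTML
<html><body>
<a href="https://www.facebook.com/example.page">Facebook</a>
<a href="https://example.com/about">About</a>
<a href="http://facebook.com/another.page/">Also Facebook</a>
</body></html>
HTML;

function getFacebookLinks(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];

    // Same XPath expression as in the answer: anchors whose href mentions facebook.com.
    foreach ($xpath->query("(//a|//A)[contains(@href, 'facebook.com')]") as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }

    return $links;
}

print_r(getFacebookLinks($sampleHtml));
```

Only the two facebook.com hrefs should come back; the example.com link is filtered out by the contains() test.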

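However, for your specific problem, I made the following changes:

Changed preg_match to preg_match_all, so that it does not stop after the first match.
Removed the ^ (start) and $ (end) anchors from the pattern. Your links will appear in the middle of the document, not at the start or the end (and definitely not both at once!)

So the corrected code is: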
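
The effect of both changes can be checked on a small string. This is only a sketch: the sample markup and the variable names are invented, and the pattern is shortened to the part that matters. Note that preg_match returns 0 (not false) when nothing matches, so the if body in the question is skipped silently, which is why the page stays blank without any error.

```php
<?php
// Hypothetical one-line stand-in for a downloaded page.
$html = '<p>Find me on <a href="https://www.facebook.com/example.page">Facebook</a></p>';

$anchored   = '~^https?://(?:www\.)?facebook\.com/~';
$unanchored = '~https?://(?:www\.)?facebook\.com/~';

// ^ requires the URL to begin at the very start of the string: no match here.
var_dump(preg_match($anchored, $html));          // int(0)

// Without the anchor the URL may sit anywhere in the text.
var_dump(preg_match($unanchored, $html));        // int(1)

// preg_match stops at the first hit; preg_match_all keeps scanning
// and returns the total number of matches.
$twice = $html . "\n" . $html;
var_dump(preg_match($unanchored, $twice));       // int(1)
var_dump(preg_match_all($unanchored, $twice));   // int(2)
```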

<?php
error_reporting(E_ALL);

function getFacebook($html) {

    $matches = array();
    if (preg_match_all('~https?://(?:www\.)?facebook.com/(.+)/?~', $html, $matches)) {
        print_r($matches);

    }
}

$html = file_get_contents('http://curvywriter.info/contact-me/');

getFacebook($html);
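
One caveat with the pattern above: (.+) is greedy and the dot matches everything except a newline, so each match runs to the end of the line, and a line with two links yields a single match that swallows both. The dot in facebook.com is also unescaped, so it matches any character. A tighter variant, as a sketch only (the function name extractFacebookUrls and the sample string are hypothetical, and the character class is a rough guess at where a URL ends inside HTML):

```php
<?php
function extractFacebookUrls(string $html): array
{
    // "facebook\.com" matches a literal dot; the path part stops at
    // whitespace, quotes, or angle brackets instead of running to end of line.
    $pattern = '~https?://(?:www\.)?facebook\.com/[^\s"\'<>]+~i';

    preg_match_all($pattern, $html, $matches);

    // $matches[0] holds the full matches; drop duplicates and reindex.
    return array_values(array_unique($matches[0]));
}

// Hypothetical sample with two links on the same line.
$sample = '<a href="https://www.facebook.com/example.page">FB</a> '
        . '<a href="http://facebook.com/other/">FB2</a>';

print_r(extractFacebookUrls($sample));
```

Returning the array rather than printing it inside the function mirrors the DOMXPath version, so either one can be swapped in by the caller.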
