php - Fetching pages from Twitter or Facebook anonymously via curl

Question

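I'm trying to build a kind of page parser (more specifically, one that highlights certain words on a page), but I've run into a problem. I'm using curl to fetch the whole page from a URL; most pages cooperate nicely, while others don't.

My goal is to get the full page HTML just as a browser does, and I'm trying to do it anonymously, again like a browser. In other words, if a page requires a login before it shows data in the browser, I'm not interested in it. The problem is that I can't fetch Twitter or Facebook pages that are accessible anonymously from a regular browser, even though I set all the headers the way Firefox or Chrome normally sends them.

Is there a way to simply emulate a browser and fetch pages from these sites, or do I have to use OAuth (and can someone explain why browsers don't need to use it)?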

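Edit: I found the solution! If anyone runs into this problem, you should:
-> try switching the protocol from https to http
-> remove the /#!/ element if the URL contains one
-> in my case the curl header "Accept-Encoding: gzip, deflate" also caused problems. I don't know why, but everything works now

My code: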

    // the pages were only fetched correctly over plain http
    if (substr($this->url, 0, 5) == 'https')
        $this->url = str_replace('https://', 'http://', $this->url);

    // drop the hashbang part of the URL
    $this->url = str_replace('/#!/', '/', $this->url);

    // check if a valid url is provided
    if (!filter_var($this->url, FILTER_VALIDATE_URL))
        return false;

    $curl = curl_init();

    // request headers modelled on what Firefox sends
    $header = array();
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    // -> gives an error: $header[] = "Accept-Encoding: gzip, deflate";
    $header[] = "Accept-Language: pl,en-us;q=0.7,en;q=0.3";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Pragma: "; // browsers keep this blank.
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_HEADER, false); // body only, no response headers

    curl_setopt($curl, CURLOPT_URL, $this->url);

    // keep cookies between requests, as a browser would
    curl_setopt($curl, CURLOPT_COOKIEJAR, "cookie.txt");
    curl_setopt($curl, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($curl, CURLOPT_COOKIESESSION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); // return the page instead of printing it
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7 (.NET CLR 3.5.30729)');
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); // follow redirects

    $response = curl_exec($curl);
    curl_close($curl);

    if ($response) return $response;

    return false;

It all lives inside a class, but you can extract the code very easily. For me, both Twitter and Facebook work fine.
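A likely explanation for the Accept-Encoding problem: when that header is added by hand through CURLOPT_HTTPHEADER, the server may answer with a compressed body, and cURL hands those bytes back undecoded, because it only decompresses when CURLOPT_ENCODING is set. The sketch below illustrates the effect without any network access. `decodeBody` and the sample markup are hypothetical and not part of the code above, and the sketch assumes the zlib extension is loaded.

```php
<?php
// Hypothetical helper: decode a response body according to the value of
// its Content-Encoding header. Assumes the zlib extension is available.
function decodeBody(string $body, string $contentEncoding)
{
    switch (strtolower(trim($contentEncoding))) {
        case 'gzip':
            return gzdecode($body);
        case 'deflate':
            // HTTP "deflate" is normally zlib-wrapped; some servers send a
            // raw deflate stream instead, so fall back to gzinflate().
            $decoded = @gzuncompress($body);
            return $decoded !== false ? $decoded : @gzinflate($body);
        default:
            return $body; // identity or unknown: return unchanged
    }
}

// Simulate what a server may send after seeing "Accept-Encoding: gzip".
$html = '<html><body>Hello</body></html>';
$compressed = gzencode($html);

// Undecoded, the "page" is binary data that starts with the gzip magic bytes.
$startsWithGzipMagic = substr($compressed, 0, 2) === "\x1f\x8b";

// After decoding, the original markup is back.
$decoded = decodeBody($compressed, 'gzip');
```

In the cURL code itself, the equivalent would presumably be to leave the manual header out and call `curl_setopt($curl, CURLOPT_ENCODING, '')`, which makes cURL send Accept-Encoding for every encoding it supports and decode the response on its own.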

score 3 · Accepted Answer

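Yes, it is possible to emulate a browser, but you need to look carefully at all the HTTP headers the browser sends (including cookies), and you also need to handle redirects. Some of this can be "automated" by the cURL functions; the rest you have to handle manually.

Note: I'm not talking about HTML headers in your code; these are the HTTP headers that the browser sends and receives.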
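To make that distinction concrete: with CURLOPT_HEADER set to true, curl_exec returns the raw response headers in front of the body, and those are the HTTP headers meant here. The following is a minimal sketch of pulling the status code, the redirect target and the cookies out of such a header block. `parseResponseHeaders` and the sample response are made up for illustration.

```php
<?php
// Hypothetical helper: split one raw HTTP header block into the status code,
// header fields (lower-cased names) and cookies taken from Set-Cookie lines.
function parseResponseHeaders(string $rawHeaders): array
{
    $result = ['status' => null, 'headers' => [], 'cookies' => []];
    $lines = preg_split('/\r\n|\n/', trim($rawHeaders));

    foreach ($lines as $line) {
        // Status line, e.g. "HTTP/1.1 302 Found"
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $result['status'] = (int) $m[1];
            continue;
        }
        if (strpos($line, ':') === false) {
            continue; // not a header field
        }
        list($name, $value) = explode(':', $line, 2);
        $name = strtolower(trim($name));
        $value = trim($value);

        if ($name === 'set-cookie') {
            // Keep only the leading "name=value" pair, drop attributes like Path.
            $pair = explode(';', $value, 2)[0];
            if (strpos($pair, '=') !== false) {
                list($cookieName, $cookieValue) = explode('=', $pair, 2);
                $result['cookies'][trim($cookieName)] = trim($cookieValue);
            }
        } else {
            $result['headers'][$name] = $value;
        }
    }
    return $result;
}

// Made-up response headers, shaped like what a redirecting site might send.
$raw = "HTTP/1.1 302 Found\r\n"
     . "Location: http://example.com/login\r\n"
     . "Set-Cookie: guest_id=abc123; Path=/; HttpOnly\r\n"
     . "Set-Cookie: lang=en; Path=/\r\n"
     . "Content-Length: 0\r\n";

$parsed = parseResponseHeaders($raw);

// Build the Cookie header a browser would send on the follow-up request.
$cookiePairs = [];
foreach ($parsed['cookies'] as $name => $value) {
    $cookiePairs[] = $name . '=' . $value;
}
$cookieHeader = 'Cookie: ' . implode('; ', $cookiePairs);
```

In practice CURLOPT_COOKIEJAR, CURLOPT_COOKIEFILE and CURLOPT_FOLLOWLOCATION, as used in the question, let cURL do this bookkeeping itself; parsing by hand is mainly useful for comparing against what the browser sends and receives.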

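The easiest way to discover these is to use Fiddler to monitor the traffic. Choose a URL and look on the right for "Inspect element"; you'll see the headers sent and the headers received.

Facebook makes this more complicated by using a lot of iframes, so I suggest you start with a simpler site!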
