
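I am trying to scrape proxies from a web page using cURL and PHP. However, when I use cURL, all I get in $content is CSS. The page uses WordPress, so it may load its content dynamically, but I have not found anything that helps me download dynamic content. When I use wget on Linux, the page downloads fine.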

    <?php
    //$source1 = file_get_contents('http://www.new-fresh-proxies.blogspot.com/');
    $source1 = get_data("http://www.new-fresh-proxies.blogspot.com/");

    $array = array();
    $source1 = preg_grep('/\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\d{1,5}\b/', $array);

    // download webpage
    function get_data($url) {
        $options = array(
            CURLOPT_RETURNTRANSFER => 1,        // return the page as a string
            CURLOPT_HEADER         => true,     // include response headers in the output
            CURLOPT_FOLLOWLOCATION => true,     // follow redirects
            CURLOPT_ENCODING       => "",       // handle all encodings
            CURLOPT_USERAGENT      => "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13", // who am i
            CURLOPT_AUTOREFERER    => true,     // set referer on redirect
            CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
            CURLOPT_TIMEOUT        => 120,      // timeout on response
            CURLOPT_MAXREDIRS      => 50,       // stop after 50 redirects
        );

        $ch      = curl_init( $url );
        curl_setopt_array( $ch, $options );
        $content = curl_exec( $ch );
        $err     = curl_errno( $ch );
        $errmsg  = curl_error( $ch );
        $header  = curl_getinfo( $ch );
        curl_close( $ch );

        $header['errno']   = $err;
        $header['errmsg']  = $errmsg;
        $header['content'] = $content;
        return $header;
    }

My output:

(string:203221) HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Expires: Wed, 06 Feb 2013 22:09:23 GMT
Date: Wed, 06 Feb 2013 22:09:23 GMT
Cache-Control: private, max-age=0
Last-Modified: Wed, 06 Feb 2013 20:39:30 GMT
ETag: "c6675d47-80ec-48ee-9c0f-613c9419f172"
Content-Encoding: gzip
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Content-Length: 47132
Server: GSE

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html dir='ltr' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' xmlns:data='http://www.google.com/2005/gml/data' xmlns:expr='http://www.google.com/2005/gml/expr'>
<head>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
<script type="text/javascript">(function() { var a=window,b="jstiming",d="tick";var e=function(c){this.t={};this.tick=function(c,p,h){h=void 0!=h?h:(new Date).getTime();this.t[c]=[h,p]};this[d]("start",null,c)},f=new e;a.jstiming={Timer:e,load:f};if(a.performance&&a.performance.timing){var g=a.performance.timing,j=a[b].load,k=g.navigationStart,l=g.responseStart;0<k&&l>=k&&(j[d]("_wtsrt",void 0,k),j[d]("wtsrt_","_wtsrt",l),j[d]("tbsd_","wtsrt_"))}
try{var m=null;a.chrome&&a.chrome.csi&&(m=Math.floor(a.chrome.csi().pageT),j&&0<k&&(j[d]("_tbnd",void 0,a.chrome.csi().startE),j[d]("tbnd_","_tbnd",k)));null==m&&a.gtbExternal&&(m=a.gtbExternal.pageT());null==m&&a.external&&(m=a.external.pageT,j&&0<k&&(j[d]("_tbnd",void 0,a.external.startE),j[d]("tbnd_","_tbnd",k)));m&&(a[b].pt=m)}catch(n){};a.tickAboveFold=function(c){var i=0;if(c.offsetParent){do i+=c.offsetTop;while(c=c.offsetParent)}c=i;750>=c&&a[b].load[d]("aft")};var q=!1;function r(){q||(q=!0,a[b].load[d]("firstScrollTime"))}a.addEventListener?a.addEventListener("scroll",r,!1):a.attachEvent("onscroll",r);
 })();</script>
<meta content='true' name='MSSmartTagsPreventParsing'/>
<meta content='blogger' name='generator'/>
<link href='http://www.new-fresh-proxies.blogspot.com/favicon.ico' rel='icon' type='image/x-icon'/>
<link href='http://new-fresh-proxies.blogspot.com/' rel='canonical'/>
<link rel="alternate" type="application/atom+xml" title="New Fresh Proxies - Atom" href="http://new-fresh-proxies.blogspot.com/feeds/posts/default" />
<link rel="alternate" type="application/rss+xml" title="New Fresh Proxies - RSS" href="http://new-fresh-proxies.blogspot.com/feeds/posts/default?alt=rss" />
<link rel="service.post" type="application/atom+xml" title="New Fresh Proxies - Atom" href="http://www.blogger.com/feeds/2001908494944967755/posts/default" />
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://www.blogger.com/rsd.g?blogID=2001908494944967755" />
<link rel="openid.server" href="http://www.blogger.com/openid-server.g" />
<link rel="openid.delegate" href="http://new-fresh-proxies.blogspot.com/" />
<!--[if IE]> <script> (function() { var html5 = ("abbr,article,aside,audio,canvas,datalist,details," + "figure,footer,header,hgroup,mark,menu,meter,nav,output," + "progress,section,time,video").split(','); for (var i = 0; i < html5.length; i++) { document.createElement(html5[i]); } try { document.execCommand('BackgroundImageCache', false, true); } catch(e) {} })(); </script> <![endif]-->
<title>New Fresh Proxies</title>
<link type='text/css' rel='stylesheet' href='//www.blogger.com/static/v1/widgets/3950009988-widget_css_bundle.css' />
<link type="text/css" rel="stylesheet" href="//www.blogger.com/dyn-css/authorization.css?targetBlogID=2001908494944967755&zx=c6675d47-80ec-48ee-9c0f-613c9419f172"/>
<style id='page-skin-1' type='text/css'><!--
/*
-----------------------------------------------
Theme Name: Harmonika
Theme URL: http://newwpthemes.com/wordpress-theme/harmonika/
Description: Harmonika is a free WordPress theme with options page and supports the post thumbnails. Suitable for any niche.
Author: NewWpThemes.com
Author URI: http://newwpthemes.com/
Version: 1.0
Tags: Ads Ready, Two Columns, Right Sidebar, Fixed Width, Blogging, Options Page
Template Name: Harmonika
Template URI: http://themecraft.net/2010/06/harmonika-blogger-template
Version: 1.0
Author: Theme Craft
Author URI: http://www.themecraft.net
Harmonika Blogger Template comes under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Malaysia License.
This means it is free to use on your blog, and you must keep the footer link intact, also as a respect to the designer and converter.
We sincerely need your respect to continue our free Blogger template production, thank you.
ThemeCraft.net
----------------------------------------------- */
/* Use this with templates/template-twocol.html */
#navbar-iframe{
height:0px;
visibility:hidden;
display:none;
}
/* -----------------------------------------------------------------------
Blueprint CSS Framework 0.8
http://blueprintcss.org
* Copyright (c) 2007-Present. See LICENSE for more info.
* See README for instructions on how to use Blueprint.
* For credits and origins, see AUTHORS.
* This is a compressed file. See the sources in the 'src' directory.
----------------------------------------------------------------------- */
/* reset.css */
html, body, div, span, object, iframe, h1, h2, h3, h4, h5, h6, p, blockquote, pre, a, abbr, acronym, address, code, del, dfn, em, img, q, dl, dt, dd, ol, ul, li, fieldset, form, label, legend, table, caption, tbody, tfoot, thead, tr, th, td {margin:0;padding:0;border:0;font-weight:inherit;font-style:inherit;font-size:100%;font-family:inherit;}
body {line-height:1.5;}
caption, th, td {text-align:left;font-weight:normal;}
blockquote:before, blockquote:after, q:before, q:after {content:"";}
blockquote, q {quotes:"" "";}
img {max-width: 100%; /* not working in IE6*/}
a img {max-width: 100%; /* not working in IE6*/ border:none;}
/* typography.css */
body {font-size:75%;color:#222;background:#000;font-family:"Helvetica Neue", Arial, Helvetica, sans-serif;}
h1, h2, h3, h4, h5, h6 {font-weight:normal;}
h1 {font-size:3em;line-height:1;margin-bottom:0.5em;}
h2 {font-size:2em;margin-bottom:0.75em;}
h3 {font-size:1.5em;line-height:1;margin-bottom:1em;}
h4 {font-size:1.2em;line-height:1.25;margin-bottom:1.25em;}
h5 {font-size:1em;font-weight:bold;margin-bottom:1.5em;}
h6 {font-size:1em;font-weight:bold;}
h1 img, h2 img, h3 img, h4 img, h5 img, h6 img {margin:0;}
p {margin:0 0 1.5em;}
p img.left {float:left;margin:1.5em 1.5em 1.5em 0;padding:0;}
p img.right {float:right;margin:1.5em 0 1.5em 1.5em;}
a:focus, a:hover {color:#000;}
a {color:#009;text-decoration:underline;}
blockquote {margin:1.5em;color:#666;font-style:italic;}
strong {font-weight:bold;}
em, dfn {font-style:italic;}
dfn {font-weight:bold;}
sup, sub {line-height:0;}
abbr, acronym {border-bottom:1px dotted #666;}
address {margin:0 0 1.5em;font-style:italic;}
del {color:#666;}
pre {margin:1.5em 0;white-space:pre;}
pre, code, tt {font:1em 'andale mono', 'lucida console', monospace;line-height:1.5;}
li ul, li ol {margin:0 1.5em;}
ul, ol {margin:0 1.5em 1.5em 1.5em;}
ul {list-style-type:disc;}
ol {list-style-type:decimal;}
dl {margin:0 0 1.5em 0;}
dl dt {font-weight:bold;}
dd {margin-left:1.5em;}
table {margin-bottom:1.4em;width:100%;}
th {font-weight:bold;}
thead th {background:#c3d9ff;}
tr.even td {background:#e5ecf9;}
tfoot {font-style:italic;}
caption {background:#eee;}
.small {font-size:.8em;margin-bottom:1.875em;line-height:1.875em;}
.large {font-size:1.2em;line-height:2.5em;margin-bottom:1.25em;}
.hide {display:none;}
.quiet {color:#666;}
.loud {color:#000;}
.highlight {background:#ff0;}
.added {background:#060;color:#fff;}
.removed {background:#900;color:#fff;}
.first {margin-left:0;padding-left:0;}
.last {margin-right:0;padding-right:0;}
.top {margin-top:0;padding-top:0;}
.bottom {margin-bottom:0;padding-bottom:0;}
/* forms.css */
label {font-weight:bold;}
fieldset {padding:1.4em;margin:0 0 1.5em 0;border:1px solid #ccc;}
legend {font-weight:bold;font-size:1.2em;}
input.text, input.title, textarea, select {margin:0.5em 0;border:1px solid #bbb;}
input.text:focus, input.title:focus, textarea:focus, select:focus {border:1px solid #666;}
input.text, input.title {width:300px;padding:5px;}
input.title {font-size:1.5em;}
textarea {width:390px;height:250px;padding:5px;}
.error, .notice, .success {padding:.8em;margin-bottom:1em;border:2px solid #ddd;}
.error {background:#FBE3E4;color:#8a1f11;border-color:#FBC2C4;}
.notice {background:#FFF6BF;color:#514721;border-color:#FFD324;}
.success {background:#E6EFC2;color:#264409;border-color:#C6D880;}
.error a {color:#8a1f11;}
.notice a {color:#514721;}
.success a {color:#264409;}
/* grid.css */
.container {width:950px;margin:0 auto;}
.showgrid {background:url(src/grid.png);}
.column, div.span-1, div.span-2, div.span-3, div.span-4, div.span-5, div.span-6, div.span-7, div.span-8, div.span-9, div.span-10, div.span-11, div.span-12, div.span-13, div.span-14, div.span-15, div.span-16, div.span-17, div.span-18, div.span-19, div.span-20, div.span-21, div.span-22, div.span-23, div.span-24 {float:left;margin-right:10px;}
.last, div.last {margin-right:0;}
.span-1 {width:30px;}
.span-2 {width:70px;}
.span-3 {width:110px;}
.span-4 {width:150px;}
.span-5 {width:190px;}
.span-6 {width:230px;}
.span-7 {width:270px;}
.span-8 {width:310px;}
.span-9 {width:350px;}
.span-10 {width:390px;}
.span-11 {width:430px;}
.span-12 {width:470px;}
.span-13 {width:510px;}
.span-14 {width:550px;}
.span-15 {width:590px;}
.span-16 {width:630px;}
.span-17 {width:670px;}
.span-18 {width:710px;}
.span-19 {width:750px;}
.span-20 {width:790px;}
.span-21 {width:830px;}
.span-22 {width:870px;}
.span-23 {width:910px;}
.span-24, div.span-24 {width:950px;margin:0;}
input.span-1, textarea.span-1, input.span-2, textarea.span-2, input.span-3, textarea.span-3, input.span-4, textarea.span-4, input.span-5, textarea.span-5, input.span-6, textarea.span-6, input.span-7, textarea.span-7, input.span-8, textarea.span-8, input.span-9, textarea.span-9, input.span-10, textarea.span-10, input.span-11, te...

3 Answers


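cURL cannot get it directly because it does not execute JavaScript. But if the content comes from an AJAX request, you can make a request to that endpoint directly.

Use the developer tools / Firebug to see what is happening.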
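
If the browser's network tab shows the list arriving from a separate request, that URL can be fetched on its own. A minimal sketch, assuming the endpoint returns plain text or HTML: `fetch_endpoint` is a hypothetical helper, the `X-Requested-With` header is an illustrative guess rather than something this page is known to require, and the feed URL in the usage comment is simply the Atom link advertised in the page's own head section.

```php
<?php
// Hypothetical helper: request a URL the way a script-driven (AJAX) call might.
function fetch_endpoint(string $url, int $timeout = 30): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING       => '',       // accept and decode any supported compression
        CURLOPT_TIMEOUT        => $timeout, // overall time limit in seconds
        // Assumption: some endpoints check for this header; many do not.
        CURLOPT_HTTPHEADER     => ['X-Requested-With: XMLHttpRequest'],
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}

// Usage (performs a network request, so it is left commented out):
// $feed = fetch_endpoint('http://new-fresh-proxies.blogspot.com/feeds/posts/default');
```

CURLOPT_HEADER is deliberately left off here, so the returned string contains only the body.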

Answered 2013-02-06T22:16:53.370

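A few things:

  • Where does your "output" come from? I don't see anything in your code that displays it...
  • I also think your preg_grep call is incorrect. You are searching an empty array and saving the result into the variable you had just fetched the data into. Try: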

$array = preg_grep('/\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\d{1,5}\b/', $source1);

When I run the code and dump $source1['content'] directly after the call to get_data, I get a large number of IP addresses...
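
Since get_data returns an array and CURLOPT_HEADER prepends the response headers to the body, the pattern is easier to apply to the body string alone. A minimal sketch, assuming the array shape returned by get_data in the question: `strip_headers` and `extract_proxies` are hypothetical names, the sample data is made up, and `preg_match_all` is swapped in for `preg_grep` because the input is a single string rather than an array of lines.

```php
<?php
// Hypothetical helper: drop the response headers that CURLOPT_HEADER prepends.
// $info is the array returned by get_data() in the question.
function strip_headers(array $info): string
{
    $content    = (string) ($info['content'] ?? '');
    $headerSize = (int) ($info['header_size'] ?? 0); // filled in by curl_getinfo()
    return substr($content, $headerSize);
}

// Hypothetical helper: collect every distinct ip:port pair found in a string.
function extract_proxies(string $text): array
{
    $pattern = '/\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}\b/';
    preg_match_all($pattern, $text, $matches);
    return array_values(array_unique($matches[0]));
}

// Made-up data standing in for a real response.
$headers = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";
$sample  = [
    'content'     => $headers . "<p>1.2.3.4:8080<br/>5.6.7.8:3128<br/>1.2.3.4:8080</p>",
    'header_size' => strlen($headers),
];
print_r(extract_proxies(strip_headers($sample)));
```

Duplicates are removed because the same proxy may appear in more than one post on the page.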

Answered 2013-02-06T22:21:22.313

It looks to me like either a timeout or a problem with your regular expression.

Why not stick with file_get_contents, as you first tried?

$content = file_get_contents('http://www.new-fresh-proxies.blogspot.com.au');

preg_match_all('/(\d+\.\d+\.\d+\.\d+(:\d+)?)/', $content, $matches);

print_r($matches[1]);

This prints out the list of IPs:

Array
(
    [0] => 1.204.168.15:6673
    [1] => 1.234.45.130:80
    [2] => 1.34.163.101:8080
    [3] => 1.34.29.89:8080
    [4] => 1.34.8.221:3128
    ....
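
The pattern above accepts any run of digits and makes the port optional, so a string such as 999.1.1.1:70000 would also match. A minimal sketch of a stricter post-filter, assuming the matches are meant to be `ip:port` strings: `filter_valid_proxies` is a hypothetical name, and apart from the first entry the sample values are made up.

```php
<?php
// Hypothetical helper: keep only entries with a valid IPv4 address and port.
function filter_valid_proxies(array $candidates): array
{
    $valid = [];
    foreach ($candidates as $candidate) {
        $parts = explode(':', $candidate);
        if (count($parts) !== 2) {
            continue; // no port, or more than one colon
        }
        [$ip, $port] = $parts;

        // Each octet must be in 0-255.
        $ipOk = filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false;
        // The port must be all digits and within 1-65535.
        $portOk = ctype_digit($port) && (int) $port >= 1 && (int) $port <= 65535;

        if ($ipOk && $portOk) {
            $valid[] = $candidate;
        }
    }
    return $valid;
}

// Sample input: one entry from the list above plus made-up invalid ones.
$checked = filter_valid_proxies(['1.34.8.221:3128', '999.1.1.1:80', '1.2.3.4:70000', '1.2.3.4']);
print_r($checked);
```

Entries without a port are dropped here, on the assumption that a proxy list is only useful with both parts present.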

Hope that helps.

Answered 2013-02-06T22:30:46.237