
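I'm using a PHP script (with cURL) to check whether:

  • the links in my database are correct (i.e. they return HTTP status 200)
  • links that are redirected actually lead to an appropriate/similar page (judged by the page's content)

The results are saved to a log file and emailed to me as an attachment.

This all works, but it is very slow, and about half the time it times out and aborts early. Note that I have about 16,000 links to check.

What is the best way to make this run faster, and what am I doing wrong?

Code below: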

function echoappend ($file,$tobewritten) {

        fwrite($file,$tobewritten);
        echo $tobewritten;
}

error_reporting(E_ALL);
ini_set('display_errors', '1');


$filename=date('YmdHis') . "linkcheck.htm";
echo $filename;
$file = fopen($filename,"w+");

try {
        $conn = new PDO('mysql:host=localhost;dbname=databasename',$un,$pw);
        $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
        echo '<b>connected to db</b><br /><br />';

        $sitearray = array("medical.posterous","ebm.posterous","behavenet","guidance.nice","www.rch","emedicine","www.chw","www.rxlist","www.cks.nhs.uk");

        foreach ($sitearray as $key => $value) {    
            $site=$value;

            echoappend ($file, "<h1>" . $site . "</h1>");

            $q="SELECT * FROM link WHERE url LIKE :site";
            $stmt = $conn->prepare($q);
            $stmt->execute(array(':site' => 'http://' . $site . '%'));
            $result = $stmt->fetchAll();

            $totallinks = 0;
            $workinglinks = 0;

            foreach($result as $row)
            {

                $ch = curl_init();
                $originalurl = $row['url'];

                curl_setopt($ch, CURLOPT_URL, $originalurl);
                curl_setopt($ch, CURLOPT_HEADER, 1);
                curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
                curl_setopt($ch, CURLOPT_NOBODY, true);
                curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);


                $output = curl_exec($ch);
                if ($output === FALSE) {
                    echo "cURL Error: " . curl_error($ch);
                }

                $urlinfo = curl_getinfo($ch);

                if ($urlinfo['http_code'] == 200)
                {
                    echoappend($file, $row['name'] . ": <b>working!</b><br />");
                    $workinglinks++;
                }
                else if ($urlinfo['http_code'] == 301 || 302)
                {
                    $redirectch = curl_init();                  
                    curl_setopt($redirectch, CURLOPT_URL, $originalurl);
                    curl_setopt($redirectch, CURLOPT_HEADER, 1);
                    curl_setopt($redirectch, CURLOPT_RETURNTRANSFER, 1);
                    curl_setopt($redirectch, CURLOPT_NOBODY, false);
                    curl_setopt($redirectch, CURLOPT_FOLLOWLOCATION, true);

                    $redirectoutput = curl_exec($redirectch);

                    $doc = new DOMDocument();
                    @$doc->loadHTML($redirectoutput);
                    $nodes = $doc->getElementsByTagName('title');

                    $title = $nodes->item(0)->nodeValue;

                    echoappend ($file, $row['name'] . ": <b>redirect ... </b>" . $title . " ... ");

                    if (strpos(strtolower($title),strtolower($row['name']))===false) {
                        echoappend ($file, "FAIL<br />");
                    }
                    else {
                        $header = curl_getinfo($redirectch);
                        echoappend ($file, $header['url']);
                        echoappend ($file, "SUCCESS<br />");
                    }

                    curl_close($redirectch);
                }
                else
                {
                    echoappend ($file, $row['name'] . ": <b>FAIL code</b>" . $urlinfo['http_code'] . "<br />");
                }

                curl_close($ch);

                $totallinks++;
            }
            echoappend ($file, '<br />');

            echoappend ($file, $site . ": " . $workinglinks . "/" . $totallinks . " links working. <br /><br />");


        }

        $conn = null;
        echo '<br /><b>connection closed</b><br /><br />';

    } catch(PDOException $e) {
            echo 'ERROR: ' . $e->getMessage();
    }

2 Answers


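The short answer is to use the curl_multi_* functions to parallelize your requests.

The slowness comes from the fact that web requests are comparatively slow, sometimes very slow. The curl_multi_* functions let you run several requests at the same time.

One thing to watch out for is limiting the number of requests in flight at once. In other words, don't run all 16,000 requests simultaneously. Perhaps start with 16 and see how that goes.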
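
As a rough sketch of that throttling idea, assuming the URLs are already in a plain PHP array, array_chunk() can split the list into fixed-size batches so that only one batch is in flight at a time. The example.com URLs and the batch size of 16 are placeholders for illustration, not values taken from the original script.

```php
<?php
// Hypothetical URL list standing in for the rows read from the database.
$urls = array();
for ($i = 1; $i <= 40; $i++) {
    $urls[] = "http://example.com/page/$i";
}

// Assumed starting point; tune it after seeing how the remote servers cope.
$batch_size = 16;

// array_chunk() splits the list into consecutive groups of at most $batch_size.
$batches = array_chunk($urls, $batch_size);

foreach ($batches as $n => $batch) {
    // Each batch would be handed to a curl_multi_* routine at this point.
    echo "batch $n: " . count($batch) . " URLs\n";
}
```

The last batch is simply shorter when the list does not divide evenly, so no special handling is needed for the remainder.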

The following example should get you started:

<?php

//
// Fetch a bunch of URLs in parallel. Returns an array of results indexed
// by URL.
//
function fetch_urls($urls, $curl_options = array()) {
  $curl_multi = curl_multi_init();
  $handles = array();

  $options = $curl_options + array(
    CURLOPT_HEADER         => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_NOBODY         => true,
    CURLOPT_FOLLOWLOCATION => true);

  foreach($urls as $url) {
    $handles[$url] = curl_init($url);
    curl_setopt_array($handles[$url], $options);
    curl_multi_add_handle($curl_multi, $handles[$url]);
  }

  $active = null;
  do {
    $status = curl_multi_exec($curl_multi, $active);
  } while ($status == CURLM_CALL_MULTI_PERFORM);

  while ($active && ($status == CURLM_OK)) {
    if (curl_multi_select($curl_multi) != -1) {
      do {
        $status = curl_multi_exec($curl_multi, $active);
      } while ($status == CURLM_CALL_MULTI_PERFORM);
    }
  }

  if ($status != CURLM_OK) {
    trigger_error("Curl multi read error $status\n", E_USER_WARNING);
  }

  $results = array();
  foreach($handles as $url => $handle) {
    $results[$url] = curl_getinfo($handle);
    curl_multi_remove_handle($curl_multi, $handle);
    curl_close($handle);    
  }
  curl_multi_close($curl_multi);

  return $results;
}

//
// The urls to test
//
$urls = array("http://google.com", "http://yahoo.com", "http://google.com/probably-bogus", "http://www.google.com.au");

//
// The number of URLs to test simultaneously
//
$request_limit = 2;

//
// Test URLs in batches
//
$redirected_urls = array();
for ($i = 0 ; $i < count($urls) ; $i += $request_limit) {
  $results = fetch_urls(array_slice($urls, $i, $request_limit));
  foreach($results as $url => $result) {
    if ($result['http_code'] == 200) {
      $status = "Worked!";
    } else {
      $status = "FAILED with {$result['http_code']}";
    }
    if ($result["redirect_count"] > 0) {
      // Remember redirected URLs for the second pass
      array_push($redirected_urls, $url);
      echo "{$url}: redirected to {$result['url']} and {$status}\n";
    } else {
      echo "{$url}: {$status}\n";
    }
  }
}

//
// Handle redirected URLs
//
echo "Processing redirected URLs...\n";
for ($i = 0 ; $i < count($redirected_urls) ; $i += $request_limit) {
  $results = fetch_urls(array_slice($redirected_urls, $i, $request_limit), array(CURLOPT_FOLLOWLOCATION => false));
  foreach($results as $url => $result) {
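    // With redirects not followed, 'url' is still the original URL;
    // the target of the redirect is reported in 'redirect_url'.
    if ($result['http_code'] == 301) {
      echo "{$url} permanently redirected to {$result['redirect_url']}\n";
    } else if ($result['http_code'] == 302) {
      echo "{$url} temporarily redirected to {$result['redirect_url']}\n";
    } else {
      echo "{$url}: FAILED with {$result['http_code']}\n";
    }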
  }
}

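The code above processes a list of URLs in batches, in two passes. In the first pass, every request is configured to follow redirects, and the code simply reports whether each URL ultimately led to a successful or a failed request.

The second pass processes all the redirected URLs detected in the first pass and reports whether each redirect is permanent (meaning you can update your database with the new URL) or temporary (meaning you should not update your database).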
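
The permanent/temporary distinction can be turned into a small decision helper. The sketch below is only an illustration: classify_redirect() and its return values are hypothetical names, and it assumes it receives an array shaped like the output of curl_getinfo() from a request that did not follow redirects.

```php
<?php
// Hypothetical helper: maps a curl_getinfo()-style array from the second
// pass (redirects not followed) to an action for the link database.
function classify_redirect(array $info) {
    switch ($info['http_code']) {
        case 301:
            return 'update';  // permanent: safe to store the new URL
        case 302:
            return 'keep';    // temporary: leave the stored URL alone
        default:
            return 'failed';  // anything else needs a closer look
    }
}

echo classify_redirect(array('http_code' => 301)), "\n";
echo classify_redirect(array('http_code' => 302)), "\n";
echo classify_redirect(array('http_code' => 404)), "\n";
```

Other redirect codes such as 307 or 308 fall into the default branch here; whether they deserve their own case depends on how the link database is meant to be maintained.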

Note:

Your original code contains the following line, which does not work the way you expect:

else if ($urlinfo['http_code'] == 301 || 302)

That expression always evaluates to TRUE. The correct expression is:

else if ($urlinfo['http_code'] == 301 || $urlinfo['http_code'] == 302)
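
The reason is operator precedence: == binds more tightly than ||, so the original condition is read as ($code == 301) || 302, and the non-zero integer 302 is always truthy. A minimal sketch to illustrate this follows; the variable names and the in_array() alternative are my own additions, not part of the original script.

```php
<?php
$code = 404;

// Parsed as ($code == 301) || 302; the literal 302 is truthy on its own.
$buggy = ($code == 301 || 302);

// Each comparison spelled out explicitly.
$fixed = ($code == 301 || $code == 302);

// Equivalent alternative using in_array() with strict comparison.
$alt = in_array($code, array(301, 302), true);

var_dump($buggy, $fixed, $alt);
```

For a 404 response the first check passes while the other two do not, which is why every non-200 link ended up in the redirect branch.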
answered 2012-11-23T05:49:14.163

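Also, put

set_time_limit(0);

at the top of your script to stop it from aborting when it hits the 30-second limit (PHP's default max_execution_time).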

answered 2012-11-23T05:53:52.220