As the websites you want to check were created by different people, there is no single line of code that can detect broken links across a large number of sites.
I'd suggest creating a simple function for each site that detects whether links on that particular site are broken. When you want to check a link, you decide which function to run against the external site's HTML based on the domain name.
You can use parse_url() to extract the domain/host from the file links:
// Get your URL from the database. Here I'll just set it:
$file_url_from_database = 'http://example.com/link/to/file?var=1&hello=world#file';
$parsed_link = parse_url($file_url_from_database);
// Note: parse_url() returns false for seriously malformed URLs.
$domain = $parsed_link['host']; // $domain now equals 'example.com'
You could store the function names in an associative array and call them that way:
function check_domain_com($url){ ... }
function check_example_com($url){ ... }
$link_checkers = array();
$link_checkers['domain.com'] = 'check_domain_com';
$link_checkers['example.com'] = 'check_example_com';
or store the functions in the array (PHP >=5.3).
$link_checkers = array();
$link_checkers['domain.com'] = function($url){ ... };
$link_checkers['example.com'] = function($url){ ... };
and call these with
if (isset($link_checkers[$domain])) {
    // call the function stored under the index 'example.com'
    call_user_func($link_checkers[$domain], $file_url_from_database);
} else {
    throw new Exception("I don't know how to check the domain $domain");
}
Alternatively you could just use a bunch of if statements
if ($domain == 'domain.com')
    check_domain_com($file_url_from_database);
else if ($domain == 'example.com')
    check_example_com($file_url_from_database); // this function is called
The functions could return a boolean (true or false) for your script to use, or call another function themselves if needed (for example to add an extra CSS class to broken links).
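For example, a minimal version of the check_example_com() stub above might just issue an HTTP request and inspect the status line. This is only a sketch, and it assumes example.com answers with a real 404 for missing files:

// Sketch: treat the link as alive only if the final response is 2xx.
function check_example_com($url)
{
    $headers = @get_headers($url);
    if ($headers === false)
        return false; // the request itself failed

    // get_headers() follows redirects and returns a status line for
    // each hop; keep the last one, e.g. "HTTP/1.1 200 OK".
    $status = $headers[0];
    foreach ($headers as $header) {
        if (strpos($header, 'HTTP/') === 0)
            $status = $header;
    }
    return (bool) preg_match('#^HTTP/[\d.]+\s+2\d\d#', $status);
}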
I did something similar recently, though I was fetching metadata for stock photography from multiple sites. I used an abstract class because I had a few functions to run for each site.
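If you go that route, the structure could look roughly like this; a minimal sketch with made-up class and method names:

// One subclass per external site, all sharing the same interface.
abstract class LinkChecker
{
    // Return true if the link is still alive on this site.
    abstract public function isAlive($url);
}

class ExampleComChecker extends LinkChecker
{
    public function isAlive($url)
    {
        // Suppose example.com (hypothetically) answers 200 even for
        // removed files, so the page body has to be inspected instead.
        $html = @file_get_contents($url);
        return $html !== false && strpos($html, 'File not found') === false;
    }
}

The benefit is that shared plumbing (HTTP fetching, logging) can live in the base class while each subclass only encodes the site-specific rules.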
As a side note, it would be wise to store the last-checked date in your database and limit the checking rate to something like once every 24 or 48 hours (or less often, depending on your needs).
Edit to clarify implementation a little:
As making HTTP requests to other websites is potentially very slow, you will want to check and update link statuses independently of page loads. You could achieve this as follows:
- A script could run every 12 hours and check all links from the database that were last checked more than 24 hours ago. For each 'old' link, it would update the active and last_checked columns in your database appropriately (see the sketch after this list).
- When someone requests a page, your script would read from the active column in your database instead of downloading the remote page to check every time.
- (extra thought) When a new link is submitted, it is checked immediately in the script, or added to a queue to be checked by the server as soon as possible.
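Putting the first point together with the dispatch table from earlier, the periodic script could look roughly like this. It's a sketch only: the links table and the PDO connection details are assumptions (active and last_checked are the columns mentioned above):

// update_links.php -- run from cron, e.g. every 12 hours:
// 0 */12 * * * php /path/to/update_links.php
$db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'password');

// Only re-check links whose status is more than 24 hours old.
$stale = $db->query(
    "SELECT id, url FROM links
     WHERE last_checked < NOW() - INTERVAL 24 HOUR"
);

$update = $db->prepare(
    "UPDATE links SET active = ?, last_checked = NOW() WHERE id = ?"
);

foreach ($stale as $row) {
    $domain = parse_url($row['url'], PHP_URL_HOST);

    // Fall back to 'inactive' for domains we have no checker for.
    $active = isset($link_checkers[$domain])
        ? (int) call_user_func($link_checkers[$domain], $row['url'])
        : 0;

    $update->execute(array($active, $row['id']));
}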
As people can easily click a link to check its current state, it would be redundant to let them click a button to run the check from your page (nothing against the idea though).
Note that the potentially resource-heavy update-all script should not be executable (accessible) via web.
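A simple way to enforce that, assuming the script is run through PHP's command-line binary (e.g. from cron), is to bail out under any other SAPI:

// At the top of update_links.php: refuse to run via the web server.
if (php_sapi_name() !== 'cli') {
    header('HTTP/1.1 403 Forbidden');
    exit('This script can only be run from the command line.');
}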