
I am working on a project that lists file-sharing URLs from hosts such as Oron, filepost, depositfiles and the like, and reports the sharing of copyrighted material to the content owners and rights holders identified in my network.

To improve the service, which currently lives in a table populated from a MySQL database with some filters built in PHP, I would like to be able to identify links that have stopped working.

My idea is that when the data is retrieved from the MySQL database, the entries in the download-URL column (the URL of the file or of the file host's page) would be checked to see whether they still lead to a working file-sharing page where the user can start the download. If a link works and offers the ability to download the file, it should stay, and its link text or cell colour should turn green; if the file site says the file was not found or similar, the link text or cell background colour should turn red.

At the moment there is no quick and simple visual indication of which links are active and which are not.

I implemented a simple validation of the URLs based on whether they return a 404 error, but quickly realised this would not work, since these sites do not issue 404s or redirects; instead they serve dynamically generated pages saying the file is unavailable, the file has been removed, and so on.

I have also incorporated a link-checker script that uses a third-party file-sharing link-checking service, but that requires checking the links manually and updating the database by hand.

I have also tried checking whether specific fields or words can be found on the page, but given the range of sites and the variety of wording they use, this has proved inaccurate and difficult to apply across all the links.

It would also be helpful to be able to filter the URLs by their active state. I am guessing that if the colour change were managed by a class on the link or cell, I could filter the column by class, e.g. link-dead or link-active. I think I can manage that part myself, so help with the class-based filtering is not strictly necessary.
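For reference, the kind of output I have in mind is something like this (a rough sketch; the `active` flag and the class names are just placeholders):

```php
<?php
// Rough sketch: emit each URL cell with a class derived from an
// `active` flag. The flag and class names are placeholders.
function link_cell($url, $is_active)
{
    $class = $is_active ? 'link-active' : 'link-dead';
    return '<td class="' . $class . '"><a href="'
         . htmlspecialchars($url) . '">'
         . htmlspecialchars($url) . '</a></td>';
}
```

CSS such as `.link-active a { color: green; }` and `.link-dead { background-color: red; }` would then handle the colouring, and the same class could serve as the hook for filtering the column.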

Any help would be greatly appreciated.


1 Answer


Since the sites you want to check were created by different people, there is no single one-liner that can detect broken links across a large number of sites.

I would suggest creating a simple function for each site that detects whether a link on that particular site is broken. When you want to check a link, you decide from the domain name which function to run on the external site's HTML.

You can use parse_url() to extract the domain/host from the file links:

// Get your url from the database. Here I'll just set it:
$file_url_from_database = 'http://example.com/link/to/file?var=1&hello=world#file';

$parsed_link = parse_url($file_url_from_database);
$domain = $parsed_link['host']; // $domain now equals 'example.com'

You could store the function names in an associative array and call them that way:

function check_domain_com(){ ... }
function check_example_com(){ ... }

$link_checkers = array();
$link_checkers['domain.com'] = 'check_domain_com';
$link_checkers['example.com'] = 'check_example_com';

or, with PHP >= 5.3, store anonymous functions in the array directly:

$link_checkers = array();
$link_checkers['domain.com'] = function(){ ... };
$link_checkers['example.com'] = function(){ ... };

and call these with

if(isset($link_checkers[$domain]))
    // call the function stored under the index 'example.com'
    call_user_func($link_checkers[$domain]); 
else
    throw( new Exception("I don't know how to check the domain $domain") );

Alternatively, you could just use a chain of if statements:

if($domain == 'domain.com')
    check_domain_com();
else if($domain == 'example.com')
    check_example_com(); // this function is called

The functions could return a boolean (true or false; 0 or 1) for you to use, or call another function themselves if needed (for example, to add an extra CSS class to broken links).
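As an illustration, a per-site checker might fetch the page with cURL and look for that host's "file removed" message. A minimal sketch, where the marker string 'File Not Found' and the function names are hypothetical and would need adjusting per host:

```php
<?php
// Pure helper: decide from the fetched HTML whether the file is gone.
// The marker string is an assumption; inspect the real host's
// "file removed" page and adjust it for each site.
function example_com_reports_dead($html)
{
    return strpos($html, 'File Not Found') !== false;
}

// Fetch the page and return true if the link still works.
function check_example_com($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // don't hang on slow hosts
    $html = curl_exec($ch);
    curl_close($ch);

    // Network failure: treat as dead (or queue for a retry).
    return $html !== false && !example_com_reports_dead($html);
}
```

Keeping the "is this HTML a dead-file page?" decision in its own function makes each host's rule easy to test without hitting the network.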

I did something similar recently, though I was fetching metadata for stock photography from multiple sites. I used an abstract class because I had a few functions to run for each site.

As a side note, it would be wise to store the last-checked date in your database and limit the checking rate to something like once every 24 or 48 hours (or further apart, depending on your needs).
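That rate limit can be as simple as comparing timestamps before deciding to re-check. A sketch, where the 24-hour default and the null-means-never-checked convention are assumptions:

```php
<?php
// Decide whether a link is due for a re-check. Timestamps are Unix
// epoch seconds; $last_checked is null for links never checked before.
function needs_recheck($last_checked, $now, $interval = 86400)
{
    return $last_checked === null || ($now - $last_checked) >= $interval;
}
```

If your column is a MySQL DATETIME, something like `needs_recheck(strtotime($row['last_checked']), time())` would convert it first, or you could push the comparison into the query itself with `last_checked < NOW() - INTERVAL 24 HOUR`.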


Edit to clarify implementation a little:

As making HTTP requests to other websites is potentially very slow, you will want to check and update link statuses independently of page loads. You could achieve that as follows:

  • A script could run every 12 hours and check all links from the database that were last checked more than 24 hours ago. For each 'old' link, it would update the active and last_checked columns in your database appropriately.
  • When someone requests a page, your script would read from the active column in your database instead of downloading the remote page to check every time.
  • (extra thought) When a new link is submitted, it is checked immediately in the script, or added to a queue to be checked by the server as soon as possible.
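The periodic run from the first bullet could be wired up with cron, keeping the script itself outside the web root; the paths here are illustrative:

```shell
# Illustrative crontab entry: run the link checker every 12 hours.
# /var/www/private/ stands in for any directory not served by the web server.
0 */12 * * * /usr/bin/php /var/www/private/check_links.php >> /var/log/check_links.log 2>&1
```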

As people can easily click a link to check its current state, it would be redundant to let them click a button on your page to trigger a check (nothing against the idea, though).

Note that the potentially resource-heavy update-all script should not be executable (accessible) via web.

answered 2012-06-23T10:54:35.770