php - PHP - Find all hyperlinks in a post and add target and rel=nofollow attributes

Question

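I need a way to read content posted by users, find any hyperlinks it may contain, turn them into anchor tags, and add target and rel=nofollow attributes to all of those links.

I have come across regex solutions like this one: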

 (?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

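However, other questions on SO about the same problem strongly advise against using regex and recommend PHP's DOMDocument instead.

Whichever approach is best, I need to add the attributes mentioned above to harden all external links on the site.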

score 2 · Accepted Answer

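First of all, the guidance you mention advises against using regular expressions to parse HTML. As I understand it, what you are trying to do is parse plain text from users and convert it to HTML. For that, regular expressions are usually fine.

(Note that I assume you are turning the text into links yourself and are not using an external library for that. In the latter case you would need to fix up the HTML the library outputs, and for that you should use DOMDocument to iterate over all <a> tags and add the appropriate attributes to them.)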
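For the library case mentioned in the note above, a minimal sketch of the DOMDocument approach might look like the following. The function name addLinkAttributes, the wrapping <div>, and the "_blank" target value are assumptions made for illustration; they do not come from the question or from any particular library.

```php
<?php
// Hypothetical helper: adds target and rel attributes to every <a> tag
// in an HTML fragment produced elsewhere (e.g. by a linkifying library).
function addLinkAttributes(string $html): string
{
    $doc = new DOMDocument();

    // Silence warnings about imperfect markup while loading, then restore the setting.
    $previous = libxml_use_internal_errors(true);
    // The XML encoding hint makes libxml read the input as UTF-8.
    // The <div> gives the fragment a single known container to read back from.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $a->setAttribute('target', '_blank'); // assumed value
        $a->setAttribute('rel', 'nofollow');
    }

    // The first <div> in document order is the wrapper added above.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling addLinkAttributes('See <a href="http://example.com">this</a>.') should return the same fragment with both attributes set on the anchor. Serializing only the wrapper's children keeps the <html> and <body> elements that loadHTML adds out of the result.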

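Now, you can do the parsing in one of two places: server side or client side.

Server side

Pros:

It outputs ready-to-use HTML.
It doesn't require users to have JavaScript enabled.

Cons:

You need to add the rel="nofollow" attribute so that bots don't follow the links.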

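Client side

Pros:

You don't need to add the rel="nofollow" attribute for bots, because they never see the links in the first place: the links are generated with JavaScript, and bots usually don't parse JavaScript.

Cons:

Creating links this way requires users to have JavaScript enabled.
Doing this kind of work in JavaScript can make the site feel slow, especially when there is a lot of text to parse.
It makes caching the parsed text difficult.

I'll focus on implementing it on the server side.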

Server-side implementation

So, to parse links out of user input and add whatever attributes you want, you can use something like this:

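<?php
// Wraps every URL found in plain text in an <a> tag with the extra attributes.
function replaceLinks($text)
{
    // The pattern is built piece by piece; each piece is explained below.
    $regex = '/'
      . '(?<!\S)'
      . '(((ftp|https?)?:?)\/\/|www\.)'
      . '(\S+?)'
      . '(?=$|\s|[,]|\.\W|\.$)'
      . '/m';

    return preg_replace_callback($regex, function($match)
    {
        // $match[0] is the whole matched URL; it is used as both href and link text.
        return '<a'
          . ' target=""'
          . ' rel="nofollow"'
          . ' href="' . $match[0] . '">'
          . $match[0]
          . '</a>';
    }, $text);
}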

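Explanation:

(?<!\S): not preceded by a non-whitespace character.
(((ftp|https?)?:?)\/\/|www\.): accepts ftp://, http://, https://, ://, // and www. as the start of a URL.
(\S+?): matches everything that is not whitespace, non-greedily.
(?=$|\s|[,]|\.\W|\.$): every URL must be followed by end of line, whitespace, a comma, a dot followed by a character other than \w (this allows .com, .co.jp and so on to match), or a dot followed by end of line.
m flag: makes $ match at the end of every line, so multi-line text is handled.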
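To see how the lookbehind and the lookahead interact, the snippet below runs the same pattern through preg_match_all on a sample sentence. The sample text is made up for illustration, and the pattern is repeated in one piece so the snippet runs on its own.

```php
<?php
// Same pattern as in replaceLinks(), written as a single string.
$regex = '/(?<!\S)(((ftp|https?)?:?)\/\/|www\.)(\S+?)(?=$|\s|[,]|\.\W|\.$)/m';

// Hypothetical sample input.
$text = 'See http://example.com/a.b, then www.example.org. Not me: user@www.example.com';

preg_match_all($regex, $text, $matches);

// $matches[0] holds the full matches:
//   http://example.com/a.b  (the trailing comma is left out by the lookahead)
//   www.example.org         (the sentence-ending dot is left out as well)
// user@www.example.com is skipped because "www." is preceded by "@".
print_r($matches[0]);
```

The dots inside example.com/a.b do not end the match because each is followed by a word character, while the dot after example.org is followed by a space and therefore satisfies \.\W in the lookahead.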

Tests

Now, to back up my claims, here are some test cases:

$tests = [];
$tests []= ['http://example.com', '<a target="" rel="nofollow" href="http://example.com">http://example.com</a>'];
$tests []= ['https://example.com', '<a target="" rel="nofollow" href="https://example.com">https://example.com</a>'];
$tests []= ['ftp://example.com', '<a target="" rel="nofollow" href="ftp://example.com">ftp://example.com</a>'];
$tests []= ['://example.com', '<a target="" rel="nofollow" href="://example.com">://example.com</a>'];
$tests []= ['//example.com', '<a target="" rel="nofollow" href="//example.com">//example.com</a>'];
$tests []= ['www.example.com', '<a target="" rel="nofollow" href="www.example.com">www.example.com</a>'];
$tests []= ['user@www.example.com', 'user@www.example.com'];
$tests []= ['testhttp://example.com', 'testhttp://example.com'];
$tests []= ['example.com', 'example.com'];
$tests []= [
    'test http://example.com',
    'test <a target="" rel="nofollow" href="http://example.com">http://example.com</a>'];
$tests []= [
    'multiline' . PHP_EOL . 'blah http://example.com' . PHP_EOL . 'test',
    'multiline' . PHP_EOL . 'blah <a target="" rel="nofollow" href="http://example.com">http://example.com</a>' . PHP_EOL . 'test'];
$tests []= [
    'text //example.com/slashes.php?parameters#fragment, some other text',
    'text <a target="" rel="nofollow" href="//example.com/slashes.php?parameters#fragment">//example.com/slashes.php?parameters#fragment</a>, some other text'];
$tests []= [
    'text //example.com. new sentence',
    'text <a target="" rel="nofollow" href="//example.com">//example.com</a>. new sentence'];

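Each test case consists of two parts: the source input and the expected output. I use the following code to check whether the function passes the tests above: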

foreach ($tests as $test)
{
    list ($source, $expected) = $test;
    $actual = replaceLinks($source);
    if ($actual != $expected)
    {
        echo 'Test ' . $source . ' failed.' . PHP_EOL;
        echo 'Expected: ' . $expected . PHP_EOL;
        echo 'Actual:   ' . $actual . PHP_EOL;
        die;
    }
}
echo 'All tests passed' . PHP_EOL;

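I think this gives you an idea of how to solve the problem. Feel free to add more tests and to experiment with the regex itself to make it fit your specific needs.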
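As one example of such an adjustment, the sketch below escapes the user's text before linking it and gives www. links an explicit scheme, since a bare www.example.com in href is resolved as a relative path. The function name replaceLinksEscaped, the "_blank" target, and the choice of http:// as the default scheme are assumptions, not part of the code above.

```php
<?php
// Hypothetical variant of replaceLinks(): escapes the input first and
// prefixes www. links with a scheme so the href is absolute.
function replaceLinksEscaped(string $text): string
{
    $regex = '/(?<!\S)(((ftp|https?)?:?)\/\/|www\.)(\S+?)(?=$|\s|[,]|\.\W|\.$)/m';

    // Escape first, so user-supplied markup is never emitted as HTML and
    // the matched URL can no longer contain a raw quote or angle bracket.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    return preg_replace_callback($regex, function ($match) {
        $url = $match[0];
        // Assumed default scheme for links that start with "www.".
        $href = (stripos($url, 'www.') === 0) ? 'http://' . $url : $url;

        return '<a target="_blank" rel="nofollow" href="' . $href . '">'
            . $url
            . '</a>';
    }, $escaped);
}
```

Because the output differs from replaceLinks() (a non-empty target and a scheme on www. links), the expected strings in the test cases above would need to be updated before they could be reused with this variant.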

score 1 · Answer

You might be interested in Goutte; it lets you define your own filters and so on.

Answered 2014-05-03T06:22:55.570

score 0 · Answer

Use jQuery to grab the content that is about to be posted and process it before sending it to PHP.

$('#idof_content').val(
  $('#idof_content').val().replace(/\b(http(s|):\/\/|)(www\.\S+)/ig,
    "<a href='http$2://$3' target='_blank' rel='nofollow'>$3</a>"));
