javascript - Using a large number of terms, search through page text and replace words with links

Question

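A while ago I posted this question, asking whether it is possible to convert pieces of text into HTML links if they match terms from a list in my database.

I have a fairly large list of terms, about 6,000.

The accepted answer to that question was excellent, but I had never used XPath, and when problems started to appear I was out of my depth. At one point, after fiddling with the code, I managed to add more than 40,000 random characters to our database, most of which had to be removed by hand. Since then I have lost faith in that idea, and the simpler PHP solutions are simply not efficient enough to handle the amount of data and the number of terms.

My next attempt at a solution is to write a JS script that retrieves the terms after the page has loaded and matches them against the text on the page.

This answer has an idea that I would like to try.

I would use AJAX to retrieve the terms from the database and build an object like this: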

var words = [
    {
        word: 'Something',
        link: 'http://www.something.com'
    },
    {
        word: 'Something Else',
        link: 'http://www.something.com/else'
    }
];

Once the object is built, I would use this kind of code:

//for each array element
$.each(words,
    function() {
        //store it ("this" is gonna become the dom element in the next function)
        var search = this;
        $('.message').each(
            function() {
                //if it's exactly the same
                if ($(this).text() === search.word) {
                    //do your magic tricks
                    $(this).html('<a href="' + search.link + '">' + search.link + '</a>');
                }
            }
        );
    }
);

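Now, at first glance there is one major problem: with 6,000 terms, will this code be anywhere near efficient enough for what I am trying to do?

One option might be to do some of the work in the PHP script that the AJAX call talks to. For example, I could send the ID of the post, and the PHP script could use an SQL statement to retrieve the post and match it against all 6,000 terms. The response to the JavaScript call could then contain just the matching terms, which would significantly reduce the number of matches the jQuery above has to make (to about 50 at most).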
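
A minimal sketch of that pre-filtering step, written in JavaScript only to match the rest of the code here (in practice the same logic would live in the PHP script). The name filterTerms is hypothetical, and plain case-insensitive substring matching is an assumption, so it will also report terms that occur inside longer words:

```javascript
// Return only the entries whose word occurs somewhere in the post text.
// Assumption: entries have the same { word, link } shape as the words array.
function filterTerms(postText, words) {
    var haystack = postText.toLowerCase(), matches = [], i;
    for (i = 0; i < words.length; i += 1) {
        if (haystack.indexOf(words[i].word.toLowerCase()) !== -1) {
            matches.push(words[i]);
        }
    }
    return matches;
}
```

The client would then only have to loop over the handful of entries returned for that post instead of the full list.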

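I have no problem with the script "loading" for a few seconds in the user's browser, as long as it doesn't hammer their CPU or anything like that.

So, two questions in one:

Can I make this work?
What steps can I take to make it as efficient as possible?

Thanks in advance,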

score 2 · Accepted Answer

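You can cache the results at insert time.

Basically, when someone inserts a new post, you run the replacement process on it instead of just inserting it into the database.

If your posts are stored in the database like this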

Table: Posts
id        post
102       "Google is a search engine"

you can create another table

Table: cached_Posts
id       post_id   date_generated   cached_post
1        102       2012-10-10       "<a href="http://google.com">Google</a> is a search engine"

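When retrieving a post, first check whether it exists in the cached_Posts table.

The reason to keep the original is that you might later add a new keyword to replace. All you then have to do is regenerate the cache.

This way no client-side JS is needed, and you only do the work once per post, so results should come back quickly.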
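
A rough sketch of that flow, with plain in-memory objects standing in for the Posts and cached_Posts tables (an assumption; a real application would use SQL queries). createPostStore and its methods are hypothetical names, and the replacement routine is passed in as linkify so any implementation can be plugged in:

```javascript
// linkify: function (text) -> text with terms replaced by links.
function createPostStore(linkify) {
    var posts = {};   // stands in for Posts: post_id -> original text
    var cached = {};  // stands in for cached_Posts: post_id -> cache row
    var has = Object.prototype.hasOwnProperty;

    // Run the replacement once and remember the result.
    function regenerate(id) {
        cached[id] = {
            date_generated: new Date(),
            cached_post: linkify(posts[id])
        };
        return cached[id].cached_post;
    }

    return {
        // Store the original and build the cached version at insert time.
        insertPost: function (id, text) {
            posts[id] = text;
            return regenerate(id);
        },
        // Serve from the cache; rebuild only if the cache row is missing.
        getPost: function (id) {
            if (!has.call(posts, id)) {
                return undefined;
            }
            return has.call(cached, id) ? cached[id].cached_post : regenerate(id);
        },
        // Call this after adding a new keyword; originals are kept.
        invalidateAll: function () {
            cached = {};
        }
    };
}
```

Because the originals are never overwritten, invalidateAll is safe to call at any time: each post is simply re-linked the next time it is requested.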

score 1 · Accepted Answer

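Here is something relatively simple that I came up with. Sorry, it has not been thoroughly tested, and there has been no performance testing either. I'm sure it can be optimized further, I just don't have the time to do it. I added some comments to make it easier to follow. http://pastebin.com/nkdTSvi6 It may be a bit long for StackOverflow, but I'll post it here anyway. The pastebin is just for more comfortable viewing.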

function buildTrie(hash) {
    "use strict";
    // A very simple function to build a Trie
    // we could compress this later, but simplicity
    // is better for this example. If we don't
    // perform well, we'll try to optimize this a bit
    // there is a room for optimization here.
    var p, result = {}, leaf, i;
    for (p in hash) {
        if (hash.hasOwnProperty(p)) {
            leaf = result;
            i = 0;
            do {
                if (p[i] in leaf) {
                    leaf = leaf[p[i]];
                } else {
                    leaf = leaf[p[i]] = {};
                }
                i += 1;
            } while (i < p.length);
            // since, obviously, no character
            // equals to empty character, we'll
            // use it to store the reference to the
            // original value
            leaf[""] = hash[p];
        }
    }
    return result;
}

function prefixReplaceHtml(html, trie) {
    "use strict";
    var i, len = html.length, result = [], lastMatch = 0,
        current, leaf, match, matched, replacement;
    for (i = 0; i < len; i += 1) {
        current = html[i];
        if (current === "<") {
            // don't check for out of bounds access
            // assume we never face a situation, when
            // "<" is the last character in an HTML
            if (match) {
                result.push(
                    html.substring(lastMatch, i - matched.length),
                    "<a href=\"", match, "\">", replacement, "</a>");
                lastMatch = i - matched.length + replacement.length;
                i = lastMatch - 1;
            } else {
                if (matched) {
                    // go back to the second character of the
                    // matched string and try again
                    i = i - matched.length;
                }
            }
            matched = match = replacement = leaf = "";
            if (html[i + 1] === "a") {
                // we want to skip replacing inside
                // anchor tags. We also assume they
                // are never nested, as valid HTML is
                // against that idea
                if (html[i + 2] in
                    { " " : 1, "\t" : 1, "\r" : 1, "\n" : 1 }) {
                    // this is certainly an anchor
                    i = html.indexOf("</a", i + 3) + 3;
                    continue;
                }
            }
            // if we got here, it's a regular tag, just look
            // for terminating ">"
            i = html.indexOf(">", i + 1);
            continue;
        }
        // if we got here, we need to start checking
        // for the match in the trie
        if (!leaf) {
            leaf = trie;
        }
        leaf = leaf[current];
        // we prefer longest possible match, just like POSIX
        // regular expressions do
        if (leaf && ("" in leaf)) {
            match = leaf[""];
            replacement = html.substring(
                i - (matched ? matched.length : 0), i + 1);
        }
        if (!leaf) {
            // newby-style inline (all hand work!) pay extra
            // attention, this code is duplicated few lines above
            if (match) {
                result.push(
                    html.substring(lastMatch, i - matched.length),
                    "<a href=\"", match, "\">", replacement, "</a>");
                lastMatch = i - matched.length + replacement.length;
                i = lastMatch - 1;
            } else {
                if (matched) {
                    // go back to the second character of the
                    // matched string and try again
                    i = i - matched.length;
                }
            }
            matched = match = replacement = "";
        } else if (matched) {
            // perhaps a bit premature, but we'll try to avoid
            // string concatenation, when we can.
            matched = html.substring(i - matched.length, i + 1);
        } else {
            matched = current;
        }
    }
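    // append whatever follows the last replacement; without this the
    // tail of the document (or all of it, if nothing matched) is lost
    result.push(html.substring(lastMatch));
    return result.join("");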
}

function testPrefixReplace() {
    "use strict";
    var trie = buildTrie(
        { "x" : "www.xxx.com", "yyy" : "www.y.com",
          "xy" : "www.xy.com", "yy" : "www.why.com" });
    return prefixReplaceHtml(
        "<html><head>x</head><body><a >yyy</a><p>" +
            "xyyy yy x xy</p><abrval><yy>xxy</yy>", trie);
}
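
For comparison, the core longest-match idea can be shown in a much smaller form that handles plain text only (no tags, no anchor skipping, no word boundaries). This is a sketch under those assumptions, not a replacement for the code above; buildSimpleTrie and linkLongestMatches are hypothetical names:

```javascript
// Build a character trie from { term: url }; the "" key marks the end
// of a term and holds its url.
function buildSimpleTrie(terms) {
    var root = {}, term, node, i;
    for (term in terms) {
        if (Object.prototype.hasOwnProperty.call(terms, term)) {
            node = root;
            for (i = 0; i < term.length; i += 1) {
                node = node[term[i]] || (node[term[i]] = {});
            }
            node[""] = terms[term];
        }
    }
    return root;
}

// Scan left to right; at each position take the longest term that
// starts there, otherwise copy one character and move on.
function linkLongestMatches(text, trie) {
    var out = [], i = 0, j, node, url, end;
    while (i < text.length) {
        node = trie;
        url = null;
        end = -1;
        for (j = i; j < text.length; j += 1) {
            node = node[text[j]];
            if (!node) {
                break;
            }
            if (typeof node[""] === "string") {
                url = node[""];
                end = j + 1;
            }
        }
        if (url !== null) {
            out.push('<a href="', url, '">', text.substring(i, end), '</a>');
            i = end;
        } else {
            out.push(text[i]);
            i += 1;
        }
    }
    return out.join("");
}
```

The work per character is bounded by the length of the longest term rather than by the number of terms, which is what makes a trie attractive for a list of several thousand entries.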

score 1 · Accepted Answer

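As invertedSpear said, you shouldn't give up on PHP just because you couldn't get it to work. A JavaScript solution, while it takes load off the server, may well feel slower to the end user. You can also always cache a server-side solution, which you can't really do on the client side.

That said, here are my thoughts on your JavaScript. I haven't tried anything like this myself, so I can't comment on whether you can make it work, but there are a few things I think could be problematic:

jQuery's $.each() function, while very useful, is not very efficient. Try running this benchmark and you'll see what I mean: http://jsperf.com/jquery-each-vs-for-loops/9
If you run $('.message') on every iteration of the loop, you may end up doing a lot of fairly expensive DOM traversal. If possible, cache the result of that operation in a variable before you start looping over words.
Are you relying on every instance of your "search" text being wrapped in an element with the class message, with no other text around it? Because that is what your if ($(this).text() === search.word) { line implies. In your other question you seemed to suggest that there is more text around the terms you want to replace, in which case you may want to look at regular expressions to do the replacement. You will also need to make sure the text is not already inside an <a> tag.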
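
The points above can be sketched as follows: a plain for loop, one regular expression built once for all terms, and whole-word, case-insensitive matching. This is only a sketch; escapeRegExp, buildMatcher and linkifyText are hypothetical names, and it assumes a non-empty list of terms that start and end with ASCII letters or digits so that \b behaves as expected:

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one alternation regex for all terms, plus a word -> link lookup.
// Assumes words is non-empty.
function buildMatcher(words) {
    var lookup = Object.create(null), sources = [], sorted, i;
    // Longest terms first, so "Something Else" wins over "Something".
    sorted = words.slice().sort(function (a, b) {
        return b.word.length - a.word.length;
    });
    for (i = 0; i < sorted.length; i += 1) {
        lookup[sorted[i].word.toLowerCase()] = sorted[i].link;
        sources.push(escapeRegExp(sorted[i].word));
    }
    return {
        regex: new RegExp("\\b(?:" + sources.join("|") + ")\\b", "gi"),
        lookup: lookup
    };
}

// Replace every whole-word occurrence in a piece of plain text.
function linkifyText(text, matcher) {
    return text.replace(matcher.regex, function (found) {
        return '<a href="' + matcher.lookup[found.toLowerCase()] + '">' +
            found + '</a>';
    });
}
```

In the browser this would be applied to the text of each .message element (selected once, outside any loop), and only to text that is not already inside an <a> element; that DOM part is left out here.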

score 0 · Accepted Answer

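If you have access to the database for both the messages and the word list, I really recommend doing all of this in PHP. It can be done in JS, but it works much better as a server-side script.

In JS, basically, you have to

load the messages
load the "dictionary"
loop over every word in the dictionary
- look for matches in the DOM (ouch)
  - replace

The first 2 points are requests, which adds considerable overhead. The loop will put a load on the client's CPU.

Why I recommend doing this as server-side code:

Servers are better suited to this kind of work
JS runs in the client's browser. Every client is different (for example, someone may be using a slow IE, or someone may be on a smartphone)

This is easy to do in PHP.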

<?php
    $dict[] = array('word' => 'dolor', 'link' => 'DOLORRRRRR');
    $dict[] = array('word' => 'nulla', 'link' => 'NULLAAAARRRR');

    //  Pretty sure there's a more efficient way to separate an array.. my PHP is rusty, sorry. 
    $terms = array();
    $replace = array();
    foreach ($dict as $v) {
        // If you want to make sure it's a complete word, add a space to the term. 
        $terms[] = ' ' . $v['word'] . ' ';
        $replace[] = ' '. $v['link'] . ' ';
    }

    $text = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

    echo str_replace($terms, $replace, $text);


    /* Output: 
    Lorem ipsum DOLORRRRRR sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure DOLORRRRRR in reprehenderit in voluptate velit esse cillum dolore eu fugiat NULLAAAARRRR pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    */

?>

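This script is very basic, though: it will not match different letter cases.

What I would do:

If PHP performance really hits you hard (which I doubt), you can do the replacement once and save the result. Then, when you add a new word, delete the cache and regenerate it (you could write a cron job to do this).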

score 0 · Accepted Answer

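You can do anything; the question is: is it worth the time you put in?

Step 1: drop the AJAX requirement. Ajax is for interacting with the server: submitting small amounts of data and getting a response. It is not suited to what you want.

Step 2: drop the JS requirement. JS is for interacting with the user; you just want to deliver a block of text with some words replaced by links, and that should be handled on the server side.

Step 3: focus on PHP, and if it isn't efficient, attack that. Look for ways to make it more efficient. What have you tried in PHP? Why wasn't it efficient?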
