jquery - Text matching not working for Arabic, possibly due to the regex for Arabic

Question

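I have been working to add a feature to my multilingual website where I have to highlight the matching tag keywords.

This feature works for the English version but does not work for the Arabic version.

Sample code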

    function HighlightKeywords(keywords)
    {        
        var el = $("#article-detail-desc");
        var language = "ar-AE";
        var pid = 32;
        var issueID = 18; 
        $(keywords).each(function()
        {
           // var pattern = new RegExp("("+this+")", ["gi"]); //breaks html
            var pattern = new RegExp("(\\b"+this+"\\b)(?![^<]*?>)", ["gi"]); //looks for match outside html tags
            var rs = "<a class='ad-keyword-selected' href='http://www.alshindagah.com/ar/search.aspx?Language="+language+"&PageId="+pid+"&issue="+issueID+"&search=$1' title='Search website for:  $1'><span style='color:#990044; text-decoration:none;'>$1</span></a>";
            el.html(el.html().replace(pattern, rs));
        });
    }   

HighlightKeywords(["you","الهدف","طهران","سيما","حاليا","Hello","34","english"]);

//Popup Tooltip for article keywords
     $(function() {
        $("#article-detail-desc").tooltip({
        position: {
            my: "center bottom-20",
            at: "center top",
            using: function( position, feedback ) {
            $( this ).css( position );
            $( "<div>" )
            .addClass( "arrow" )
            .addClass( feedback.vertical )
            .addClass( feedback.horizontal )
            .appendTo( this );
        }
        }
        });
    });

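I store the keywords in an array and then match them against the text in a specific div.

I am not sure whether this is a Unicode issue or something else. Any help with this is appreciated.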

score 8 · Accepted Answer

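This answer has three parts:

Why it isn't working
An example of how to handle it in English (meant to be adapted to Arabic by someone with a clue about Arabic)
A stab at an Arabic version by someone (me) who hasn't a clue about Arabic :-)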

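Why it isn't working

At least part of the problem is that you're relying on the \b assertion, which (like its counterparts \B, \w, and \W) is English-centric. You can't rely on it in other languages (or even, really, in English; see below).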

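Here's the definition of \b in the spec:

The production Assertion :: \ b evaluates by returning an internal AssertionTester closure that takes a State argument x and performs the following:

Let e be x's endIndex.

Call IsWordChar(e–1) and let a be the Boolean result.

Call IsWordChar(e) and let b be the Boolean result.

If a is true and b is false, return true.

If a is false and b is true, return true.

Return false.

...where IsWordChar is further defined as basically meaning one of these 63 characters: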

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9 _

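That is, the 26 English letters a to z in upper or lower case, the digits 0 to 9, and _. (This means you can't even rely on \b, \B, \w, or \W in English, because English has loan words like "Voilà", but that's another story.)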
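A quick way to see this is to run the same \b pattern against an English and an Arabic sentence. This is only a small sketch; the sample sentences and variable names are made up for illustration.

```javascript
// \b only treats A-Z, a-z, 0-9 and _ as word characters, so it never
// finds a boundary between a space and an Arabic letter.
var english = /\bHello\b/.test("say Hello now");
var arabic = /\bطهران\b/.test("في طهران اليوم");

// \w stops at the accented letter of an English loan word.
var loanWord = "Voilà!".match(/\w+/)[0];

console.log(english);  // true
console.log(arabic);   // false
console.log(loanWord); // "Voil"
```

The Arabic test fails even though the word is plainly there, because both sides of every position in that sentence are "non-word" characters as far as IsWordChar is concerned.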

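An example using English first

You'll have to use a different mechanism for detecting word boundaries in Arabic. If you can come up with a character class that includes all of the Arabic "code points" (as Unicode puts it) that make up words, you could use code a bit like this: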

var keywords = {
    "laboris": true,
    "laborum": true,
    "pariatur": true
    // ...and so on...
};
var text = /*... get the text to work on... */;
text = text.replace(
    /([abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)([^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)?/g,
    replacer);

function replacer(m, c0, c1) {
    if (keywords[c0]) {
        c0 = '<a href="#">' + c0 + '</a>';
    }
    // c1 is undefined when no non-word text follows the word
    return c0 + (c1 || "");
}

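Notes on that:

I use the class [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_] to mean "a word character". Obviously you'd have to change this (markedly) for Arabic.
I use the class [^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_] to mean "not a word character". It's the same as the previous class with the negation (^) at the beginning.
The regular expression finds any series of "word characters" followed by an optional series of non-word characters, using capture groups ((...)) for both.
String#replace calls the replacer function with the full text that matched, followed by each capture group as arguments.
The replacer function looks up the first capture group (the word) in the keywords map to see if it's a keyword. If so, it wraps it in an anchor.
The replacer function returns that possibly-wrapped word plus the non-word text that followed it.
String#replace uses the return value from replacer to replace the matched text.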
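To make the argument order concrete, here is a small sketch that records what String#replace passes to the callback. The logArgs function and the calls array are hypothetical names used only for this illustration, and the class is shortened to [a-z] to keep the pattern readable.

```javascript
// Collect the arguments String#replace hands to the callback.
var calls = [];

function logArgs(m, c0, c1) {
    calls.push({ match: m, word: c0, rest: c1 });
    return m; // return the match unchanged so the text is not altered
}

"ut laboris nisi".replace(/([a-z]+)([^a-z]+)?/g, logArgs);

calls.forEach(function (c) {
    console.log(JSON.stringify(c.match), c.word, JSON.stringify(c.rest));
});
```

Note that the last call gets undefined for the second capture group, because nothing follows the final word. A replacer that concatenates that argument should substitute an empty string in that case.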

Here's a complete example:

<!DOCTYPE html>
<html>
<head>
<meta charset=utf-8 />
<title>Replacing Keywords</title>
</head>
<body>
  <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
  
  <script src="http://code.jquery.com/jquery-1.9.1.min.js"></script>
  <script>
    (function() {
      // Our keywords. There are lots of ways you can produce
      // this map, here I've just done it literally
      var keywords = {
        "laboris": true,
        "laborum": true,
        "pariatur": true
      };
      
      // Loop through all our paragraphs (okay, so we only have one)
      $("p").each(function() {
        var $this, text;
        
        // We'll use jQuery on `this` more than once,
        // so grab the wrapper
        $this = $(this);
        
        // Get the text of the paragraph
        // Note that this strips off HTML tags, a
        // real-world solution might need to loop
        // through the text nodes rather than act
        // on the full text all at once
        text = $this.text();

        // Do the replacements
        // These character classes match JavaScript's
        // definition of a "word" character and so are
        // English-centric, obviously you'd change that
        text = text.replace(
          /([abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)([^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)?/g,
          replacer);
        
        // Update the paragraph
        $this.html(text);
      });

      // Our replacer. We define it separately rather than
      // inline because we use it more than once      
      function replacer(m, c0, c1) {
        // Is the word in our keywords map?
        if (keywords[c0]) {
          // Yes, wrap it
          c0 = '<a href="#">' + c0 + '</a>';
        }
        // c1 is undefined when no non-word text follows the word
        return c0 + (c1 || "");
      }
    })();
  </script>
</body>
</html>
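One caveat with reading via .text() and writing back via .html(): any literal < or & in the paragraph text is re-parsed as markup. The following is a minimal sketch of one way around that, not part of the original answer. The escapeHtml and highlight helpers are hypothetical names, and the pattern is changed so that every character falls into either the word group or the non-word group.

```javascript
var keywords = { "laboris": true };

// Hypothetical helper: escape the characters that matter in HTML text.
function escapeHtml(s) {
    return s.replace(/&/g, "&amp;")
            .replace(/</g, "&lt;")
            .replace(/>/g, "&gt;")
            .replace(/"/g, "&quot;");
}

// Each match is either a run of word characters or a run of anything else.
function highlight(text) {
    return text.replace(/([A-Za-z0-9_]+)|([^A-Za-z0-9_]+)/g, function (m, word, other) {
        if (word === undefined) {
            return escapeHtml(other); // non-word run: escape it
        }
        // Own-property check so inherited names like "constructor" don't match.
        if (Object.prototype.hasOwnProperty.call(keywords, word)) {
            return '<a href="#">' + word + '</a>';
        }
        return word; // word characters here never need escaping
    });
}

console.log(highlight("a < b & laboris"));
```

The result of highlight could then be passed to .html() in place of the raw replaced text.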

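A stab at doing it in Arabic

I took a stab at an Arabic version. According to the "Arabic script in Unicode" page on Wikipedia, several code ranges are used, but all of the text in your example falls in the primary range of U+0600 to U+06FF.

Here's what I came up with: Fiddle (I prefer JSBin, which I used above, but I couldn't get the text to come out the right way around.)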

(function() {
    // Our keywords. There are lots of ways you can produce
    // this map, here I've just done it literally
    var keywords = {
        "الهدف": true,
        "طهران": true,
        "سيما": true,
        "حاليا": true
    };
    
    // Loop through all our paragraphs (okay, so we only have two)
    $("p").each(function() {
        var $this, text;
        
        // We'll use jQuery on `this` more than once,
        // so grab the wrapper
        $this = $(this);
        
        // Get the text of the paragraph
        // Note that this strips off HTML tags, a
        // real-world solution might need to loop
        // through the text nodes rather than act
        // on the full text all at once
        text = $this.text();
        
        // Do the replacements
        // These character classes just use the primary
        // Arabic range of U+0600 to U+06FF, you may
        // need to add others.
        text = text.replace(
            /([\u0600-\u06ff]+)([^\u0600-\u06ff]+)?/g,
            replacer);
        
        // Update the paragraph
        $this.html(text);
    });
    
    // Our replacer. We define it separately rather than
    // inline because we use it more than once      
    function replacer(m, c0, c1) {
        // Is the word in our keywords map?
        if (keywords[c0]) {
            // Yes, wrap it
            c0 = '<a href="#">' + c0 + '</a>';
        }
        // c1 is undefined when no non-word text follows the word
        return c0 + (c1 || "");
    }
})();

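All I did to the English function above was:

Use [\u0600-\u06ff] as the "word character" class and [^\u0600-\u06ff] as the "not a word character" class. You may need to add some of the other ranges listed on that Wikipedia page (such as the appropriate style of numerals), but again, all of the text in your example falls within that range.
Change the keywords to the Arabic ones from your example (only two of which seem to appear in the text).

To my very much non-Arabic-reading eyes, it seems to work.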
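One thing to watch with the plain U+0600 to U+06FF range is that it also contains Arabic punctuation, such as the Arabic comma (U+060C), so a keyword directly followed by that comma would not be found. As an alternative sketch, assuming an engine that supports ES2018 Unicode property escapes (the u flag), word characters can be described by category instead of by range. The sample sentence and the highlight name are made up for illustration.

```javascript
var keywords = { "طهران": true, "حاليا": true };

// Sample text with an Arabic comma (U+060C) right after the first keyword.
var sample = "وصل إلى طهران\u060C حاليا.";

// Range-based tokens: the comma is glued onto the word before it.
var rangeTokens = sample.match(/[\u0600-\u06ff]+/g);

// Assumption: letters, combining marks, digits and _ count as word characters.
var wordOrOther = /([\p{L}\p{M}\p{N}_]+)|([^\p{L}\p{M}\p{N}_]+)/gu;

function highlight(text) {
    return text.replace(wordOrOther, function (m, word) {
        if (word !== undefined &&
            Object.prototype.hasOwnProperty.call(keywords, word)) {
            return '<a href="#">' + word + '</a>';
        }
        return m; // non-keyword words and non-word runs pass through unchanged
    });
}

var out = highlight(sample);
console.log(rangeTokens[2].length); // 6: five letters plus the comma
console.log(out);
```

With the category-based class both keywords are wrapped, and the comma stays outside the link.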
