javascript - Regex to match domain.com but not @domain.com

Question

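This should be simple, but it is eluding me. There are plenty of good and bad regex approaches for matching a URL, with or without a protocol, with or without www. The problem I have (in JavaScript) is this: if I use a regex to match URLs in a text string and set it up to match just "domain.com", it also catches the domain of an email address (the part after the '@'), which I don't want. A negative lookbehind solves it, but apparently not in JS.

This is my nearest success so far: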

 /^(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g

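But it fails if the match isn't at the start of the string. And I'm sure I'm going about it the wrong way. Is there a simple answer out there?

EDIT: Revised the regex in response to some comments below (sticking with "www" rather than allowing subdomains):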

\b(www\.)?([^@])(\w*\.)(\w{2,3})(\.\w{2,3})?(\/\S*)?$

However, as mentioned in the comments, this still matches the domain after an @.

Thanks

score 1 · Accepted Answer

"it fails if the match isn't at the start of the string"

That's because of the ^ at the start of the pattern. Without it:

/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g

js> "www.foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
["www.foobar.com"]
js> "aoeuaoeu foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
[" foobar.com"]
js> "toto@aoeuaoeu foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
[" foobar.com"]
js> "toto@aoeuaoeu toto@foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
["foobar.com"]

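It still matches the space before the domain, though, because ([^@]) consumes one character. It also makes wrong assumptions about domains: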

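xyz.example.org is a valid domain that your regex doesn't match;
www.3x4mpl3.org is a valid domain that your regex doesn't match;
example.co.uk is a valid domain that your regex doesn't match;
ουτοπία.δπθ.gr is a valid domain that your regex doesn't match.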
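
The four claims above can be checked directly. This is a minimal sketch that assumes the asker's original, fully anchored pattern (with both ^ and $, and without the g flag so that test() keeps no state); the names askerRx, samples and results are made up for illustration.

```javascript
// The asker's original pattern, anchored at both ends.
var askerRx = /^(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/;

var samples = [
  "www.foobar.com",   // the shape the pattern was written for
  "xyz.example.org",  // subdomain other than www
  "www.3x4mpl3.org",  // digits inside a label
  "example.co.uk",    // second-level TLD other than .au
  "ουτοπία.δπθ.gr"    // non-ASCII labels
];

// true where the whole string is accepted, false otherwise
var results = samples.map(function (s) { return askerRx.test(s); });
console.log(results);
```

Only the first sample is accepted; the other four are rejected even though they are valid domains.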

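What defines a legal domain name? It is just a sequence of UTF-8 characters separated by dots. It cannot have two dots in a row, and the canonical form is \w\.\w\w (because I don't think any single-letter TLD exists).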
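
Those rules can be sketched as a small check. It is only an illustration of the loose definition given here (dot-separated, no empty labels, last label at least two characters), not a validator for real DNS rules; the function name looksLikeDomain is hypothetical.

```javascript
// Loose check following the description above, not the DNS specification.
function looksLikeDomain(name) {
  var labels = name.split('.');
  if (labels.length < 2) return false;          // needs at least one dot
  for (var i = 0; i < labels.length; i++) {
    // an empty label means two dots in a row, or a leading/trailing dot
    if (labels[i].length === 0) return false;
  }
  // assumption taken from the text: no single-letter TLD
  return labels[labels.length - 1].length >= 2;
}

console.log(looksLikeDomain("example.org"));
console.log(looksLikeDomain("example..org"));
```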

The way I would do it, though, is to simply match everything that looks like a domain, by taking all dot-separated text between word boundaries (\b):

/\b(\w+\.)+\w+\b/g

js> "aoe toto.example.org  uaoeu foo.bar aoeuaoeu".match(/\b(\w+\.)+\w+\b/g)
["toto.example.org", "foo.bar"]
js> "aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu".match(/\b(\w+\.)+\w+\b/g)
["example.org", "toto.example.org", "foo.bar"]
js> "aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/\b(\w+\.)+\w+\b/g)
["example.org", "toto.example.org", "foo.bar", "f00bar.com"]

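Then do a second pass over the list of matches to check whether each domain really exists. The downside is that regexes in JavaScript cannot check for Unicode characters, and neither \b nor \w will accept ουτοπία.δπθ.gr as a valid domain name.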
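
One way to read that second pass is a filter over the candidates. The sketch below assumes "really exists" can be approximated by checking the last label against a list of accepted TLDs; the list contents and the name keepKnownDomains are hypothetical, and a real check might use a DNS lookup or a public suffix list instead.

```javascript
// First pass: anything that looks like a domain. Second pass: keep only
// candidates whose last label is in the caller-supplied TLD list.
function keepKnownDomains(text, tlds) {
  var candidates = text.match(/\b(\w+\.)+\w+\b/g) || []; // null when nothing matches
  return candidates.filter(function (candidate) {
    var tld = candidate.slice(candidate.lastIndexOf('.') + 1).toLowerCase();
    return tlds.indexOf(tld) !== -1;
  });
}

var kept = keepKnownDomains(
  "aoe toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com",
  ["com", "org"] // illustrative list only
);
console.log(kept);
```

Here foo.bar is dropped because "bar" is not in the list passed in.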

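In ES6 there is the /u modifier, which should work in the latest browsers (I haven't tested it so far). Note that /u on its own leaves \w and \b ASCII-only, so the following still does not pick up ουτοπία.δπθ.gr: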

"ουτοπία.δπθ.gr aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu".match(/\b(\w+\.)+\w+\b/gu)

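For engines that support ES2018 Unicode property escapes, a Unicode-aware variant can be sketched by spelling out the word class by hand. This is not the answer's original pattern: the class [\p{L}\p{M}\p{N}_] is an assumed stand-in for \w, and the names domainLike, text and found are made up for illustration.

```javascript
// Letters, combining marks, digits and underscore in any script.
// Greedy matching makes explicit word boundaries unnecessary here.
var domainLike = /([\p{L}\p{M}\p{N}_]+\.)+[\p{L}\p{M}\p{N}_]+/gu;

var text = "ουτοπία.δπθ.gr aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu";
var found = text.match(domainLike) || []; // match() returns null when nothing is found
console.log(found);
```

The Greek domain is now found, and so is example.org from the email address, since this sketch does nothing about the @ problem.
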
Edit:

"A negative lookbehind solves it, but apparently not in JS."

Yes, it does: to skip all email addresses, here is a working lookbehind version of the regex:

/(?<!@)\b(\w+\.)+\w+\b/g

js> "aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/(?<!@)\b(\w+\.)+\w+\b/g)
["toto.example.org", "foo.bar", "f00bar.com"]
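
In engines that implement ES2018 lookbehind assertions, the idea can be made a little stricter. The sketch below is an assumption-laden variant, not the answer's pattern: it rejects a match preceded by '@', a word character, a dot or a hyphen, so that the tail of an email domain with subdomains is not picked up either. The names notAfterAt, text and domains are hypothetical.

```javascript
// A candidate must not be preceded by '@', a word character, '.' or '-'.
// The last three stop the match from starting in the middle of a longer
// host name such as the one in toto@mail.example.org.
var notAfterAt = /(?<![@\w.-])(\w+\.)+\w+\b/g;

var text = "aoe toto@mail.example.org toto.example.org  uaoeu foo.bar";
var domains = text.match(notAfterAt) || []; // null when nothing matches
console.log(domains);
```

The local part of an address such as first.last@example.org would still match; a trailing (?!@) lookahead is one way to exclude it.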

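Like the Unicode support above, though, lookbehind is only available in newer JS engines (it was standardized in ES2018).

Without lookbehind, the only way around it is to keep the @ in the matching regex and discard any match that contains an @: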

js> "toto.net aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b\w+\.+\w+\b/g).map(function (x) { if (!x.match(/@/)) return x })
["toto.net", (void 0), "toto.example", "foo.bar", "f00bar.com"]

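Or use the array comprehensions of JS 1.7. This syntax was proposed for ES6 but never standardized and only ever shipped in Firefox, so prefer .filter() in portable code: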

[x for x of "toto.net aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b\w+\.+\w+\b/g) if (!x.match(/@/))];

Last update:

/@?\b(\w*[^\W\d]+\w*\.+)+[^\W\d_]{2,}\b/g

> "x.y tot.toc.toc $11.00 11.com 11foo.com toto.11 toto.net aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b(\w*[^\W\d]+\w*\.+)+[^\W\d_]{2,}\b/g).filter(function (x) { if (!x.match(/@/)) return x })
[ 'tot.toc.toc',
  '11foo.com',
  'toto.net',
  'toto.example.org',
  'foo.bar',
  'f00bar.com' ]
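
For reuse, the final pattern can be wrapped in a function. This is a minimal sketch around the regex above; the names DOMAIN_OR_EMAIL_RX and extractDomains are hypothetical, and the only behavioral additions are the null guard and an explicit boolean in the filter.

```javascript
// Final pattern from above. '@' is matched on purpose so that email
// domains can be recognised and dropped afterwards.
var DOMAIN_OR_EMAIL_RX = /@?\b(\w*[^\W\d]+\w*\.+)+[^\W\d_]{2,}\b/g;

function extractDomains(text) {
  // match() yields null when nothing is found, so fall back to []
  var matches = text.match(DOMAIN_OR_EMAIL_RX) || [];
  return matches.filter(function (m) {
    return m.indexOf('@') === -1; // keep only matches without '@'
  });
}

console.log(extractDomains("mail toto@example.org or visit toto.example.org"));
```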

score 0 · Accepted Answer

After much mucking around, this ended up working (with a definite hat tip to @zmo's final comment):

var rx = /\b(www\.)?(\w*@)?([a-zA-Z\-]*\.)(com|org|net|edu|COM|ORG|NET|EDU)(\.au)?(\/\S*)?/g;
var link = txt.match(rx); // txt is the text to scan; match() returns null when nothing matches
if (link !== null) {
  for (var i = 0; i < link.length; i++) {
    if (link[i].indexOf('@') == -1) {
      // no '@' in the match: create link
    } else {
      // the match contains '@': create mailto
    }
  }
}

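I'm aware of the limitations regarding subdomains, TLDs and so on (@zmo has addressed those above; if you need to catch every URL, I'd suggest adapting that code), but that wasn't my main problem. The code in my answer matches URLs in a text string that lack the "www." prefix, without mistaking the domain of an email address for a URL.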
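
The two placeholder branches could be filled in roughly as follows. This is only a sketch under stated assumptions: the function name linkify, the http:// prefix and the exact anchor markup are not from the answer, and the input is not HTML-escaped.

```javascript
// Replaces each match with an anchor: a mailto link when the match
// contains '@', an ordinary link otherwise.
function linkify(txt) {
  var rx = /\b(www\.)?(\w*@)?([a-zA-Z\-]*\.)(com|org|net|edu|COM|ORG|NET|EDU)(\.au)?(\/\S*)?/g;
  return txt.replace(rx, function (match) {
    if (match.indexOf('@') === -1) {
      return '<a href="http://' + match + '">' + match + '</a>'; // assumed scheme
    }
    return '<a href="mailto:' + match + '">' + match + '</a>';
  });
}

console.log(linkify("see foobar.com or bob@foobar.com"));
```

Using replace() with a callback avoids the separate match() call and the null check.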
