php - Regex "not followed by"

Question

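I'm writing a custom URL detector for filtering purposes, but I'm running into false positives from typos that are not URLs.

In English, a period that separates two sentences should be followed by a space, but in most cases users don't follow this rule.

I have to match URLs without a protocol prefix, basically just a domain name and a 2- or 3-character TLD. How can I exclude strings whose "TLD" part is longer than 3 characters?

Examples: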

youtube.com (should match)

something.This (fragment of a sentence. Should not match because "This" contains 4 chars.)

Note that these strings may appear anywhere in the haystack (start, middle, end). My current regex looks like this:

.'((https?|ftp)://)?'         // Protocol (optional)
.'(www(\.|\%2[Ee]))?'         // www prefix (optional)
.'([a-zA-Z-]+(\.|\%2[Ee]))+'  // domain strings separated by dot
.'([a-zA-Z-]{2,3})'           // tld 2 or 3 chars (should not be followed by another alpha)
.'([/\?]\S*)*'                // subdirectory or GET (optional)
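
To make the problem concrete, here is a minimal sketch that assembles the fragments above and runs them over a sample sentence. The "~" delimiters and the sample text are assumptions for illustration; the original snippet does not show how the pattern is delimited or called.

```php
<?php
// Hypothetical assembly of the fragments above; the "~" delimiter and
// the sample sentence are assumptions made for this illustration.
$pattern = '~'
    . '((https?|ftp)://)?'         // Protocol (optional)
    . '(www(\.|\%2[Ee]))?'         // www prefix (optional)
    . '([a-zA-Z-]+(\.|\%2[Ee]))+'  // domain strings separated by dot
    . '([a-zA-Z-]{2,3})'           // tld 2 or 3 chars, nothing checked after it
    . '([/\?]\S*)*'                // subdirectory or GET (optional)
    . '~';

$text = 'See youtube.com for something.This is new';

preg_match_all($pattern, $text, $m);
$found = $m[0];  // full matches only

print_r($found);
```

Besides youtube.com, this also reports something.Thi: the TLD group simply stops after three letters, and nothing prevents a fourth letter from following.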

score 1 · Accepted Answer

If I were to modify your regex to achieve that, I would add a positive lookahead after the TLD check:

((https?|ftp):\/\/)?(www(\.|\%2[Ee]))?([a-zA-Z-]+(\.|\%2[Ee]))+([a-zA-Z-]{2,3}(?=\W|\b))([\/\?]\S*)*

Here it is broken down:

((https?|ftp)://)?         // Protocol (optional)
(www(\.|\%2[Ee]))?         // www prefix (optional)
([a-zA-Z-]+(\.|\%2[Ee]))+  // domain strings separated by dot
([a-zA-Z-]{2,3}(?=\W|\b))  // tld 2 or 3 chars, followed by a non-word character or a word boundary
([/\?]\S*)*                // sub directory or GET (optional)
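
As a rough check of the lookahead, the sketch below runs the pattern from PHP. The helper name findUrls(), the "~" delimiters and the sample text are assumptions for illustration. The second call uses a negative lookahead, (?![a-zA-Z]), which is not part of the answer above but is the literal "not followed by a letter" form of the same idea.

```php
<?php
// Sketch only: findUrls(), the "~" delimiter and the sample text are
// assumptions; the (?=\W|\b) lookahead is the one from the answer.
function findUrls(string $text, string $tldGuard): array
{
    $pattern = '~'
        . '((https?|ftp)://)?'                  // Protocol (optional)
        . '(www(\.|\%2[Ee]))?'                  // www prefix (optional)
        . '([a-zA-Z-]+(\.|\%2[Ee]))+'           // domain strings separated by dot
        . '([a-zA-Z-]{2,3}' . $tldGuard . ')'   // tld plus the guard being tested
        . '([/\?]\S*)*'                         // subdirectory or GET (optional)
        . '~';

    preg_match_all($pattern, $text, $m);
    return $m[0];  // full matches only
}

$text = 'See youtube.com for something.This is new';

// Positive lookahead from the answer.
$answer = findUrls($text, '(?=\W|\b)');

// Alternative (not from the answer): negative lookahead on letters only.
$negative = findUrls($text, '(?![a-zA-Z])');

print_r($answer);
print_r($negative);
```

For this sample both variants return only youtube.com. They differ when a digit or underscore follows the TLD: for a string such as example.com1, the answer's version finds nothing, while the negative lookahead still matches example.com.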

