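I'm trying to put together a regex for what Twitter treats as a valid URL, with the following characteristics:
- It may only have a valid domain extension
- It may start with http://, with https://, or with nothing
- If it doesn't start with http:// or https://, it is considered a valid domain when either:
  - it has a valid domain extension of more than 2 letters (.com, .org, etc.), or
  - it has a valid 2-letter domain extension and that extension is in uppercase (.CO, .ES, etc.)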
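To make the two extension rules concrete, here is a minimal sketch of how they could be folded into one pattern. The shortened extension lists and the names `ExtensionPattern` and `HasValidExtension` are assumptions made up for this illustration, not part of my real code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ExtensionCaseDemo
    ' Hypothetical, shortened lists; the real ones are much longer.
    ' (?i:...) switches on case-insensitive matching for that group only,
    ' so the 2-letter alternatives CO and ES stay case-sensitive.
    Const ExtensionPattern As String = "\.(?:(?i:com|org|net)|CO|ES)$"

    Function HasValidExtension(host As String) As Boolean
        Return Regex.IsMatch(host, ExtensionPattern)
    End Function

    Sub Main()
        Console.WriteLine(HasValidExtension("example.com")) ' True
        Console.WriteLine(HasValidExtension("example.Com")) ' True
        Console.WriteLine(HasValidExtension("example.ES"))  ' True
        Console.WriteLine(HasValidExtension("example.es"))  ' False
    End Sub
End Module
```

With the inline `(?i:...)` group, the regex itself must be created without `RegexOptions.IgnoreCase`, otherwise the uppercase-only rule is lost.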
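So my problem now is that I have separate regexes for URLs with http and without it, and http://example.com gets counted twice: once by the http regex and once by the non-http regex. The non-http regex should contain a term that excludes http and https, and this is where I'm failing, with attempts such as "(http:\/\/|https:\/\/){0}" or "^(http:\/\/)".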
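As far as I can tell, neither of those two attempts excludes anything. The small sketch below shows what each one matches. The helper `FirstMatch` and the simplified `example\.com` pattern are assumptions for illustration only.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ZeroQuantifierDemo
    ' Returns the first match of pattern in input, or Nothing when there is none.
    Function FirstMatch(pattern As String, input As String) As String
        Dim m As Match = Regex.Match(input, pattern)
        Return If(m.Success, m.Value, Nothing)
    End Function

    Sub Main()
        ' {0} repeats the group zero times, so the group matches the empty string
        ' and the engine simply starts the match after the prefix.
        Console.WriteLine(FirstMatch("(http://|https://){0}example\.com", "http://example.com"))
        ' prints: example.com

        ' ^(http://) requires the prefix at the start of the input; it does not exclude it.
        Console.WriteLine(FirstMatch("^(http://)example\.com", "http://example.com"))
        ' prints: http://example.com
    End Sub
End Module
```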
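Basically the question boils down to: how do I count http://example.com in the regex for http-prefixed URLs and keep it out of the regex for non-http-prefixed URLs, i.e. avoid also counting example.com?
My code: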
Dim validDomainExtensions As String = "(aero|arpa|asia|a[cdefgilmnoqrstuwxz]|biz|b[abdefghijmnorstvwyz]|cat|com|coop|c[acdfghiklmnorsuvxyz]|d[ejkmoz]|edu|e[ceghrstu]|f[ijkmor]|gov|g[abdefghilmnpqrstuwy]|h[kmnrtu]|info|int|i[delmnoqrst]|jobs|j[emop]|k[eghimnprwyz]|l[abcikrstuvy]|mil|mobi|museum|m[acdghklmnopqrstuvwxyz]|name|net|n[acefgilopruz]|om|org|pro|p[aefghklmnrstwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|travel|t[cdfghjklmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]){1}"
Dim validDomainExtensionsIgnoreCase As String = "(aero|arpa|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|travel){1}"
Dim validDomainExtensionsUpperCase As String = "(A[CDEFGILMNOQRSTUWXZ]|B[ABDEFGHIJMNORSTVWYZ]|C[ACDFGHIKLMNORSUVXYZ]|D[EJKMOZ]|E[CEGHRSTU]|F[IJKMOR]|G[ABDEFGHILMNPQRSTUWY]|H[KMNRTU]|I[DELMNOQRST]|J[EMOP]|K[EGHIMNPRWYZ]|L[ABCIKRSTUVY]|M[ACDGHKLMNOPQRSTUVWXYZ]|N[ACEFGILOPRUZ]|OM|P[AEFGHKLMNRSTWY]|QA|R[EOUW]|S[ABCDEGHIJKLMNORTVYZ]|T[CDFGHJKLMNOPRTVWZ]|U[AGKMSYZ]|V[ACEGINU]|W[FS]|Y[ETU]|Z[AMW]){1}"
Dim validDomainName As String = String.Concat("[\w\-_]+(\.[\w\-_]+)*[\.]{1}", _
validDomainExtensions, _
"([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?")
Dim validDomainNameSinHTTP1 As String = String.Concat("[\w\-_]+(\.[\w\-_]+)*[\.]{1}", _
validDomainExtensionsIgnoreCase, _
"([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?")
Dim validDomainNameSinHTTP2 As String = String.Concat("[\w\-_]+(\.[\w\-_]+)*[\.]{1}", _
validDomainExtensionsUpperCase, _
"([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?")
Dim regxHTTP As New Regex(String.Concat("(http:\/\/|https:\/\/)+", validDomainName), RegexOptions.IgnoreCase)
Dim regxSinHTTP1 As New Regex(String.Concat("(http:\/\/|https:\/\/){0}", validDomainNameSinHTTP1), RegexOptions.IgnoreCase)
Dim regxSinHTTP2 As New Regex(String.Concat(validDomainNameSinHTTP2))
Dim matchesHTTP As MatchCollection = regxHTTP.Matches(txtTweet.Text)
Dim matchesSinHTTP1 As MatchCollection = regxSinHTTP1.Matches(txtTweet.Text)
Dim matchesSinHTTP2 As MatchCollection = regxSinHTTP2.Matches(txtTweet.Text)
textoSinUrls = regxHTTP.Replace(txtTweet.Text, "")
textoSinUrls = regxSinHTTP1.Replace(textoSinUrls, "")
textoSinUrls = regxSinHTTP2.Replace(textoSinUrls, "")
For Each match As Match In matchesHTTP
    txtUrlsDetectadas.Text = String.Concat(match.Value, vbNewLine, txtUrlsDetectadas.Text)
    ' regxHTTP ignores case, so compare the scheme case-insensitively too
    If match.Value.StartsWith("https://", StringComparison.OrdinalIgnoreCase) Then
        NUrlsHTTPS += 1
    Else
        NUrlsHTTP += 1
    End If
Next
For Each match As Match In matchesSinHTTP1
    ' It fails here: match.Value is example.com even when I typed http://example.com
    ' StartsWith avoids the exception Substring(0, 7) throws on matches shorter than 7 characters
    If Not match.Value.StartsWith("http://", StringComparison.OrdinalIgnoreCase) Then
        txtUrlsDetectadas.Text = String.Concat(match.Value, vbNewLine, txtUrlsDetectadas.Text)
        NUrlsHTTP += 1
    End If
Next
For Each match As Match In matchesSinHTTP2
    If Not match.Value.StartsWith("http://", StringComparison.OrdinalIgnoreCase) Then
        txtUrlsDetectadas.Text = String.Concat(match.Value, vbNewLine, txtUrlsDetectadas.Text)
        NUrlsHTTP += 1
    End If
Next
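One direction that might address the double counting is a negative lookbehind in the non-http regex. This is only a sketch under assumptions: the extension list is cut down to three entries, and `Extensions`, `Host`, `CountWithScheme` and `CountWithoutScheme` are hypothetical names rather than the ones in my code above. It relies on .NET allowing variable-length lookbehind.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LookbehindDemo
    ' Hypothetical, shortened extension list used only for this illustration.
    Const Extensions As String = "(?:com|org|net)"
    ' Host name followed by a valid extension; \b stops "example.community" from matching as "example.com".
    Const Host As String = "[\w\-]+(?:\.[\w\-]+)*\." & Extensions & "\b"

    Function CountWithScheme(text As String) As Integer
        Return Regex.Matches(text, "https?://" & Host, RegexOptions.IgnoreCase).Count
    End Function

    Function CountWithoutScheme(text As String) As Integer
        ' (?<!https?://\S*) rejects every start position that has "http://" or "https://"
        ' earlier in the same whitespace-delimited token, so neither "example.com"
        ' nor "xample.com" can match inside "http://example.com".
        Return Regex.Matches(text, "(?<!https?://\S*)" & Host, RegexOptions.IgnoreCase).Count
    End Function

    Sub Main()
        Dim sample As String = "see http://example.com and https://www.example.org or plain example.net"
        Console.WriteLine(CountWithScheme(sample))    ' 2
        Console.WriteLine(CountWithoutScheme(sample)) ' 1
    End Sub
End Module
```

A plain `(?<!https?://)` would not be enough on its own, because the engine would just retry one character later and match `xample.com`. The `\S*` inside the lookbehind is what covers the rest of the token.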