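I'm trying to set up a regular expression for the URLs that Twitter treats as valid, with the following characteristics:

  • It may only have a valid domain extension.
  • It may start with http://, with https://, or with no scheme at all.
  • If it does not start with http:// or https://, it counts as a valid domain when either:
    • the extension is a valid one with more than 2 letters (.com, .org, etc.), or
    • the extension is a valid 2-letter one and is written in uppercase (.CO, .ES, etc.).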

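So my problem now is that I have separate regular expressions for URLs with http and without it, and http://example.com is counted twice: once by the http regex and once by the non-http regex. The non-http regex should contain a term that excludes http and https, and that is where I fail, with attempts such as "(http:\/\/|https:\/\/){0}" or "^(http:\/\/)".

Basically the question boils down to this: how can the regex for http-prefixed URLs count http://example.com while the regex for non-prefixed URLs is prevented from also counting the example.com inside it?

My code: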

    Dim validDomainExtensions As String = "(aero|arpa|asia|a[cdefgilmnoqrstuwxz]|biz|b[abdefghijmnorstvwyz]|cat|com|coop|c[acdfghiklmnorsuvxyz]|d[ejkmoz]|edu|e[ceghrstu]|f[ijkmor]|gov|g[abdefghilmnpqrstuwy]|h[kmnrtu]|info|int|i[delmnoqrst]|jobs|j[emop]|k[eghimnprwyz]|l[abcikrstuvy]|mil|mobi|museum|m[acdghklmnopqrstuvwxyz]|name|net|n[acefgilopruz]|om|org|pro|p[aefghklmnrstwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|travel|t[cdfghjklmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]){1}"

    Dim validDomainExtensionsIgnoreCase As String = "(aero|arpa|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|travel){1}"
    Dim validDomainExtensionsUpperCase As String = "(A[CDEFGILMNOQRSTUWXZ]|B[ABDEFGHIJMNORSTVWYZ]|C[ACDFGHIKLMNORSUVXYZ]|D[EJKMOZ]|E[CEGHRSTU]|F[IJKMOR]|G[ABDEFGHILMNPQRSTUWY]|H[KMNRTU]|I[DELMNOQRST]|J[EMOP]|K[EGHIMNPRWYZ]|L[ABCIKRSTUVY]|M[ACDGHKLMNOPQRSTUVWXYZ]|N[ACEFGILOPRUZ]|OM|P[AEFGHKLMNRSTWY]|QA|R[EOUW]|S[ABCDEGHIJKLMNORTVYZ]|T[CDFGHJKLMNOPRTVWZ]|U[AGKMSYZ]|V[ACEGINU]|W[FS]|Y[ETU]|Z[AMW]){1}"

    Dim validDomainName As String = String.Concat("[\w\-_]+(\.[\w\-_]+)*[\.]{1}", _
                                                  validDomainExtensions, _
                                                  "([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?")

    Dim validDomainNameSinHTTP1 As String = String.Concat("[\w\-_]+(\.[\w\-_]+)*[\.]{1}", _
                                                  validDomainExtensionsIgnoreCase, _
                                                  "([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?")

    Dim validDomainNameSinHTTP2 As String = String.Concat("[\w\-_]+(\.[\w\-_]+)*[\.]{1}", _
                                                  validDomainExtensionsUpperCase, _
                                                  "([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?")

    Dim regxHTTP As New Regex(String.Concat("(http:\/\/|https:\/\/)+", validDomainName), RegexOptions.IgnoreCase)
    Dim regxSinHTTP1 As New Regex(String.Concat("(http:\/\/|https:\/\/){0}", validDomainNameSinHTTP1), RegexOptions.IgnoreCase)
    Dim regxSinHTTP2 As New Regex(String.Concat(validDomainNameSinHTTP2))

    Dim matchesHTTP As MatchCollection = regxHTTP.Matches(txtTweet.Text)
    Dim matchesSinHTTP1 As MatchCollection = regxSinHTTP1.Matches(txtTweet.Text)
    Dim matchesSinHTTP2 As MatchCollection = regxSinHTTP2.Matches(txtTweet.Text)

    textoSinUrls = regxHTTP.Replace(txtTweet.Text, "")
    textoSinUrls = regxSinHTTP1.Replace(textoSinUrls, "")
    textoSinUrls = regxSinHTTP2.Replace(textoSinUrls, "")

    For Each match As Match In matchesHTTP
        txtUrlsDetectadas.Text = String.Concat(match.Value, vbNewLine, txtUrlsDetectadas.Text)

        If match.Value.StartsWith("https://", StringComparison.OrdinalIgnoreCase) Then
            NUrlsHTTPS += 1
        Else
            NUrlsHTTP += 1
        End If
    Next

    For Each match As Match In matchesSinHTTP1
        'It fails here: match.Value is actually example.com even when I typed http://example.com

        If Not match.Value.StartsWith("http://", StringComparison.OrdinalIgnoreCase) Then
            txtUrlsDetectadas.Text = String.Concat(match.Value, vbNewLine, txtUrlsDetectadas.Text)
            NUrlsHTTP += 1
        End If
    Next

    For Each match As Match In matchesSinHTTP2
        If Not match.Value.StartsWith("http://", StringComparison.OrdinalIgnoreCase) Then
            txtUrlsDetectadas.Text = String.Concat(match.Value, vbNewLine, txtUrlsDetectadas.Text)
            NUrlsHTTP += 1
        End If
    Next

2 Answers

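I would build an alternation with one branch for the http(s) case and another for the non-http(s) case, and check in the pattern that the URL meets the domain condition. Something like this: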

    (\A(?!http)[your regex]|\A(?=.*\.(CO|ES|com|org))http[your regex])

The lookahead (?=.*\.(CO|ES|com|org)) checks whether .CO, .ES, etc. occurs somewhere in the string without consuming it.
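
As a rough illustration of the alternation idea, here is a minimal VB.NET sketch that uses a single pattern, so each URL is matched once and then classified by whether the scheme group took part in the match. Several things are assumptions and not part of the original post: the extension list is cut down to com|org|net and CO|ES, paths after the domain are ignored, the names UrlCountSketch, UrlPattern and the group name scheme are made up, and the leading lookbehind (which stops a match from starting in the middle of a longer token) is an addition. The pattern is compiled without RegexOptions.IgnoreCase so that the 2-letter extensions can stay case-sensitive in the scheme-less branch.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module UrlCountSketch
    ' Assumption: a shortened extension list, for illustration only.
    Private Const LongExt As String = "(?i:com|org|net)"   ' accepted in any case
    Private Const ShortExtUpper As String = "CO|ES"        ' uppercase only without a scheme
    Private Const Host As String = "[\w\-]+(?:\.[\w\-]+)*"

    ' Branch 1: scheme present, any listed extension in any case.
    ' Branch 2: no scheme, long extension in any case or 2-letter extension in uppercase.
    Private ReadOnly UrlPattern As New Regex( _
        "(?<![\w.\-/])(?:" & _
        "(?<scheme>(?i:https?)://)" & Host & "\.(?:" & LongExt & "|(?i:" & ShortExtUpper & "))" & _
        "|" & _
        Host & "\.(?:" & LongExt & "|" & ShortExtUpper & ")" & _
        ")\b")

    Sub Main()
        Dim sample As String = "see http://example.com, https://a.org and example.CO but not example.es"
        Dim nHttp, nHttps, nBare As Integer

        ' Matches() returns non-overlapping matches, so example.com is not counted a second time.
        For Each m As Match In UrlPattern.Matches(sample)
            Dim scheme As String = m.Groups("scheme").Value
            If scheme.Length = 0 Then
                nBare += 1
            ElseIf scheme.StartsWith("https", StringComparison.OrdinalIgnoreCase) Then
                nHttps += 1
            Else
                nHttp += 1
            End If
            Console.WriteLine(m.Value)
        Next

        ' For the sample text above this prints: http=1 https=1 bare=1
        Console.WriteLine("http={0} https={1} bare={2}", nHttp, nHttps, nBare)
    End Sub
End Module
```

Because the scheme branch is tried first at every position and the matches never overlap, there is no need for a separate pass over the text with a second regex.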

Answered 2013-11-11T15:36:10.550

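In the end I had to combine the regular expression with Substring, because I could not find a way to do it within the regex alone. More information here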
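
The answer does not show its code, so the following is only a guess at what combining a regex with Substring could look like: run the scheme-less regex, then inspect the characters just before each match in the source text and skip the match when they spell http:// or https://. The helper name HasSchemeBefore, the module name and the shortened extension list are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PrefixCheckSketch
    ' Hypothetical helper: True when the text directly before matchIndex is http:// or https://.
    Function HasSchemeBefore(source As String, matchIndex As Integer) As Boolean
        For Each scheme As String In {"https://", "http://"}
            ' Guard the length first so Substring never gets a negative start index.
            If matchIndex >= scheme.Length AndAlso _
               String.Equals(source.Substring(matchIndex - scheme.Length, scheme.Length), _
                             scheme, StringComparison.OrdinalIgnoreCase) Then
                Return True
            End If
        Next
        Return False
    End Function

    Sub Main()
        ' Assumption: shortened extension list, no path handling.
        Dim bare As New Regex("[\w\-]+(?:\.[\w\-]+)*\.(?:com|org|net)\b", RegexOptions.IgnoreCase)
        Dim source As String = "http://example.com and example.org"

        For Each m As Match In bare.Matches(source)
            ' example.com is skipped because http:// sits right before it.
            If Not HasSchemeBefore(source, m.Index) Then
                Console.WriteLine(m.Value)   ' prints example.org
            End If
        Next
    End Sub
End Module
```

The check uses m.Index against the original string, not match.Value, since the matched value itself never contains the scheme.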

Answered 2013-11-15T09:22:16.980