
Could someone please help me understand this regular expression, which is used to match the src attribute of an img tag in HTML?

src=(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))


src=                               this is easy
(?:(['""])(?<src>(?:(?!\1).)*)     ?: is unknown (['""]) matches either single or double quotes, followed by a named group "src" that matches unknown strings
\1                                 unknown
|                                  "or"
(?<src>[^\s>]+))                   named group "src" matches one or more of line start or whitespace

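In short, what does ?: mean?

(?:...) A non-capturing version of regular parentheses. It matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.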
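To make the quoted definition concrete, here is a small sketch in Python's re module (an assumption on my part: the question is about a .NET pattern, but (?:...) behaves the same way in both flavors). The patterns and the file name are made up for illustration.

```python
import re

# Capturing parentheses: the alternation is stored as group 1.
capturing = re.compile(r"(jpg|png)$")
# Non-capturing parentheses: same grouping, but nothing is stored.
non_capturing = re.compile(r"(?:jpg|png)$")

m1 = capturing.search("photo.png")
m2 = non_capturing.search("photo.png")

print(capturing.groups, m1.groups())      # 1 ('png',)
print(non_capturing.groups, m2.groups())  # 0 ()
print(m1.group(0), m2.group(0))           # both match the text "png"
```

Both patterns match the same text; the only difference is whether the group can be retrieved or referenced afterwards.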

Thanks, @mbratch.

What does \1 mean?

Finally, does the exclamation mark have any special meaning here? (Negation?)

5 Answers

This may help you understand the regular expression.

(?:(['""])((?:(?!\1).)*)\1|([^\s>]+))

[Image: diagram of the regular expression, editable live on Debuggex]

answered 2013-05-29T13:29:23.437

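As an example, consider src="img.jpg" as the text we are parsing.

In a regular expression, \1 refers to the first capturing group. In this particular case, the first capturing group is (['""]). The part (?:(['""])(?<src>(?:(?!\1).)*) is the start of the non-capturing group, and in our example it matches "img.jpg. In particular, (['""]) matches either quote character. Then (?!\1) is a negative lookahead for the quote character that the first group matched, so (?:(?!\1).) matches any single character that is not that quote character, and (?<src>(?:(?!\1).)*) captures, in the named group, the sequence of characters up to the closing quote. The \1 that follows then matches the closing quote character.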
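The walk-through above can be checked with a short sketch. It is written in Python rather than .NET (my assumption, so the named group is spelled (?P<src>...) and the doubled "" from the C# verbatim string becomes a plain "), and it only covers the quoted branch of the pattern. The sample tags are made up.

```python
import re

# Quoted branch only: opening quote, value, then the same quote again.
quoted = re.compile(r"""src=(['"])(?P<src>(?:(?!\1).)*)\1""")

m = quoted.search('<img src="img.jpg" alt="x">')
print(m.group(1))      # the quote character captured by group 1
print(m.group("src"))  # img.jpg
print(m.group(0))      # src="img.jpg"

# The other kind of quote is allowed inside the value, because the
# lookahead only rejects the quote that opened the attribute.
m2 = quoted.search("""<img src='a"b.png'>""")
print(m2.group("src"))  # a"b.png
```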

answered 2013-05-29T13:34:15.620
src=      # matches literal "src="
(?:       # the ?: suppresses capturing. generally a good practice if capturing
          # is not explicitly necessary
  (['"])  # matches either ' or ", and captures what was matched in group 1
          # (because this is the first set of parentheses where capturing is not
          # suppressed)
  (?<src> # start another (named) capturing group with the name "src"
    (?:   # start non-capturing group
      (?!\1)
          # a negative lookahead, if its contents match, the lookahead causes the
          # pattern to fail
          # the \1 is a backreference and matches what was matched in capturing
          # group no. 1
    .)*   # match any character, end of non-capturing group, repeat
          # summary of this non-capturing group: for each character, check that
          # it is not the kind of quote we matched at the start. if it's not,
          # then consume it. repeat as long as possible.

  )       # end of capturing group "src"
  \1      # again a backreference to what was matched inside capturing group 1
          # i.e. match the same kind of quote that started the attribute value
|         # or
  (?<src> # again a capturing group with the name "src"
    [^\s>]+
          # match as many non-space, non-> characters as possible (at least one)
  )       # end of capturing group. this case treats unquoted attribute values.
)         # end of non-capturing group (which was used to group the alternation)
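If it helps to experiment, the same breakdown can be approximated in Python using re.VERBOSE. This is only a sketch under a few assumptions: Python's re does not allow two groups with the same name, so the hypothetical names quoted and bare stand in for the two src groups, and extract_src is a helper I made up for illustration.

```python
import re

IMG_SRC = re.compile(
    r"""
    src=
    (?:
        (['"])                      # group 1: the opening quote
        (?P<quoted>(?:(?!\1).)*)    # everything up to the same quote
        \1                          # the closing quote
      |
        (?P<bare>[^\s>]+)           # unquoted attribute value
    )
    """,
    re.VERBOSE,
)


def extract_src(tag):
    """Return the src value found in tag, or None if there is none."""
    m = IMG_SRC.search(tag)
    if m is None:
        return None
    # Only one branch of the alternation takes part in a match;
    # the group from the other branch is None.
    if m.group("quoted") is not None:
        return m.group("quoted")
    return m.group("bare")


print(extract_src('<img src="a b.png">'))  # a b.png
print(extract_src("<img src='x.png'>"))    # x.png
print(extract_src("<img src=logo.gif>"))   # logo.gif
```

In .NET the two branches can share the name src, so the caller does not need the extra check.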

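Some further reading:

If you want to brush up on your regex knowledge a bit, I recommend reading through the entire tutorial. It is definitely worth your time.

There are also some resources that can help you understand complex expressions:

  • Regex 101 generates an explanation from a regex. However, it uses PHP's PCRE engine, so it will choke on some .NET features, such as repeated named capturing groups (src in your case).
  • Debuggex lets you step through a regex and generates a flowchart. So far its regex flavor is more limited (JavaScript's ECMAScript flavor).
  • Regexper focuses on flowcharts. So far, though, it is also limited to JavaScript regexes.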
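The repeated named group really is a flavor-specific feature. As an illustration only (Python's re module is not one of the tools listed above, it is just another engine that rejects the construct), this sketch shows a pattern with two groups called src failing to compile:

```python
import re

try:
    # Two groups with the same name, as in the .NET pattern.
    re.compile(r"(?P<src>a)|(?P<src>b)")
    duplicate_allowed = True
except re.error as exc:
    duplicate_allowed = False
    print("rejected:", exc)

print(duplicate_allowed)  # False
```

When porting the pattern to such an engine, the two groups need different names.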
answered 2013-05-29T13:35:59.063

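1> It first captures either ' or " in group 1, i.e. (['""])

2> It then matches zero or more characters that are not the character captured in group 1, i.e. (?:(?!\1).)*

3> It repeats step 2 until it reaches the character captured in group 1, i.e. \1

Conceptually, the three steps above are like (['""])[^\1]*\1. That is only shorthand, though: a backreference has no special meaning inside a character class, which is why the pattern uses the lookahead form instead.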
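To see why the [^\1] shorthand should not be taken literally, here is a small Python sketch (Python is my choice for illustration; there, \1 inside a character class is read as an octal escape for the character with code 1, not as a backreference). The sample text is made up.

```python
import re

text = '"a" and "b"'

# Lookahead form from the question: stops at the first matching quote.
tempered = re.compile(r"""(['"])(?:(?!\1).)*\1""")
# Character-class shorthand: [^\1] only excludes the character \x01,
# so the greedy * runs on and the match ends at the LAST quote.
char_class = re.compile(r"""(['"])[^\1]*\1""")

print(tempered.search(text).group(0))    # "a"
print(char_class.search(text).group(0))  # "a" and "b"
```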

Or

1> It matches all the characters after src= that are neither whitespace nor >, i.e. [^\s>]+


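Note: I would use src=(['""]).*?\1

.* is greedy: it matches as much as possible.

.*? is lazy: it matches as little as possible.

For example, consider the string hello hi world

For the regex ^h.*l the output would be hello hi worl

For the regex ^h.*?l the output would be hel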
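The greedy/lazy difference is easy to reproduce. Here is a minimal sketch in Python (an assumption on my part; the quantifiers behave the same way in .NET), followed by the lazy alternative applied to a made-up tag:

```python
import re

text = "hello hi world"
greedy = re.search(r"^h.*l", text).group(0)
lazy = re.search(r"^h.*?l", text).group(0)
print(greedy)  # hello hi worl
print(lazy)    # hel

# The suggested lazy pattern stops at the first closing quote
# instead of running on to the last quote in the tag.
tag = '<img src="a.png" alt="x">'
lazy_src = re.search(r"""src=(['"]).*?\1""", tag).group(0)
print(lazy_src)  # src="a.png"
```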

answered 2013-05-29T13:29:36.770

I used RegexBuddy to get this output:

Match the characters “src=” literally «src=»
Match the regular expression below «(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))»
   Match either the regular expression below (attempting the next alternative only if this one fails) «(['""])(?<src>(?:(?!\1).)*)\1»
      Match the regular expression below and capture its match into backreference number 1 «(['""])»
         Match a single character present in the list “'"” «['""]»
      Match the regular expression below and capture its match into backreference with name “src” «(?<src>(?:(?!\1).)*)»
         Match the regular expression below «(?:(?!\1).)*»
            Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
            Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!\1)»
               Match the same text as most recently matched by capturing group number 1 «\1»
            Match any single character that is not a line break character «.»
      Match the same text as most recently matched by capturing group number 1 «\1»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «(?<src>[^\s>]+)»
      Match the regular expression below and capture its match into backreference with name “src” «(?<src>[^\s>]+)»
         Match a single character NOT present in the list below «[^\s>]+»
            Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
            A whitespace character (spaces, tabs, line breaks, etc.) «\s»
            The character “>” «>»

This regex is pretty bad for what you describe. src=" is a valid input.

answered 2013-05-29T13:29:50.780