I'm looking through someone else's old code and having some trouble understanding it.

He has:

explode(' ', strtolower(preg_replace('/[^a-z0-9-]+/i', ' ', preg_replace('/\&#?[a-z0-9]{2,4}\;/', ' ', preg_replace('/<[^>]+>/', ' ', $texts)))));

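I think the first regex excludes a-z and 0-9, but I'm not sure what the second regex does. The third one matches anything inside '< >' except '>'.

The result is an array containing every word in the $texts variable, but I just don't know how the code produces it. I do understand what preg_replace and the other functions do; I just don't know how the process works.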

2 Answers

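The expression /[^a-z0-9-]+/i matches (and the call then replaces with a space) any run of characters other than a-z, 0-9, and the hyphen. The ^ in [^...] negates the character set that follows it.

  • [^a-z0-9-] matches any single character that is not a letter, a digit, or a hyphen
  • + means one or more of the preceding
  • /i makes the match case-insensitive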
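
As a quick check of that pattern on its own, here is a minimal sketch. The sample string is made up for illustration and is not from the question.

```php
<?php
// Hypothetical sample input, chosen to include punctuation, an apostrophe and a hyphen.
$sample = "Hello, World! It's a well-known fact: 42.";

// Each run of characters outside a-z, 0-9 and "-" (case-insensitive) becomes one space.
$cleaned = preg_replace('/[^a-z0-9-]+/i', ' ', $sample);

var_dump($cleaned);
// The value is "Hello World It s a well-known fact 42 " (note the trailing space).
```

The hyphen in "well-known" survives because it is listed inside the negated class, while the apostrophe in "It's" is replaced and splits the word in two.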

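The expression /\&#?[a-z0-9]{2,4}\;/ matches an & followed by an optional #, followed by two to four letters or digits, and ending with a ;. This matches HTML entities such as &nbsp; or &#39;. Since there is no /i flag here, only lowercase letters match, and entities with longer names (for example &hellip;) are left alone.

  • \&#? matches & or &#. The ? makes the preceding # optional. The & does not actually need to be escaped.
  • [a-z0-9]{2,4} matches two to four lowercase letters or digits
  • \; is a literal semicolon. It does not actually need to be escaped.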
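
To see which entities the pattern does and does not catch, a small sketch with a made-up input string (not from the question):

```php
<?php
// Hypothetical sample: two short entities, one long entity, one uppercase entity.
$sample = 'a&nbsp;b&#39;c&hellip;d&AMP;e';

// Same pattern as above, written without the unnecessary escapes.
$cleaned = preg_replace('/&#?[a-z0-9]{2,4};/', ' ', $sample);

var_dump($cleaned);
// The value is "a b c&hellip;d&AMP;e":
// &nbsp; and &#39; are replaced, &hellip; is too long, and &AMP; is uppercase.
```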

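The last one, as you partly suspected, replaces any tag such as <tagname>, <tagname attr='value'> or </tagname> with a space. Note that it matches the whole tag, not just the contents between < and >.

  • < is a literal character
  • [^>]+ is every character up to, but not including, the next >
  • > is a literal character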
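
Here is a short sketch of the tag pattern next to strip_tags(), using a made-up input. One difference worth knowing: the regex leaves a space where each tag was, while strip_tags() removes the tags without inserting anything, so adjacent words can fuse together.

```php
<?php
// Hypothetical sample input with two adjacent paragraphs.
$sample = '<p>one</p><p>two</p>';

// Every whole tag, from "<" through the next ">", becomes a single space.
$viaRegex = preg_replace('/<[^>]+>/', ' ', $sample);

// strip_tags() deletes the tags and inserts nothing in their place.
$viaStripTags = strip_tags($sample);

var_dump($viaRegex);     // " one  two "
var_dump($viaStripTags); // "onetwo"
```

For word splitting, the extra spaces from the regex are probably what the original author wanted, since they keep "one" and "two" apart.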

I would really recommend rewriting this as three separate calls to preg_replace() rather than nesting them.

// Strips tags.  
// Would be better done with strip_tags()!!
$texts = preg_replace('/<[^>]+>/', ' ', $texts);
// Removes HTML entities
$texts = preg_replace('/&#?[a-z0-9]{2,4};/', ' ', $texts);
// Removes remaining non-alphanumerics (hyphens are kept)
$texts = preg_replace('/[^a-z0-9-]+/i', ' ', $texts);
// Lowercase before splitting, as the original nested call does
$array = explode(' ', strtolower($texts));
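
If the empty strings that explode() returns for leading or trailing spaces are unwanted, one possible variation is preg_split() with PREG_SPLIT_NO_EMPTY. This is my own assumption about what would be useful, not what the original code does. The function name extract_words and the sample input are hypothetical.

```php
<?php
// Hypothetical helper wrapping the three steps; not part of the original code.
function extract_words(string $texts): array
{
    $texts = preg_replace('/<[^>]+>/', ' ', $texts);           // whole tags -> space
    $texts = preg_replace('/&#?[a-z0-9]{2,4};/', ' ', $texts); // short entities -> space
    $texts = preg_replace('/[^a-z0-9-]+/i', ' ', $texts);      // everything else -> space

    // Unlike explode(' ', ...), PREG_SPLIT_NO_EMPTY drops empty pieces
    // caused by spaces at the start or end of the string.
    return preg_split('/ +/', strtolower($texts), -1, PREG_SPLIT_NO_EMPTY);
}

$words = extract_words('<p>Hello,&nbsp;World!</p>');
print_r($words); // contains "hello" and "world", with no empty strings
```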
answered 2013-03-19T23:30:57.590
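This code looks like it...

  1. Strips HTML/XML tags (anything between < and >)
  2. Then strips anything that starts with & or &#, is followed by 2-4 alphanumeric characters, and ends with ;
  3. Then strips anything that is not alphanumeric or a dash

In order of nested processing: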

/<[^>]+>/

Match the character “<” literally «<»
Match any character that is NOT a “>” «[^>]+»
   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “>” literally «>»


/\&#?[a-z0-9]{2,4}\;/

Match the character “&” literally «\&»
Match the character “#” literally «#?»
   Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match a single character present in the list below «[a-z0-9]{2,4}»
   Between 2 and 4 times, as many times as possible, giving back as needed (greedy) «{2,4}»
   A character in the range between “a” and “z” «a-z»
   A character in the range between “0” and “9” «0-9»
Match the character “;” literally «\;»


/[^a-z0-9-]+/i

Options: case insensitive

Match a single character NOT present in the list below «[^a-z0-9-]+»
   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   A character in the range between “a” and “z” «a-z»
   A character in the range between “0” and “9” «0-9»
   The character “-” «-»
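
To watch the three steps happen in that order, here is a rough trace on a made-up input (the sample string is hypothetical). It unrolls the nested call from the question, innermost replacement first, so each intermediate string can be inspected.

```php
<?php
// Hypothetical sample input, not from the question.
$texts = '<b>R&amp;D</b> rocks!';

// Step 1 (innermost call): whole tags become spaces.
$step1 = preg_replace('/<[^>]+>/', ' ', $texts);
// $step1 is " R&amp;D  rocks!"

// Step 2: short lowercase entities become spaces.
$step2 = preg_replace('/\&#?[a-z0-9]{2,4}\;/', ' ', $step1);
// $step2 is " R D  rocks!"

// Step 3: runs of anything except letters, digits and "-" collapse to one space.
$step3 = preg_replace('/[^a-z0-9-]+/i', ' ', $step2);
// $step3 is " R D rocks "

// Outermost calls: lowercase, then split on single spaces.
$words = explode(' ', strtolower($step3));
print_r($words);
// $words is ["", "r", "d", "rocks", ""]
```

The empty strings at both ends come from the leading and trailing spaces, so the result may need filtering (for example with array_filter()) depending on how it is used.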
answered 2013-03-19T23:34:10.073