python - heavy regex - really time consuming

Question

I wrote the following regex to detect opening and closing script tags in an html file:

<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>

In short: <script NOT</s > NOT</s </script>

It works, but it takes a really long time to detect <script>, even minutes or hours for long strings.

The lite version works perfectly even for long strings:

<script[^<]*>[^<]*</script>

However, I also use the extended pattern for other tags such as <a>, where < and > can appear inside attribute values.

A python test for you:

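import re
# Same pattern as above, compiled case-insensitively with '.' matching newlines.
pattern = re.compile(r'<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>', re.I | re.DOTALL)
# Short input: returns quickly.
pattern.search('11<script type="text/javascript"> easy>example</script>22').group()
# Long input: this is the call that takes minutes or hours.
pattern.search('<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()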

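How can I fix it? The inner part of the regex (after <script>) should be changed and simplified.

PS :) I anticipate answers about this being the wrong approach, such as using regex for html parsing. I know many html/xml parsers very well, and even better, I know what to expect from frequently broken html code, and regex is really useful here.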

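Comment: well, I need to handle things like:
every <a < document like this.border="5px;">
and the approach is to use parsers and regex together. BeautifulSoup is only 2k lines, it does not handle every html, and it just extends the regex from sgmllib.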

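The main reason is that I must know exactly where every tag starts and stops, and every broken html must be handled.
BS is not perfect; sometimes this happens: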
BeautifulSoup('< scriPt\n\n>a<aa>s< /script>').findAll('script') == []

@Cylian: atomic grouping, as you know, is not available in python's re.
So the non-greedy match-everything .*? up to <\s*/\s*tag\s*> is the winner for now.

I know it is not perfect in this case: re.search(r'<\s*script.*?<\s*/\s*script\s*>', '< script </script> shit </script>').group() but I can handle the rejected tail in the next parsing pass.
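
A minimal sketch of that lazy approach, assuming a hypothetical helper named tag_pattern (not part of the original post) that builds the whitespace-tolerant pattern for a given tag name. The match span gives the start and stop offsets, and whatever follows the match is the tail left for a later pass.

```python
import re

def tag_pattern(tag):
    """Hypothetical helper: lazy, whitespace-tolerant pattern for one tag."""
    name = re.escape(tag)
    return re.compile(
        rf"<\s*{name}.*?<\s*/\s*{name}\s*>",
        re.IGNORECASE | re.DOTALL,
    )

# The broken markup from the BeautifulSoup example above.
text = '< scriPt\n\n>a<aa>s< /script>'
match = tag_pattern("script").search(text)

start, stop = match.span()   # offsets of the whole element in the input
tail = text[stop:]           # anything left over for the next parsing pass
print(start, stop, repr(tail))
```

Because both quantifiers are lazy and there is no quantified group around them, the amount of backtracking stays modest even on long inputs.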

It is pretty obvious that parsing html with regex is not a fight that is won in a single battle.

score 3 · Accepted Answer

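Use an HTML parser such as beautifulsoup.

See the top answer to "Can I remove script tags with beautifulsoup?".

If your only tool is a hammer, every problem starts to look like a nail. Regular expressions are a powerful hammer, but they are not always the best solution for a given problem.

I guess you want to remove scripts from user-posted HTML for security reasons. If security is the main concern, a regex is hard to get right, because there are many things an attacker can alter to fool your regex that most browsers will still happily evaluate. A specialized parser is easier to use, performs better, and is safer.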
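
As a rough illustration of the parser route, here is a sketch that uses the standard library's html.parser instead of BeautifulSoup (a swapped-in choice, so that nothing needs installing). The class name ScriptLocator is made up for this example. It records where each script tag starts and ends, which is the position information the question asks for; it makes no claim to cope with every kind of broken markup.

```python
from html.parser import HTMLParser

class ScriptLocator(HTMLParser):
    """Illustrative subclass: records (kind, (line, column)) for script tags."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            # getpos() returns (line number, column offset) of the current tag.
            self.events.append(("start", self.getpos()))

    def handle_endtag(self, tag):
        if tag == "script":
            self.events.append(("end", self.getpos()))

locator = ScriptLocator()
locator.feed('11<script type="text/javascript"> easy>example</script>22')
locator.close()

for kind, position in locator.events:
    print(kind, position)
```

The parser lowercases tag names before calling the handlers, so the comparison against "script" is already case-insensitive.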

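If you are still thinking "why can't I use regex", read this answer pointed out in mayhewr's comment. I could not put it better; the guy nailed it, and his 4433 votes are well deserved.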

score 2 · Accepted Answer

I don't know python, but I know regular expressions:

If you use the greedy/non-greedy operators, you get a much simpler regex:

<script.*?>.*?</script>

This assumes there are no nested scripts.
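
For reference, a small Python sketch of this lazy pattern, run against the two test strings from the question. The flags are an assumption carried over from the question's own code: IGNORECASE for tag case and DOTALL so that '.' also matches newlines inside a script body.

```python
import re

# Lazy pattern from this answer, compiled with the question's flags.
LAZY_SCRIPT = re.compile(r"<script.*?>.*?</script>", re.IGNORECASE | re.DOTALL)

easy = '11<script type="text/javascript"> easy>example</script>22'
hard = '<script type="text/javascript">' + ("hard example" * 50) + "</script>"

easy_match = LAZY_SCRIPT.search(easy)
hard_match = LAZY_SCRIPT.search(hard)

print(easy_match.span())            # start and stop offsets of the element
print(hard_match.group() == hard)   # the long input matches as a whole
```

Note that the lazy opening part stops at the first '>', so a '>' inside an attribute value would end the opening tag early; the overall match still runs to the first closing tag.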

score 0 · Accepted Answer

The problem with the pattern is backtracking. Using atomic groups can solve this. Change your pattern to this:

<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
       ^^^                             ^^^

Explanation

<!--
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>

Match the characters “<script” literally «<script»
Python does not support atomic grouping «(?>[^<]+?|<(?:[^/]|/(?:[^s])))*»
   Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+?»
      Match any character that is NOT a “<” «[^<]+?»
         Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))»
      Match the character “<” literally «<»
      Match the regular expression below «(?:[^/]|/(?:[^s]))»
         Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
            Match any character that is NOT a “/” «[^/]»
         Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
            Match the character “/” literally «/»
            Match the regular expression below «(?:[^s])»
               Match any character that is NOT a “s” «[^s]»
Match the character “>” literally «>»
Python does not support atomic grouping «(?>[^<]+|<(?:[^/]|/(?:[^s]))*)»
   Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+»
      Match any character that is NOT a “<” «[^<]+»
         Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))*»
      Match the character “<” literally «<»
      Match the regular expression below «(?:[^/]|/(?:[^s]))*»
         Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
         Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
            Match any character that is NOT a “/” «[^/]»
         Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
            Match the character “/” literally «/»
            Match the regular expression below «(?:[^s])»
               Match any character that is NOT a “s” «[^s]»
Match the characters “</script>” literally «</script>»
-->
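
Python's re module only gained the (?>...) syntax in Python 3.11, so on older versions the pattern above will not compile. A different technique that works on any version is to drop the inner '+' so that the outer '*' is the only quantifier looping over the text, which removes the many equivalent ways of splitting a run of characters. The sketch below is not the pattern from this answer; the name FLAT_SCRIPT is made up, and it assumes the intended meaning of the body is "anything that does not contain </s".

```python
import re

# Each loop iteration consumes either one non-'<' character or a '<' that
# does not begin '</s', so there is only one way to walk through the text.
FLAT_SCRIPT = re.compile(
    r"<script(?:[^<]|<(?:[^/]|/[^s]))*>"   # opening tag, up to the last usable '>'
    r"(?:[^<]|<(?:[^/]|/[^s]))*"           # body without '</s'
    r"</script>",
    re.IGNORECASE | re.DOTALL,
)

easy = '11<script type="text/javascript"> easy>example</script>22'
hard = '<script type="text/javascript">' + ("hard example" * 50) + "</script>"
longer = '<script type="text/javascript">' + ("hard example" * 1000) + "</script>"

easy_match = FLAT_SCRIPT.search(easy)
hard_match = FLAT_SCRIPT.search(hard)
longer_match = FLAT_SCRIPT.search(longer)

print(easy_match.group())
print(hard_match.group() == hard, longer_match.group() == longer)
```

The greedy opening part still has to back off from the closing tag to the last '>' before it, but it does so one step at a time, so the work grows roughly with the length of the input instead of exponentially.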
