regex - PCRE regular expression syntax

Question

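I guess this is more or less a two-part question, but first the basics: I'm writing some PHP that uses preg_match_all to find variables in a string that are enclosed in {}. It then iterates over each returned string and replaces the string it found with data from a MySQL query.

The first question is: are there any good sites for really learning the ins and outs of PCRE expressions? I've done a lot of searching on Google, but the best I've been able to find so far is http://www.regular-expressions.info/. In my opinion the information there isn't well organized, and since I'd rather not have to ask for help whenever I need to write a complex regex, please point me to a couple of sites (or a couple of books!) that will help me not have to bother you folks in the future.

The second question is: I have this regex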

"/{.*(_){1}(.*(_){1}[a-z]{1}|.*)}/"

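I need it to catch instances such as {first_name}, {last_name}, {email}, and so on. There are three problems with this regex.

The first is that it sees "{first_name} {last_name}" as one string, when it should see it as two. I've been able to work around this by checking for the existence of a space and then exploding on the space. Messy, but it works.

The second problem is that it includes punctuation as part of the captured string. So if you have "{first_name} {last_name},", it returns the comma as part of the string. I've been able to partially solve this by simply using preg_replace to delete periods, commas, and semicolons. While that works for those punctuation marks, my logic can't handle exclamation points, question marks, and everything else.

The third problem I have with this regex is that it doesn't see instances of {email} at all.

Now, if you can, are willing, and have the time to simply hand me the solution to this problem, thank you, as that will solve my immediate problem. But even if you can do that, please also provide an lmgtfy pointing to good sites for reference and/or a book or two that would give a good education on the subject. Sites would be preferable as money is tight, but if a book is the solution, I'll find the money (assuming my local library system can't get hold of said volume).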

score 3 · Accepted Answer

Back then I found PHP's own PCRE syntax reference quite good: http://uk.php.net/manual/en/reference.pcre.pattern.syntax.php

Let's talk about your expression. It's quite a bit more verbose than necessary; I'm going to simplify it while we go through this.

A rather simpler way of looking at what you're trying to match: "find a {, then any number of letters or underscores, then a }". A regular expression for that is (in PHP's string-y syntax): '/\{[a-z_]+\}/'

This will match all of your examples but also some wilder ones like {__a_b}. If that's not an option, we can go with a somewhat more complex description: "find a {, then a bunch of letters, then (as often as possible) an underscore followed by a bunch of letters, then a }". In a regular expression: /\{[a-z]+(_[a-z]+)*\}/

This second one maybe needs a bit more explanation. Since we want to repeat the thing that matches _foo segments, we need to put it in parentheses. Then we say: try finding this as often as possible, but it's also okay if you don't find it at all (that's the meaning of *).
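To make the difference between the simple pattern and the stricter letters-and-underscores pattern concrete, here is a minimal sketch. It uses Python's re module as a stand-in for PHP (an assumption made so the example is easy to run; patterns this simple should behave the same way in preg_match_all). The sample text and the helper full_matches are made up for illustration.

```python
import re


def full_matches(pattern, text):
    """Return the full text of every match, ignoring capture groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]


SIMPLE = r"\{[a-z_]+\}"            # letters and underscores in any order
STRICT = r"\{[a-z]+(_[a-z]+)*\}"   # letters, then zero or more "_letters" segments

# Hypothetical sample text, not taken from the question.
text = "Dear {first_name} {last_name}, write to {email}! Odd one: {__a_b}"

simple_hits = full_matches(SIMPLE, text)
strict_hits = full_matches(STRICT, text)

print(simple_hits)  # includes the odd {__a_b} placeholder
print(strict_hits)  # rejects {__a_b} because it starts with an underscore
```

Neither pattern picks up the comma or the exclamation mark, because punctuation is simply not in the character class.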

So now that we have something to compare your attempt to, let's have a look at what caused your problems:

Your expression matches any characters inside the {}, including } and { and a whole bunch of other things. In other words, {abcde{_fgh} would be accepted by your regex, as would {abcde} fg_h {ijkl}.
You've got a mandatory _ in there, right after the first .*. The (_){1} (which means exactly the same as _) says: whatever happens, explode if this ain't here! Clearly you don't actually want that, because it'll never match {email}.
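Both problems are easy to reproduce. The sketch below feeds a few strings to the pattern from the question, again using Python's re module as an assumed stand-in for PHP's PCRE functions; the helper first_match and the test strings are illustrative only.

```python
import re

# The pattern from the question, without the PHP delimiters.
ORIGINAL = r"{.*(_){1}(.*(_){1}[a-z]{1}|.*)}"


def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# The greedy .* runs from the first { to the last }, so two placeholders
# (and any punctuation between them) come back as one string.
two_in_one = first_match(ORIGINAL, "{first_name} {last_name}")
with_comma = first_match(ORIGINAL, "{first_name} {last_name}, {email}")

# Stray braces inside are accepted as well.
nested = first_match(ORIGINAL, "{abcde{_fgh}")

# No underscore means no match at all.
email_only = first_match(ORIGINAL, "{email}")

print(two_in_one, with_comma, nested, email_only, sep="\n")
```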

Here's a complete description in plain language of what your regex matches:

Match a {.
Match absolutely anything as long as you can match all the remaining rules right after that anything.
Match a _.
Match absolutely anything as long as you can match all the remaining rules right after that anything.
Match a _.
Match a single letter.
Instead of that anything, the _ and the single letter, absolutely anything is okay, too.
Match a }.
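The same walk-through can be written down as an annotated pattern. This is only a sketch: it uses Python's re.VERBOSE mode (PCRE has a similar x modifier), the braces are escaped purely for readability, and the sample strings are invented. It checks that the annotated form behaves like the compact pattern from the question.

```python
import re

ANNOTATED = re.compile(r"""
    \{              # a literal opening brace
    .*              # anything at all, as much as possible
    (_){1}          # a mandatory underscore; same as a bare _
    (
        .*          # again anything at all
        (_){1}      # another underscore
        [a-z]{1}    # exactly one lowercase letter
      |             # ... or, instead of the three parts above,
        .*          # anything at all
    )
    \}              # a literal closing brace
""", re.VERBOSE)

# The pattern exactly as written in the question (minus PHP delimiters).
COMPACT = re.compile(r"{.*(_){1}(.*(_){1}[a-z]{1}|.*)}")


def first_match(regex, text):
    """Return the text of the first match, or None if nothing matches."""
    m = regex.search(text)
    return m.group(0) if m else None


samples = ["{first_name}", "{email}", "{first_name} {last_name},", "{a_b_c}"]
annotated_results = [first_match(ANNOTATED, s) for s in samples]
compact_results = [first_match(COMPACT, s) for s in samples]

print(annotated_results)
```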

This is probably pretty far from what you wanted. Don't worry, though. Regular expressions take a while to get used to. I think it's very helpful if you think of it in terms of instructions, i.e. when building a regular expression, try to build it in your head as a "find this, then find that", etc. Then figure out the right syntax to achieve exactly that.

This is hard mainly because not all instructions you might come up with in your head easily translate into a piece of a regular expression... but that's where experience comes in. I promise you that you'll have it down in no time at all... if you are fairly methodical about making your regular expressions at first.

Good luck! :)

score 1 · Accepted Answer

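For PCRE, I simply digested the PCRE manpages, but then that's how my brain works anyway...

As for matching delimited stuff, you generally have two approaches:

1. Match the first delimiter, match anything that is not the closing delimiter, match the closing delimiter.
2. Match the first delimiter, match anything non-greedily, match the closing delimiter.

For example, in your case:

1. \{([^}]+)\}
2. \{(.+?)\} - note the ? after the +

I added a group around what you would likely want to extract.

Note also that in the case of #1 in particular, but also for #2 if "dot matches anything" is in effect (dotall, singleline, or whatever your favorite regex flavor calls it), they will also match line breaks inside. If that would be a problem, you'll need to exclude them manually, along with anything else you don't want; see the answer above if you want something more like a whitelist approach.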
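A small sketch of the two approaches side by side, including the line-break caveat. It uses Python's re module as an assumed stand-in for PCRE (re.DOTALL plays the role of the dotall/singleline mode), and the sample strings are made up.

```python
import re

NEGATED = r"\{([^}]+)\}"   # approach 1: anything that is not the closing brace
LAZY = r"\{(.+?)\}"        # approach 2: anything, matched non-greedily

text = "Dear {first_name} {last_name}, write to {email}!"

# With a single group, findall returns just the captured names.
negated_names = re.findall(NEGATED, text)
lazy_names = re.findall(LAZY, text)

# A placeholder with a line break inside.
multiline = "{first\nname}"
negated_multiline = re.findall(NEGATED, multiline)       # [^}] happily matches \n
lazy_multiline = re.findall(LAZY, multiline)             # . stops at \n by default
lazy_dotall = re.findall(LAZY, multiline, re.DOTALL)     # unless dotall is on

print(negated_names, lazy_names)
print(negated_multiline, lazy_multiline, lazy_dotall)
```

To keep line breaks out with approach 1, the class could be widened to something like [^}\n], at which point a whitelist such as [a-z_] is usually easier to reason about.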

score 0 · Accepted Answer

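Here is a good regex site.
Here is a PCRE regex that will work: \{\w+\}

Here's how it works: it basically looks for a {, then one or more word characters, then a }. Interestingly, the word character class actually includes the underscore as well. \w is essentially shorthand for [A-Za-z0-9_]

So it will basically match any combination of those characters inside braces, and because of the plus sign it will only match non-empty braces.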
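As a rough sketch of how this pattern could drive the replacement step described in the question, the example below swaps each placeholder for a value from a dictionary. Several things here are assumptions rather than part of the original: it is written in Python instead of PHP, a capture group is added around \w+ to pull out the name, re.ASCII is set so that \w means [A-Za-z0-9_], and the row dictionary stands in for data that would come from a MySQL query.

```python
import re

# \{\w+\} with an added group so the name inside the braces can be extracted.
PLACEHOLDER = re.compile(r"\{(\w+)\}", re.ASCII)


def fill_template(template, values):
    """Replace each {name} with values[name]; leave unknown names untouched."""
    def replace(match):
        name = match.group(1)
        return str(values.get(name, match.group(0)))
    return PLACEHOLDER.sub(replace, template)


# Hypothetical row, standing in for a database result.
row = {"first_name": "Ada", "last_name": "Lovelace", "email": "ada@example.com"}
template = "Dear {first_name} {last_name}, we will write to {email}! {} and {unknown} stay as they are."

filled = fill_template(template, row)
print(filled)
```

The empty braces are skipped because + requires at least one word character, and the punctuation around each placeholder is preserved because it is never part of a match.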
