regex - What does this regular expression mean, and why

Question

sed "s/\(^[a-z,0-9]*\)\(.*\)\( [a-z,0-9]*$\)/\1\2 \1/g" desired_file_name

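I would appreciate it even if you only explain part of it, or at least spell it out in words, like s\alphanumerical_at_start\something\alphanumerical_at_end\something_else\global

Can someone explain what this means, why, and why all regular expressions are so... awful?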

I know it replaces the last lowercase alphanumeric word with the first one. But can you explain what is going on here? What is with all the /\ and \(.*\)\ stuff?

I'm just lost.

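Edit: This is what I get from (^[a-z0-9]*): start with a through z and 0 through 9; and [a-z,0-9]*$ is the same, but for the last word (but does [0-9,a-z] match only the first 2 characters / the first character, or the whole word?). Also: what does the * or \(.*\)\ even mean?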

score 2 · Accepted Answer

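This is a sed search and replace, of the form s/search/replace/flags. The only flag is g, which makes the search/replace global: if the pattern matches more than once on a line, every match is replaced, not just the first one.

First, here is the regular expression it searches for: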

\(^[a-z,0-9]*\)\(.*\)\( [a-z,0-9]*$\)

Or, in a more readable format:

\(             # start capture group 1
  ^              # match at the beginning of the line
  [a-z,0-9]*     # zero or more alphanumeric or comma characters (lowercase only)
\)             # end capture group 1
\(             # start capture group 2
  .*             # zero or more of any character (except for newlines)
\)             # end capture group 2
\(             # start capture group 3
  [ ]            # literal ' ' character (I added brackets for clarity)
  [a-z,0-9]*     # zero or more alphanumeric or comma characters (lowercase only)
  $              # match at the end of the line
\)             # end capture group 3

Here is the replacement:

\1\2 \1

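This replaces the entire line (because of the ^ and $ anchors in the regex) with the contents of capture group 1, followed by the contents of capture group 2, followed by a space, followed by the contents of capture group 1.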
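To see the three groups at work, here is a minimal sketch that feeds one sample line through the command from the question. The sample text and the variable name result are made up for illustration, and it assumes a sed that treats ^ and $ as anchors inside \( \) (GNU sed does).

```shell
# Single quotes keep the shell from touching the backslashes.
# For "foo bar baz": group 1 = "foo", group 2 = " bar", group 3 = " baz".
result=$(printf 'foo bar baz\n' |
  sed 's/\(^[a-z,0-9]*\)\(.*\)\( [a-z,0-9]*$\)/\1\2 \1/g')

# Group 3 is dropped and group 1 is written in its place.
printf '%s\n' "$result"   # prints: foo bar foo
```

Because .* is greedy, group 2 takes as much as it can and then gives characters back until the last " word" on the line can still fit into group 3.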

score 1 · Accepted Answer

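\(^[a-z,0-9]*\) - zero or more alphanumerics or commas at the start of the line (group 1)
\(.*\) - zero or more of any character (group 2)
\( [a-z,0-9]*$\) - a space followed by zero or more alphanumerics or commas [I guess the comma is just a mistake], up to the end of the line (group 3)
\1\2 \1 - replace with (group 1)(group 2) space (group 1)
g - globally, i.e. for every match on a line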
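The comma inside the brackets is easy to overlook, so here is a small sketch of what it changes. The input line and the variable names with_comma and without_comma are invented for the example, and the second command uses [a-z0-9] on the assumption that this is what the author of the original command meant.

```shell
# As written: the comma is part of the character class, so group 1 is "a,b".
with_comma=$(printf 'a,b c d\n' |
  sed 's/\(^[a-z,0-9]*\)\(.*\)\( [a-z,0-9]*$\)/\1\2 \1/g')

# Without the comma: group 1 stops at the comma and is just "a".
without_comma=$(printf 'a,b c d\n' |
  sed 's/\(^[a-z0-9]*\)\(.*\)\( [a-z0-9]*$\)/\1\2 \1/g')

printf '%s\n' "$with_comma"      # prints: a,b c a,b
printf '%s\n' "$without_comma"   # prints: a,b c a
```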

score 1 · Accepted Answer

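Regular expressions are a way to describe regular grammars. They do this in a very concise and very efficient way, which is what makes them look complicated.

They are also structured and decodable.

First, there is the sed call.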

sed "{operation}/{expression}/{replacement}/{modifiers}" {argument}

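Notes

sed separates the parts with forward slashes. This means you cannot have unescaped forward slashes in {expression} or {replacement}.
Unlike most other regex dialects, sed (in its default basic mode) uses plain parentheses to match literal parentheses, and escaped parentheses to define capture groups.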
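Both notes can be tried out directly. The following is a sketch with made-up inputs and variable names; the -E option (extended regular expressions) is supported by GNU and BSD sed, but check your local man page if you are on something else.

```shell
# A forward slash inside the expression has to be escaped...
escaped=$(printf 'a/b\n' | sed 's/\//-/')
# ...or you can pick another delimiter after s, here |.
other_delim=$(printf 'a/b\n' | sed 's|/|-|')

# Basic mode: \( \) make a capture group, bare ( ) match literal parentheses.
basic=$(printf 'f(x) y\n' | sed 's/\((x)\)/[\1]/')
# Extended mode (-E): the roles are swapped.
extended=$(printf 'f(x) y\n' | sed -E 's/(\(x\))/[\1]/')

printf '%s\n' "$escaped" "$other_delim"   # prints a-b twice
printf '%s\n' "$basic" "$extended"        # prints f[(x)] y twice
```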

{operation} happens to be s, for substitute.

{expression} is \(^[a-z,0-9]*\)\(.*\)\( [a-z,0-9]*$\), which breaks down as follows:

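\(             # start capture group 1
  ^            # match the start of the string (here: the line)
  [a-z,0-9]*   # match the characters a-z and 0-9 as well as the comma (!), zero or more times
\)             # end capture group 1
\(             # start capture group 2
  .*           # match any character (.), zero or more times (*)
\)             # end capture group 2
\(             # start capture group 3
               # match one space
  [a-z,0-9]*   # match the characters a-z and 0-9 as well as the comma (!), zero or more times
  $            # match the end of the string (here: the line)
\)             # end capture group 3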

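Think about how much code (and time) it would take to write a function that does the same thing, and how little space the regex needs. That is why it is harder to read: it is very compressed.

{replacement} is \1\2 \1. Here \n is called a backreference, where n is the number of a capture group. So this inserts the contents of group 1 and group 2 again, then a space, then the contents of group 1.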
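A smaller substitution makes backreferences easier to see in isolation. This is only an illustration; the input and the variable name swapped are invented.

```shell
# \1 is whatever the first \( \) matched, \2 whatever the second matched.
swapped=$(printf 'hello world\n' | sed 's/\(hello\) \(world\)/\2 \1/')

printf '%s\n' "$swapped"   # prints: world hello
```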

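The {modifiers} part is a g flag, which makes the substitution apply as often as possible on each line. In this particular case it does not make much sense, because the regex above can only match once per line anyway.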
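A quick way to check this is to run a pattern with and without g. The sketch below uses made-up inputs and variable names, first with an unanchored pattern where g matters, then with the anchored regex from the question where it does not.

```shell
# Unanchored pattern: g changes the result.
first_only=$(printf 'aaa\n' | sed 's/a/b/')
every_match=$(printf 'aaa\n' | sed 's/a/b/g')

# Anchored with ^ and $: at most one match per line, so g changes nothing.
with_g=$(printf 'foo bar baz\n' |
  sed 's/\(^[a-z,0-9]*\)\(.*\)\( [a-z,0-9]*$\)/\1\2 \1/g')
without_g=$(printf 'foo bar baz\n' |
  sed 's/\(^[a-z,0-9]*\)\(.*\)\( [a-z,0-9]*$\)/\1\2 \1/')

printf '%s\n' "$first_only" "$every_match"   # prints: baa, then bbb
printf '%s\n' "$with_g" "$without_g"         # prints: foo bar foo twice
```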

score 1 · Accepted Answer

s/\(^[a-z,0-9]*\)\(.*\)\( [a-z,0-9]*$\)/\1\2 \1/g

s -> substitute
/ -> begin of regex
\( -> begin of a first field( accessed as \1 later)
^  -> from the beginning of line in data
[a-z,0-9] -> list of characters which will be compared, lowercase a through z, comma, and 0 through 9
* -> zero or more times
\) -> end of \1 field
\( -> begin of \2
.* -> . means any character. .* means any character zero or more times
\) -> end of \2
\( [a-z,0-9]*$ -> begin of \3, which is a space, followed by zero or more a-z, comma, 0-9, up to the end of line ($)
\) -> end of \3 field
/ -> end of regex to replace

/ -> begin of the text to replace with
\1\2 \1 -> first field followed by second field followed by a space and again the first field
/ -> end of the text to replace with

g -> globally
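Since * means zero or more times, the first field is allowed to be empty, and the whole regex still needs a space somewhere to match at all. The two edge cases below are a sketch with invented inputs and variable names, assuming GNU sed.

```shell
# Line starts with an uppercase letter: \1 matches nothing, so the last
# word is replaced by an empty string and a trailing space is left behind.
empty_first=$(printf 'Foo bar\n' |
  sed 's/\(^[a-z,0-9]*\)\(.*\)\( [a-z,0-9]*$\)/\1\2 \1/g')

# Line without any space: \3 cannot match, so sed leaves the line alone.
no_space=$(printf 'foo\n' |
  sed 's/\(^[a-z,0-9]*\)\(.*\)\( [a-z,0-9]*$\)/\1\2 \1/g')

printf '[%s]\n' "$empty_first"   # prints: [Foo ]
printf '[%s]\n' "$no_space"      # prints: [foo]
```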
