regex - Optimizing a regular expression that captures city and address

Question

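I am working with a text file that contains many lines of data. The format I was given is nasty, but it is consistent, which is why I want to use RegEx here.

Each attribute is separated by whitespace (5 spaces), starting with the state, then the city, then the user type, then the user's address (followed by the number of years they have been at that address), then a GUID. I have altered the addresses for security purposes, but every line follows the same format: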

[{     OH     Crestline     Reseller     (1234 Alvarez Dr., 4)     a6fa960c-921a-40e6-a5ab-30cc7fb83907     }]
[{     AZ     Marana     Distributor     (1234 Union St., >1)     1f2a9252-cbac-4e17-8d4c-d5eaebb5f6b7     }]
[{     MI     Lansing     Reseller     (1234 Westmore Ave., 11)     5736c1c0-2e23-43cd-8765-c48fbe51ffee     }]

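What I am interested in here is capturing the city and the address together with the number of years. I wrote the following regular expression to do that: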

\[\{[ ]{5}[A-Z]{1,}[ ]{5}([A-Za-z]{1,})[ ]{5}(?:Reseller|Distributor){1,}[ ]{5}\(([0-9]{1,}[ ][A-Za-z]{1,}[ ][A-Za-z.,]{1,}[ ][>0-9]{1,})

Using the expression above on the first line of the sample data, the regex captures Crestline in group 1 and 1234 Alvarez Dr., 4 in group 2.
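For anyone who wants to reproduce this, here is a minimal sketch in Python (my choice of language, the question does not name one) that runs the expression above over the three sample lines. The names LINES, ORIGINAL and capture_original are made up for illustration.

```python
import re

# Sample lines copied from the question (fields separated by 5 spaces).
LINES = [
    "[{     OH     Crestline     Reseller     (1234 Alvarez Dr., 4)     a6fa960c-921a-40e6-a5ab-30cc7fb83907     }]",
    "[{     AZ     Marana     Distributor     (1234 Union St., >1)     1f2a9252-cbac-4e17-8d4c-d5eaebb5f6b7     }]",
    "[{     MI     Lansing     Reseller     (1234 Westmore Ave., 11)     5736c1c0-2e23-43cd-8765-c48fbe51ffee     }]",
]

# The expression from the question, split over two raw strings for readability.
ORIGINAL = re.compile(
    r"\[\{[ ]{5}[A-Z]{1,}[ ]{5}([A-Za-z]{1,})[ ]{5}(?:Reseller|Distributor){1,}[ ]{5}"
    r"\(([0-9]{1,}[ ][A-Za-z]{1,}[ ][A-Za-z.,]{1,}[ ][>0-9]{1,})"
)


def capture_original(line):
    """Return (city, address_with_years), or None if the line does not match."""
    m = ORIGINAL.search(line)
    return m.groups() if m else None


for line in LINES:
    print(capture_original(line))
```

Group 1 is the city and group 2 is the address with the years, so this gives a baseline to compare any shorter pattern against.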

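My question:

Is there a more concise or cleaner way to write this expression so that it still captures these two pieces of information from each line?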

score 1 · Accepted Answer

You can write a shorter and more efficient expression like this:

\[\{\s{5}[A-Z]+\s{5}(\w+)[^\(]+\(([^,]+),[^0-9]+([0-9]+)\)[^\}]+\}\]

This captures the city name in group 1, the street address in group 2, and the number of years he/she has spent at that address in group 3.
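A rough sketch of how the three groups come out, assuming Python's re module (the answer itself is not tied to a language). PATTERN, SAMPLE and parse are illustrative names, not part of the answer.

```python
import re

# The three-group pattern from this answer, unchanged.
PATTERN = re.compile(
    r"\[\{\s{5}[A-Z]+\s{5}(\w+)[^\(]+\(([^,]+),[^0-9]+([0-9]+)\)[^\}]+\}\]"
)

# Two of the sample lines from the question.
SAMPLE = [
    "[{     OH     Crestline     Reseller     (1234 Alvarez Dr., 4)     a6fa960c-921a-40e6-a5ab-30cc7fb83907     }]",
    "[{     AZ     Marana     Distributor     (1234 Union St., >1)     1f2a9252-cbac-4e17-8d4c-d5eaebb5f6b7     }]",
]


def parse(line):
    """Return (city, street, years) as strings, or None when there is no match."""
    m = PATTERN.search(line)
    return m.groups() if m else None


for line in SAMPLE:
    print(parse(line))
```

One thing to be aware of: [^0-9]+ consumes everything between the comma and the first digit, so for the ">1" line group 3 holds just 1 and the > is dropped. If the > matters, the pattern in the question (which keeps it inside the capture) behaves differently.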

score 1 · Accepted Answer

I would use:

\[\{\s{5}[A-Z]{2}\s{5}(.+?)\s{5}.+?\s{5}\(([^)]+)\)

The city will be in group 1, and the address with the years will be in group 2.

Explanation:

The regular expression:

(?-imsx:\[\{\s{5}[A-Z]{2}\s{5}(.+?)\s{5}.+?\s{5}\(([^)]+)\))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  \[                       '['
----------------------------------------------------------------------
  \{                       '{'
----------------------------------------------------------------------
  \s{5}                    whitespace (\n, \r, \t, \f, and " ") (5
                           times)
----------------------------------------------------------------------
  [A-Z]{2}                 any character of: 'A' to 'Z' (2 times)
----------------------------------------------------------------------
  \s{5}                    whitespace (\n, \r, \t, \f, and " ") (5
                           times)
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    .+?                      any character except \n (1 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \s{5}                    whitespace (\n, \r, \t, \f, and " ") (5
                           times)
----------------------------------------------------------------------
  .+?                      any character except \n (1 or more times
                           (matching the least amount possible))
----------------------------------------------------------------------
  \s{5}                    whitespace (\n, \r, \t, \f, and " ") (5
                           times)
----------------------------------------------------------------------
  \(                       '('
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [^)]+                    any character except: ')' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  \)                       ')'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

score 1 · Accepted Answer

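You say the format is consistent, so you can drop the format validation from the pattern. Judging by the kind of data, you can probably also assume that ( will not appear anywhere except right before the address. In that case you can condense it a lot: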

[ ]{5}.+?[ ]{5}([^ ]+).+\(([^)]+)

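Breakdown:

[ ]{5}.+?[ ]{5} - skips past two separate runs of 5 spaces (the middle part is non-greedy, to make sure only the first two runs are consumed)
([^ ]+) - captures a run of non-space characters (this is the city)
.+\( - skips ahead until a ( is found (being greedy, it stops at the last one on the line)
([^)]+) - captures everything inside the parentheses (this is the address with the years)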
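The breakdown above can be tried out with a short Python sketch (again, Python is my assumption). It applies the pattern one line at a time: the pattern is unanchored and [^ ] also matches a newline, so running it across the whole file in one go could let a match start in the tail of one line and spill into the next. COMPACT, TEXT and extract are illustrative names.

```python
import re

# The condensed pattern from this answer, unchanged.
COMPACT = re.compile(r"[ ]{5}.+?[ ]{5}([^ ]+).+\(([^)]+)")

# The three sample lines from the question, as they would appear in the file.
TEXT = """\
[{     OH     Crestline     Reseller     (1234 Alvarez Dr., 4)     a6fa960c-921a-40e6-a5ab-30cc7fb83907     }]
[{     AZ     Marana     Distributor     (1234 Union St., >1)     1f2a9252-cbac-4e17-8d4c-d5eaebb5f6b7     }]
[{     MI     Lansing     Reseller     (1234 Westmore Ave., 11)     5736c1c0-2e23-43cd-8765-c48fbe51ffee     }]
"""


def extract(text):
    """Collect (city, address_with_years) for every line that matches."""
    rows = []
    for line in text.splitlines():
        m = COMPACT.search(line)  # first match per line only
        if m:
            rows.append((m.group(1), m.group(2)))
    return rows


for city, address in extract(TEXT):
    print(city, "|", address)
```

Since this version does no format validation, it relies entirely on the file really being as consistent as the question says.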
