python - Regular expression to conditionally extract a named group

Question

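I need to write a Python-flavored regular expression that extracts a field conditionally. These are the two kinds of test strings I need to extract from: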

 http://domain/string1/path/field_to_extract/path/filename
 http://domain/string2/path/90020_10029/path/filename

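These are my requirements:

For string2, we should select only the digits in the fourth position, between the slash (/) and the underscore (_).
For all other strings, we should select the entire text between the slashes (/) in the fourth position.

I wrote the following regular expression: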

(?i)^(?:[^ ]*(?: {1,2})){6}(?:[a-z]+://)(?:[^ /:]+[^ /]/:]+[^ /]+/[^ /]+/)?(?:[^ /]+/){2}(?P<field_name>(?<=/string2/)(?:[^/]+/)([^_]+)|((?<!/string2/)(?:[^/]+/)([^/]+)))

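Although the conditional extraction seems to work, this regex also matches the string that precedes the extracted field. For example, on the first test string it matches path/field_to_extract, and on the second test string it matches path/90020.

I added non-capturing groups before the required field, but that does not seem to help.

Please help me get the regular expression right.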

score 2 · Accepted Answer

How about using split() instead of a complex regex:

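url = 'http://domain/string2/path/90020_10029/path/filename'
s = url.split('/')  # ['http:', '', 'domain', 'string2', 'path', '90020_10029', ...]
if len(s) > 5:  # s[5] must exist
    string1or2 = s[3]
    field = s[5]

    if string1or2 == 'string2':
        # string2: keep only the digits before the underscore
        print(field.split('_')[0])
    else:
        # anything else: keep the whole segment
        print(field)
else:
    raise ValueError("Incorrect URL")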

score 2 · Accepted Answer


Try the pattern '//[^/]+/[^/]+/[^/]+/(\d+(?=_)|[^/]+)'
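A minimal sketch of how this pattern could be applied with Python's re module. The helper name extract_field is hypothetical, not part of the answer. Note that this pattern does not check for string2 at all: it returns the leading digits whenever the segment starts with digits followed by an underscore, whatever the earlier path component is.

```python
import re

# Pattern from the answer; the capture group tries "digits followed by _"
# first and falls back to the whole path segment.
PATTERN = re.compile(r'//[^/]+/[^/]+/[^/]+/(\d+(?=_)|[^/]+)')

def extract_field(url):
    """Return the captured segment, or None if the URL does not match.

    extract_field is a hypothetical helper name used for illustration.
    """
    match = PATTERN.search(url)
    return match.group(1) if match else None

urls = [
    'http://domain/string1/path/field_to_extract/path/filename',
    'http://domain/string2/path/90020_10029/path/filename',
]

for url in urls:
    print(extract_field(url))
```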

answered 2013-08-27T08:02:01.487

score 0 · Accepted Answer

A pure regex solution:

import re

urls = [
    r'''http://domain/string1/path/field_to_extract/path/filename''',
    r'''http://domain/string2/path/90020_10029/path/filename'''
]

for url in urls:
    print(re.search(r'(?<![:/])/(?:(string2)|[^/]*)/[^/]*/((?(1)[^_]*|[^/]*))', url).group(2))

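Explanation:

(?<![:/])/: matches a slash that is not preceded by another slash or a colon.

(?:(string2)|[^/]*)/: matches the literal "string2" or anything else. In the first case it is saved as group 1, so that the conditional yes-no pattern can be applied later.

[^/]*/: matches the second part of the path. Nothing interesting here.

((?(1)[^_]*|[^/]*)): if group 1 took part in the match, matches up to the first _ ([^_]*). Otherwise matches up to the next slash ([^/]*).

It yields: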

field_to_extract
90020
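Since the question asks for a named group, the same idea can be sketched with (?P<field_name>...) wrapped around the field only. In the question's regex the named group also encloses the non-capturing (?:[^/]+/) prefix, and a group's text includes everything matched inside it, which is why path/ showed up in the result. The helper name extract_field_name is hypothetical, and using [^_/]* instead of [^_]* is an assumption made here so the match cannot run past the next slash when the segment has no underscore.

```python
import re

# Same logic as the pattern above, with the result exposed as a named group.
# Assumption: [^_/]* replaces [^_]* so the capture stops at the next slash.
NAMED = re.compile(
    r'(?<![:/])/(?:(string2)|[^/]*)/[^/]*/'
    r'(?P<field_name>(?(1)[^_/]*|[^/]*))'
)

def extract_field_name(url):
    """Return the text of the field_name group, or None if nothing matches.

    extract_field_name is a hypothetical helper name used for illustration.
    """
    match = NAMED.search(url)
    return match.group('field_name') if match else None

urls = [
    'http://domain/string1/path/field_to_extract/path/filename',
    'http://domain/string2/path/90020_10029/path/filename',
]

for url in urls:
    print(extract_field_name(url))
```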
