python - Python regex alternation

Question

I am trying to find all links on a web page of the form "http://something" or "https://something". I wrote a regex and it works:

L = re.findall(r"http://[^/\"]+/|https://[^/\"]+/", site_str)

But is there a shorter way to write this? I repeat ://[^/\"]+/ twice, probably without any need. I tried various things, but none of them work. I tried:

L = re.findall(r"http|https(://[^/\"]+/)", site_str)
L = re.findall(r"(http|https)://[^/\"]+/", site_str)
L = re.findall(r"(http|https)(://[^/\"]+/)", site_str)

Clearly I am missing something here, or I don't understand Python regexes well enough.

score 10 · Accepted Answer

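You are using capturing groups, and .findall() changes its behaviour when they are present (it returns only the contents of the capturing groups). Your regex can be simplified, but your versions will work if you use a non-capturing group instead: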

L = re.findall(r"(?:http|https)://[^/\"]+/", site_str)
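To make the difference concrete, here is a small sketch that runs the three attempts from the question next to the non-capturing version. The sample string is an assumption chosen for illustration, not the asker's actual page. Note that the first attempt fails for a second reason: alternation binds more loosely than concatenation, so it reads as "http" or "https(...)".

```python
import re

# Hypothetical sample input, not taken from the original question.
site_str = '"http://someserver.com/" "https://anotherserver.com/with/path"'

# Attempt 1: | has the lowest precedence, so this means
# "http" OR "https(://[^/\"]+/)". The bare "http" branch wins at every
# URL, and the group that did not take part comes back as ''.
attempt1 = re.findall(r"http|https(://[^/\"]+/)", site_str)

# Attempt 2: one capturing group, so findall returns only that group.
attempt2 = re.findall(r"(http|https)://[^/\"]+/", site_str)

# Attempt 3: two capturing groups, so findall returns tuples of groups.
attempt3 = re.findall(r"(http|https)(://[^/\"]+/)", site_str)

# Non-capturing group: no groups to report, so findall returns whole matches.
fixed = re.findall(r"(?:http|https)://[^/\"]+/", site_str)

print(attempt1)  # ['', '']
print(attempt2)  # ['http', 'https']
print(attempt3)  # [('http', '://someserver.com/'), ('https', '://anotherserver.com/')]
print(fixed)     # ['http://someserver.com/', 'https://anotherserver.com/']
```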

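If you use single quotes around the expression, you don't need to escape the double quote. And since only the s in the expression varies, s? will work too: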

L = re.findall(r'https?://[^/"]+/', site_str)

Demo:

>>> import re
>>> example = '''
... "http://someserver.com/"
... "https://anotherserver.com/with/path"
... '''
>>> re.findall(r'https?://[^/"]+/', example)
['http://someserver.com/', 'https://anotherserver.com/']
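
If you would rather keep a capturing group (for example, to read the scheme separately), one possible alternative is re.finditer(), which yields match objects so the full match is always available through group(0). This is a sketch reusing the demo text above, not something the original answer proposed; the variable names are illustrative.

```python
import re

# Same sample text as in the demo above.
example = '''
"http://someserver.com/"
"https://anotherserver.com/with/path"
'''

# The scheme is captured in group 1; the whole match is still group 0.
pattern = re.compile(r'(https?)://[^/"]+/')

links = [m.group(0) for m in pattern.finditer(example)]    # full matches
schemes = [m.group(1) for m in pattern.finditer(example)]  # captured scheme only

print(links)    # ['http://someserver.com/', 'https://anotherserver.com/']
print(schemes)  # ['http', 'https']
```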
