python - Using Python re to find URLs containing x

Question

Using Python 2.7.3, urllib and re, I am looking for URLs that contain:

href="/dirone/Dir_Two/dirthree/"

where the full URL could be, for example:

href="/dirone/Dir_Two/dirthree/5678-random-stuff-here-letters-and-numbers"

I want to get back:

"/dirone/Dir_Two/dirthree/5678-random-stuff-here-letters-and-numbers"

Using this tool:

http://www.jslab.dk/tools.regex.php

I generated this regex:

/^href\="\/dirone\/Dir_Two\/dirthree\/"$/im

So, can this regex be used with Python and re in the following way:

object_name = re.findall('/^href\="\/dirone\/Dir_Two\/dirthree\/"$/im',url)
for single_url in object_name:
    do something

score 2 · Accepted Answer

You really want to drop the ^ anchor; I doubt that href will ever appear at the start of a line.

You don't need the /im part; replace it with re. flag constants. What you have there is Perl regex syntax, and Python has no special /.../flags syntax.
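To illustrate the point about flags, here is a minimal sketch. The sample html string is made up for this example and is not from the question. It shows that Python treats the Perl-style delimiters and trailing flags as literal characters, whereas re flag constants are passed as a separate argument.

```python
import re

# Hypothetical sample input, invented for illustration.
html = '<a HREF="/dirone/Dir_Two/dirthree/">link</a>'

# Perl-style: the leading "/" and trailing "/im" are literal text to Python,
# so this searches for a string that does not occur in the input.
perl_style = re.findall(r'/href="/dirone/Dir_Two/dirthree/"/im', html)

# Python-style: flags are re constants, combined with | if several are needed.
python_style = re.findall(r'href="/dirone/Dir_Two/dirthree/"', html, re.I | re.M)

print(perl_style)
print(python_style)
```

Here re.M is included only to show how flags combine; it has no effect on a pattern without ^ or $.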

The pattern also has far too much escaping, and it is not written as an actual Python string. And you didn't actually include the 5678-random-stuff-here-letters-and-numbers part.

Use this instead:

object_name = re.findall(r'href="(/dirone/Dir_Two/dirthree/[^"/]*)"', url, re.I)

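I removed the multiline flag because, with the ^ anchor gone, it no longer has any effect. I added a group ((...)) around the path so that findall() returns just those paths instead of the whole match. The [^"/]* part matches any character except a quote or a slash, so it captures the filename part but will not match into another directory name.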

Short demo:

>>> import re
>>> example = '<a href="/dirone/Dir_Two/dirthree/5678-random-stuff-here-letters-and-numbers">'
>>> re.findall(r'href="(/dirone/Dir_Two/dirthree/[^"/]*)"', example, re.I)
['/dirone/Dir_Two/dirthree/5678-random-stuff-here-letters-and-numbers']
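The effect of the group and of the [^"/]* class can be seen side by side in a small sketch. The two sample links are invented for illustration: one sits directly under dirthree, the other one directory deeper.

```python
import re

# Hypothetical sample input: a direct child and a link one level deeper.
html = ('<a href="/dirone/Dir_Two/dirthree/5678-abc">one</a>'
        '<a href="/dirone/Dir_Two/dirthree/sub/9-xyz">two</a>')

# Without a group, findall() returns each whole match, href=" prefix included.
whole = re.findall(r'href="/dirone/Dir_Two/dirthree/[^"/]*"', html, re.I)

# With a group, findall() returns only the captured path.
paths = re.findall(r'href="(/dirone/Dir_Two/dirthree/[^"/]*)"', html, re.I)

print(whole)
print(paths)
```

The deeper link is skipped in both cases, because [^"/]* stops at the slash after sub and the pattern then requires a closing quote.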

score 2 · Accepted Answer

Similar to Martijn's answer, but using BeautifulSoup, on the assumption that what you have is HTML.

data = '<a href="/dirone/Dir_Two/dirthree/5678-random-stuff-here-letters-and-numbers">Content</a>'

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
print [el['href'] for el in soup('a', href=re.compile('^/dirone/Dir_Two/.*'))]
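Note that the filter pattern above accepts anything under /dirone/Dir_Two/, not only dirthree. As far as I know, BeautifulSoup applies a compiled pattern to each attribute value with search(), so the filter can be tried out with plain re. The sketch below does that on a made-up list of href values; the stricter pattern is an assumption about the intent (direct children of dirthree only), not part of the answer.

```python
import re

# Hypothetical href values, invented for illustration.
hrefs = [
    '/dirone/Dir_Two/dirthree/5678-abc',
    '/dirone/Dir_Two/other/1',
    '/elsewhere/x',
]

# Pattern from the answer: any path below /dirone/Dir_Two/.
loose = re.compile('^/dirone/Dir_Two/.*')

# Assumed stricter variant: only names directly under dirthree.
strict = re.compile('^/dirone/Dir_Two/dirthree/[^/]*$')

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

print(loose_hits)
print(strict_hits)
```

Either compiled pattern can be passed as the href= argument in the snippet above.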
