python - Improve robots regex

Question

I wrote the following regular expression to extract the links from a robots file:

re.compile(r"/\S+(?:\/+)")

It gives me the following results:

/includes/
/modules/
/search/
/?q=user/password/
/?q=user/register/
/node/add/
/logout/
/?q=admin/
/themes/
/?q=node/add/
/admin/
/?q=comment/reply/
/misc/
//example.com/
//example.com/site/
/profiles/
//www.robotstxt.org/wc/
/?q=search/
/user/password/
/?q=logout/
/comment/reply/
/?q=filter/tips/
/?q=user/login/
/user/register/
/user/login/
/scripts/
/filter/tips/
//www.sxw.org.uk/computing/robots/

How can I exclude the links that start with two slashes, such as:

 //www.sxw.org.uk/computing/robots/
 //www.robotstxt.org/wc/
 //example.com/
 //example.com/site/

Any ideas?

score 1 · Accepted Answer


I would suggest simply adding an if condition:

 if not line.startswith('//'):
     # then do something with the link here
     pass
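
To make the suggestion concrete, here is a minimal sketch that keeps the regex from the question and filters its matches with `startswith`. The names `LINK_RE`, `extract_paths`, and the sample text are made up for illustration; they are not from the original post.

```python
import re

# Regex from the question, unchanged in behavior.
LINK_RE = re.compile(r"/\S+(?:/+)")

def extract_paths(text):
    """Return the regex matches that do not start with a double slash."""
    return [m for m in LINK_RE.findall(text) if not m.startswith("//")]

# Hypothetical sample input: a comment containing a URL plus a few rules.
sample = """# see http://www.robotstxt.org/wc/robots.html
User-agent: *
Disallow: /includes/
Disallow: /?q=admin/
Disallow: /user/login/
"""

paths = extract_paths(sample)
print(paths)
```

The URL in the comment line is still matched by the regex as `//www.robotstxt.org/wc/`, and the `startswith` check is what discards it afterwards.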

answered 2012-07-02T07:38:32.133

score 1 · Accepted Answer

Assuming each string to match sits on its own line, as in the example, we can anchor the regex and use a negative lookahead:

^(?!//)/\S+(?:\/+)

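Be sure to set the regex modifier that makes ^ match at the start of every line (in Python, re.MULTILINE or the inline (?m) flag).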

My Python is rusty, but this should do it:

import re

for match in re.finditer(r"(?m)^(?!//)/\S+(?:/+)", subject):
    start = match.start()  # match start
    end = match.end()      # match end (exclusive)
    text = match.group()   # matched text
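
As a quick check of the anchored pattern, the sketch below runs it over a few of the lines from the question joined into one string. The names `PATTERN`, `subject`, and `kept` are only illustrative, and the input is assumed to have exactly one candidate link per line.

```python
import re

# Anchored pattern from the answer; (?m) makes ^ match at each line start,
# and (?!//) rejects any line that begins with two slashes.
PATTERN = re.compile(r"(?m)^(?!//)/\S+(?:/+)")

# A few of the result lines from the question, one link per line.
subject = "\n".join([
    "/includes/",
    "//example.com/",
    "/?q=user/login/",
    "//www.robotstxt.org/wc/",
    "/node/add/",
])

kept = [m.group() for m in PATTERN.finditer(subject)]
print(kept)
```

Because of the `^` anchor, a line such as `Disallow: /includes/` produces no match at all, so this approach only fits input where each link starts its own line.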
