python - Parsing URLs from a txt file

Question

I am trying to parse a txt file that looks like this:

Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html

I need to read the file and extract the URL part that follows "Disallow", while ignoring the comments. Thanks in advance.

score 5 · Accepted Answer

If you are trying to parse a robots.txt file, you should use the robotparser module (urllib.robotparser in Python 3):

>>> from urllib import robotparser  # Python 3; in Python 2 this was "import robotparser"

>>> r = robotparser.RobotFileParser()
>>> r.set_url("http://www.your_url.com/robots.txt")
>>> r.read()

Then just check:

>>> r.can_fetch("*", "/foo.html")
False
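If the rules are already in a local file rather than at a URL, the same parser can be fed lines directly through its parse() method. The following is a minimal sketch under Python 3, not part of the original answer. It assumes the file also contains a `User-agent: *` line, which the excerpt in the question does not show; the parser ignores Disallow rules that do not follow a User-agent line. The `ROBOTS_TXT` string stands in for the file contents.

```python
from urllib import robotparser

# Stand-in for the contents of the local file. The "User-agent: *" line
# is an assumption: the excerpt in the question does not include one.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html
"""

parser = robotparser.RobotFileParser()
# parse() takes an iterable of lines, so no network request is made.
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("*", "/foo.html"))    # disallowed path
print(parser.can_fetch("*", "/index.html"))  # path with no matching rule
```

Note that this answers "may this URL be fetched?" rather than returning the list of disallowed paths.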

score 1 · Accepted Answer

Assuming there is no # in the URL:

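with open('path/to/file') as infile:
    # str.lstrip() strips a set of characters, not a prefix, so split on the first ":" instead
    URLs = [line.split(":", 1)[1].split("#", 1)[0].strip()
            for line in infile if line.startswith("Disallow:")]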

To allow # inside the URL, assuming that comments start with # and are separated from the URL by a space:

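with open('path/to/file') as infile:
    # only " #" (space followed by hash) starts a comment, so a "#" inside the URL is kept
    URLs = [line.split(":", 1)[1].split(" #", 1)[0].strip()
            for line in infile if line.startswith("Disallow:")]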
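The same idea can be wrapped in a small reusable function. This is only a sketch: the name `disallowed_paths` is hypothetical, and it assumes that the field name may appear in any letter case and that a comment begins with a space followed by `#`.

```python
def disallowed_paths(lines):
    """Return the paths from 'Disallow:' lines, with trailing comments removed."""
    paths = []
    for line in lines:
        # Split "Field: value" at the first colon only.
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "disallow":
            continue  # not a Disallow line (blank, comment, other field)
        # Assumption: a comment starts at " #", so "#" inside a URL survives.
        path = value.split(" #", 1)[0].strip()
        if path:
            paths.append(path)
    return paths


sample = [
    "Disallow: /cyberworld/map/ # This is an infinite virtual URL space",
    "Disallow: /tmp/ # these will soon disappear",
    "Disallow: /foo.html",
]
print(disallowed_paths(sample))
# ['/cyberworld/map/', '/tmp/', '/foo.html']
```

To use it on a file, pass the open file object, for example `disallowed_paths(open('path/to/file'))`, since iterating a file yields its lines.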
