python - Python's robotparser ignores sitemaps

Question

I have the following robots.txt:

User-agent: *
Disallow: /images/
Sitemap: http://www.example.com/sitemap.xml

and the following robot parser:

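import robotparser  # Python 2 module; urllib.robotparser in Python 3
import urlparse     # Python 2 module; urllib.parse in Python 3

def init_robot_parser(URL):
    robot_parser = robotparser.RobotFileParser()
    # Resolve robots.txt relative to the site URL.
    robot_parser.set_url(urlparse.urljoin(URL, "robots.txt"))
    robot_parser.read()

    return robot_parser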

But when I put a print robot_parser just above the return robot_parser, all I get is:

User-agent: *
Disallow: /images/

Why is it ignoring the Sitemap line? Am I missing something?

score 3 · Accepted Answer

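Sitemap is an extension to the standard, and robotparser doesn't support it. You can see in the source that it only processes "user-agent", "disallow", and "allow". For its current job (telling you whether a particular URL is allowed), it has no need to understand Sitemap.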
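If all you need is the Sitemap URLs, one workaround is to scan the robots.txt text yourself. The following is a minimal sketch in Python 3, not part of robotparser: the helper name extract_sitemaps and the ROBOTS_TXT sample are made up for illustration, and it assumes you have already fetched the file as text.

```python
def extract_sitemaps(robots_txt):
    """Return the URLs from Sitemap lines in robots.txt text (hypothetical helper)."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        # Assumption: sitemap URLs contain no '#' fragment.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, since the URL itself contains colons.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps


# Sample taken from the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Sitemap: http://www.example.com/sitemap.xml
"""

print(extract_sitemaps(ROBOTS_TXT))
# ['http://www.example.com/sitemap.xml']
```

On Python 3.8 and later, urllib.robotparser.RobotFileParser also has a site_maps() method that returns the Sitemap entries (or None if there are none), so the manual scan is mainly useful on older versions.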

score 1 · Accepted Answer

You can use Reppy (https://github.com/seomoz/reppy) to parse robots.txt; it handles sitemaps.

Keep in mind, though, that some sites have a sitemap at the default location (/sitemaps.xml) that the site owner never mentions in robots.txt (toucharcade.com, for example).
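One way to cover that case is to build a list of candidate URLs: whatever robots.txt declares, followed by the default locations. This is only a sketch in Python 3; the function name candidate_sitemap_urls is hypothetical, and the default paths are an assumption (the /sitemaps.xml path named above plus the commonly used /sitemap.xml). Each candidate still has to be fetched to see whether it exists.

```python
from urllib.parse import urljoin


def candidate_sitemap_urls(site_url, declared,
                           default_paths=("/sitemaps.xml", "/sitemap.xml")):
    """Return declared sitemap URLs, then default locations not already listed.

    default_paths is an assumption; adjust it for the sites you crawl.
    """
    candidates = list(declared)
    for path in default_paths:
        # A leading slash makes urljoin resolve against the site root.
        url = urljoin(site_url, path)
        if url not in candidates:
            candidates.append(url)
    return candidates


print(candidate_sitemap_urls("http://www.example.com/blog/", []))
# ['http://www.example.com/sitemaps.xml', 'http://www.example.com/sitemap.xml']
```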

I have also found at least one site whose sitemap is compressed: its robots.txt points to a .gz file.
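A compressed sitemap can be handled by checking the downloaded bytes for the gzip magic number before decoding. The sketch below is Python 3 and assumes the sitemap body is already in memory as bytes and is UTF-8 encoded; decode_sitemap is a hypothetical helper name.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def decode_sitemap(payload, encoding="utf-8"):
    """Return sitemap text, decompressing first if the payload is gzipped."""
    if payload[:2] == GZIP_MAGIC:
        payload = gzip.decompress(payload)
    # Assumption: the sitemap is UTF-8 unless the caller says otherwise.
    return payload.decode(encoding)


xml = b"<urlset><url><loc>http://www.example.com/</loc></url></urlset>"
# Both the compressed and the plain payload decode to the same text.
print(decode_sitemap(gzip.compress(xml)) == decode_sitemap(xml))
# True
```

Checking the content rather than the .gz extension also covers servers that serve a compressed file under a plain .xml name.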
