python - 如何使用 Python 查找（并抓取）给定域上的所有网页？

Question

我将如何抓取域以查找所有网页和内容？

例如：www.example.com、www.example.com/index.html、www.example.com/about/index.html 等等。

如果可能的话，我想用 Python 来做这件事，最好用 Beautiful Soup..

score 0 · Accepted Answer

你不能。页面不仅可以基于后端数据库数据和搜索查询或您的程序提供给网站的其他输入动态生成，而且有几乎无限的可能页面列表，并且知道哪些页面存在的唯一方法是测试看看。

您可以获得的最接近的方法是根据页面内容本身的页面之间的超链接来抓取网站。

score 0 · Accepted Answer

您可以使用 Python 库报纸
安装使用sudo pip3 install newspaper3k
您可以抓取特定网站上的所有文章。

from newspaper import Article
url = "http://www.example.com"
built_page = newspaper.build( url )
print("%d articles in %s\n\n"%(built_page.size(), url))
for article in built_page.articles:
    print(article.url)

从那里您可以使用Article 对象 API从页面获取各种信息，包括原始 HTML。

python - 如何使用 Python 查找（并抓取）给定域上的所有网页？

2 回答 2

Related

Reference