python - How do I get all the URLs of a website using a crawler or scraper?

Question

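I have to collect many URLs from a website and then copy them into an Excel file. I am looking for an automated way to do this. The site is structured as a main page with about 300 links, and each of those links leads to a page containing 2 or 3 links that interest me. Any suggestions?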

score 1 · Accepted Answer

If you want to develop your solution in Python, I can recommend the Scrapy framework.

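As for getting the data into an Excel sheet, there are ways to do it directly; see, for example: Insert row into Excel spreadsheet using openpyxl in Python. You can also write the data to a CSV file and then import it into Excel.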
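The CSV route mentioned above can be sketched with the standard library's csv module. This is only a minimal illustration: the function name write_urls_csv, the column names, and the idea of storing (source page, found URL) pairs are assumptions made for the example, not something from the original answer.

```python
import csv


def write_urls_csv(rows, path):
    """Write (source_page, url) pairs to a CSV file that Excel can import.

    Both the function name and the two-column layout are hypothetical
    choices made for this sketch.
    """
    # newline="" lets the csv module control line endings itself,
    # as the csv documentation recommends for files opened for writing.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["source_page", "url"])  # header row
        writer.writerows(rows)  # one line per collected link
```

Calling write_urls_csv([("http://example.com/a", "http://example.com/a/1")], "urls.csv") would produce a two-line file that can be opened or imported in Excel.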

score 1 · Accepted Answer

If the links are in the HTML, you can use Beautiful Soup. It has worked for me in the past.

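from urllib.request import urlopen  # Python 3; this module was urllib2 in Python 2
from bs4 import BeautifulSoup

page = 'http://yourUrl.com'  # placeholder: replace with the page to scan
opened = urlopen(page)
# Name the parser explicitly so bs4 does not have to pick one itself
soup = BeautifulSoup(opened, 'html.parser')

# Print the href attribute of every <a> tag on the page
for link in soup.find_all('a'):
    print(link.get('href'))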
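For the two-level structure described in the question (a main page whose links each lead to a page with a few interesting links), the same idea can be applied twice. The following is a rough sketch, not a tested crawler: it uses only the standard library (html.parser instead of Beautiful Soup) so that it runs without extra packages, and the names LinkCollector, extract_links, fetch, crawl_two_levels, and the wanted filter are all hypothetical. It has no rate limiting, error handling, or robots.txt check, which a real crawl of about 300 pages would probably need.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for the links in html, in order, without duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    unique = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if absolute not in unique:
            unique.append(absolute)
    return unique


def fetch(url):
    """Download a page and decode it, replacing undecodable bytes."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_two_levels(start_url, wanted, fetch_page=fetch):
    """Visit every link on start_url and keep the sub-links that wanted() accepts.

    Returns a list of (page_url, link) pairs. fetch_page can be swapped
    out, for example to test the function without network access.
    """
    results = []
    for page_url in extract_links(fetch_page(start_url), start_url):
        for link in extract_links(fetch_page(page_url), page_url):
            if wanted(link):
                results.append((page_url, link))
    return results
```

The wanted argument is where the "links that interest me" rule would go, for instance lambda url: url.endswith('.pdf'), assuming the interesting links can be recognized from the URL alone.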

score 0 · Accepted Answer

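You can use Beautiful Soup for the parsing: http://www.crummy.com/software/BeautifulSoup/

More information is in the documentation here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

I would not suggest Scrapy, because the task you describe in your question does not need it.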

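For example, this code uses the urllib2 library (urllib.request in Python 3) to open the Google home page and finds all the links in the response, returned as a list.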

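from urllib.request import urlopen  # Python 3 name for what was urllib2
from bs4 import BeautifulSoup

# Download the page, then parse it with the built-in HTML parser
data = urlopen('http://www.google.com').read()
soup = BeautifulSoup(data, 'html.parser')
# find_all returns a list of every <a> tag in the document
print(soup.find_all('a'))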

To work with Excel files, take a look at http://www.python-excel.org

score 0 · Accepted Answer

Have you tried selenium or urllib? urllib is faster than selenium: http://useful-snippets.blogspot.in/2012/02/simple-website-crawler-with-selenium.html
