I'm trying to use this Python code to extract links from a blog:
#!/usr/bin/env python

"""
Extract all links from a web page
=================================
Author: Laszlo Szathmary, 2011 (jabba.laci@gmail.com)
Website: https://pythonadventures.wordpress.com/2011/03/10/extract-all-links-from-a-web-page/
GitHub: https://github.com/jabbalaci/Bash-Utils

Given a webpage, extract all links.

Usage:
------
./get_links.py <URL>
"""

import sys
import urllib
import urlparse

from BeautifulSoup import BeautifulSoup


class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'


def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)

    text = page.read()
    page.close()

    soup = BeautifulSoup(text)

    for tag in soup.findAll('a', href=True):
        tag['href'] = urlparse.urljoin(url, tag['href'])
        print tag['href']
# process(url)


def main():
    if len(sys.argv) == 1:
        print "Jabba's Link Extractor v0.1"
        print "Usage: %s URL [URL]..." % sys.argv[0]
        sys.exit(1)
    # else, if at least one parameter was passed
    for url in sys.argv[1:]:
        process(url)
# main()

#############################################################################

if __name__ == "__main__":
    main()
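The links come from a blog whose main category is blog.xx/Music/. The script pulls the links from that category, such as blog.xx/this_album_name/, but what I want are the links inside a class called quote on the sub-pages beneath the category.
How do I parse the links in the Music category and have BS follow each title link, then extract the links inside the quote class on the page it leads to?
i.e. blog.xx/Category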
blog.xx/post1.html
blog.xx/post2.html
Each of the post pages above contains a quote block, and that block holds the links I want to grab.
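The two-step crawl described above (category page, then each post, then only the links inside the quote block) could be sketched roughly as follows. This is only an illustration under several assumptions: it targets Python 3 and uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs; the names LinkCollector, extract_links, fetch_html and crawl_category are made up for this sketch; and treating every category link that ends in ".html" as a post is a guess based on the blog.xx/post1.html example, not something known about the real blog.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Assumption: any browser-like User-Agent will do; this value is illustrative.
USER_AGENT = "Mozilla/5.0"


class LinkCollector(HTMLParser):
    """Collect absolute hrefs of <a> tags, optionally only inside one class."""

    def __init__(self, base_url, inside_class=None):
        super().__init__()
        self.base_url = base_url
        self.inside_class = inside_class
        self.links = []
        self._open_tag = None  # tag name of the matching element we are inside
        self._depth = 0        # nesting depth of that same tag name

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._open_tag is None:
            # Not inside a matching element yet: does this tag carry the class?
            classes = (attrs.get("class") or "").split()
            if self.inside_class in classes:
                self._open_tag, self._depth = tag, 1
        elif tag == self._open_tag:
            self._depth += 1  # nested element with the same tag name
        in_scope = self.inside_class is None or self._open_tag is not None
        if tag == "a" and in_scope and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._depth -= 1
            if self._depth == 0:
                self._open_tag = None  # left the matching element


def extract_links(html, base_url, inside_class=None):
    """Return absolute links found in html, restricted to inside_class if given."""
    parser = LinkCollector(base_url, inside_class)
    parser.feed(html)
    parser.close()
    return parser.links


def fetch_html(url):
    """Download a page and return its text (UTF-8 is assumed)."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_category(category_url, quote_class="quote",
                   is_post=lambda url: url.endswith(".html"),
                   get_html=fetch_html):
    """Follow each post link on the category page; collect links in the quote block."""
    found = []
    seen = set()
    for post_url in extract_links(get_html(category_url), category_url):
        # Assumption: is_post decides which category links are posts.
        if post_url in seen or not is_post(post_url):
            continue
        seen.add(post_url)
        found.extend(extract_links(get_html(post_url), post_url, quote_class))
    return found
```

Calling crawl_category("http://blog.xx/Music/") would then return the quote-block links from every post as one list. With BeautifulSoup the inner step should look similar in spirit: find the element whose class is quote (in bs4, soup.find_all(class_="quote")) and collect the a tags with an href inside it.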
I'm new to Python and BS and have tried a few variations, but at this point I need help. Thanks.