I have a Python program that prints out the links found on a site. It looks like this:
import urllib
import re
import mechanize
import urlparse
url = "http://sparkbrowser.com"
#Mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Firefox')]
br.open(url)
for link in br.links():
    newurl = urlparse.urljoin(link.base_url, link.url)
    b1 = urlparse.urlparse(newurl).hostname
    b2 = urlparse.urlparse(newurl).path
    wholeLink = "http://"+b1+b2
    linkTxt = link.text
    print wholeLink
    print linkTxt
This gives me output like the following (I have shortened the results):
http://twitter.com/sparkbrowser
Twitter[IMG]
http://facebook.com/sparkbrowser
Facebook[IMG]
http://www.flickr.com/photos/sparkbrowser
Flickr[IMG]
http://youtube.com/sparkbrowser
Youtube[IMG]
http://vimeo.com/user7123627
Vimeo[IMG]
http://plus.google.com/103169821052890438536
Google[IMG]
http://sparkbrowser.com/index.php
Home
http://sparkbrowser.com/download.php
Download
http://sparkbrowser.com/about.php
About
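How can I exclude the results whose text contains [IMG]?
I have already tried regex and .search(), but I failed. I need something like if link.text != ('*[IMG]') before printing, but I don't know how to implement it correctly...
Any suggestions are welcome!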
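
For reference, here is a minimal sketch of the kind of filter described above; it is one possible approach, not the only one. It works on plain (url, text) pairs so it runs without mechanize or network access. The names IMG_MARKER, has_image_marker, has_image_marker_regex and text_only_links are made up for this illustration, and the sample data is copied from the output above. A plain substring test is enough here. If a regex is preferred, note that [ and ] are regex metacharacters (an unescaped [IMG] is a character class matching a single I, M or G), so the marker has to be escaped, which may be why an earlier regex attempt did not behave as expected.

```python
import re

# Literal marker seen in the link text of image links in the output above.
IMG_MARKER = "[IMG]"  # hypothetical constant name, for illustration only


def has_image_marker(text):
    """Return True if the link text contains the literal marker."""
    # Guard against None in case a link carries no text at all (assumption).
    return text is not None and IMG_MARKER in text


def has_image_marker_regex(text):
    """Same check with a regex; re.escape turns '[IMG]' into a literal pattern."""
    return text is not None and re.search(re.escape(IMG_MARKER), text) is not None


def text_only_links(pairs):
    """Keep only the (url, text) pairs whose text does not contain the marker."""
    return [(url, text) for url, text in pairs if not has_image_marker(text)]


# Sample data taken from the shortened output in the question.
sample = [
    ("http://twitter.com/sparkbrowser", "Twitter[IMG]"),
    ("http://www.flickr.com/photos/sparkbrowser", "Flickr[IMG]"),
    ("http://sparkbrowser.com/index.php", "Home"),
    ("http://sparkbrowser.com/download.php", "Download"),
]

for url, text in text_only_links(sample):
    print(url)
    print(text)
```

In the original loop, the same idea would amount to testing "[IMG]" in link.text and skipping the two print statements when it is true.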