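I'm modifying this script to scrape pages like this one for book page images. Using the script straight from Stack Overflow, it returns all of the images correctly except the one image I actually want. That image comes back as an empty file, named like this: img.php?dir=39d761947ad84e71e51e3c300f7af8ff&file=1.png.
In my modified version below, I only pull the book page image.
Here is my script: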
from bs4 import BeautifulSoup as bs
import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
import os
import sys

out_folder = '/Users/Craig/Desktop/img'

def main(url, out_folder):
    soup = bs(urlopen(url))
    parsed = list(urlparse.urlparse(url))
    for image in soup.findAll('img', id='page_image'):
        print "Image: %(src)s" % image
        filename = image["src"].split("/")[-1]
        parsed[2] = image["src"]
        outpath = os.path.join(out_folder, filename)
        if image["src"].lower().startswith("http"):
            urlretrieve(image["src"], outpath)
        else:
            urlretrieve(urlparse.urlunparse(parsed), outpath)

def _usage():
    print "usage: python dumpimages.py http://example.com [outpath]"

if __name__ == "__main__":
    url = sys.argv[-1]
    if not url.lower().startswith("http"):
        out_folder = sys.argv[-1]
        url = sys.argv[-2]
        if not url.lower().startswith("http"):
            _usage()
            sys.exit(-1)
    main(url, out_folder)
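
One thing I suspect, but have not confirmed, is the way the script rebuilds the URL when the src is relative and carries its own query string, as img.php?dir=...&file=1.png does. Below is a minimal sketch comparing the script's path-swap approach with the standard library's urljoin. The page URL and the helper names (resolve_by_path_swap, resolve_by_urljoin, filename_from_src) are hypothetical and only for illustration; the import fallback is there so it runs on both Python 2 and Python 3.

```python
import os

try:
    from urllib.parse import urljoin, urlparse, urlunparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse, urlunparse, parse_qs  # Python 2


def resolve_by_path_swap(page_url, src):
    """Mimic the script: overwrite the page URL's path with the raw src."""
    parsed = list(urlparse(page_url))
    parsed[2] = src  # the page's own query string stays in parsed[4]
    return urlunparse(parsed)


def resolve_by_urljoin(page_url, src):
    """Resolve src against the page URL the way a browser would."""
    return urljoin(page_url, src)


def filename_from_src(src):
    """Prefer the 'file' query parameter as the local name, else the path."""
    parts = urlparse(src)
    names = parse_qs(parts.query).get("file")
    if names:
        return os.path.basename(names[0])
    return os.path.basename(parts.path)


# Hypothetical page URL; the real one is not shown in this post.
page_url = "http://example.com/books/read.php?id=1"
src = "img.php?dir=39d761947ad84e71e51e3c300f7af8ff&file=1.png"

print(resolve_by_path_swap(page_url, src))
print(resolve_by_urljoin(page_url, src))
print(filename_from_src(src))
```

With this made-up page URL, the path-swap version drops the /books/ directory and appends the page's own ?id=1 after the image's query string, while urljoin keeps the directory and only the image's query. Whether that is what produces the empty file here I can't say; the server could just as well be checking cookies or the Referer header.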
Any ideas?