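I want to build a web scraping utility that captures images as they are meant to be viewed. Some images contain transparent layers and are therefore designed to be seen against a background of a particular color or texture. For such images, I want to take a screenshot but crop it to just the image being scraped, so that the image is captured together with its relevant background.
I am looking at PyQt's QtWebKit module. For those familiar with it, is this module suitable for my needs? Or would a different library or utility be a better fit for this task?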
I would suggest looking at PhantomJS (http://phantomjs.org/). I picture the workflow as: use PhantomJS to capture the entire page and also record the position and size of the image, then use PIL (or even just GraphicsMagick) to crop the captured page down to just that image.
PhantomJS is scripted in JavaScript, but you should only need a few lines of JS code to load the page, find the image in it, query its size and position, and snap the capture.
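The cropping half of that workflow could look roughly like the sketch below. It assumes Pillow (the maintained PIL fork) is installed and that the position and size arrive as a dict with `left`, `top`, `width` and `height` keys; the function names `crop_box` and `crop_capture` are made up for illustration.

```python
import math


def crop_box(attrs):
    """Convert {'left', 'top', 'width', 'height'} into a PIL-style
    (left, upper, right, lower) box, rounding outward to whole pixels."""
    left = math.floor(attrs["left"])
    top = math.floor(attrs["top"])
    right = math.ceil(attrs["left"] + attrs["width"])
    bottom = math.ceil(attrs["top"] + attrs["height"])
    return (left, top, right, bottom)


def crop_capture(capture_path, attrs, out_path):
    """Crop the full-page capture down to the image's bounding box."""
    # Pillow is imported here so crop_box stays usable without it.
    from PIL import Image

    with Image.open(capture_path) as page:
        page.crop(crop_box(attrs)).save(out_path)
```

Rounding outward keeps the whole image in the crop when the reported offsets are fractional, at the cost of possibly including one extra pixel row or column of background.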
EDIT (in response to comment): Sure. You can use jQuery, or other tools of your choice. Here is a short PhantomJS example that opens a page and gets the size/position of an image in it:
var page = require('webpage').create();
page.open(URL, function (status) {
    if (status !== 'success') {
        console.log('Failed to load ' + URL);
        phantom.exit(1);
        return;
    }
    var img_attr = page.evaluate(function () {
        // Assumes the page itself loads jQuery; otherwise inject it first, e.g. with page.injectJs()
        var el = $("img#SpecialID");
        var result = el.offset(); // Returns top, left
        result.width = el.width();
        result.height = el.height();
        return result;
    });
    console.log(JSON.stringify(img_attr)); // Obviously, you'd want to write that to disk instead
    page.render(OUTPUT_FILE);
    phantom.exit(); // Without this the phantomjs process never terminates
});
So, if you change the console.log to write a record to disk, add command line options for URL and OUTPUT_FILE, and maybe add some error handling, you will have a handy utility to call from your Python code.
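Calling such a utility from Python might look like the following sketch. It rests on several assumptions: the script is saved under the hypothetical name `capture.js`, it reads the URL and output file from its command line arguments, it prints the image attributes as a single line of JSON, and `phantomjs` is on the PATH.

```python
import json
import subprocess


def build_command(url, output_file, script="capture.js"):
    """Build the argv list; 'capture.js' is a hypothetical script name."""
    return ["phantomjs", script, url, output_file]


def parse_attrs(stdout_text):
    """Parse the last non-empty stdout line as the JSON attribute record."""
    lines = [line for line in stdout_text.splitlines() if line.strip()]
    return json.loads(lines[-1])


def capture(url, output_file):
    """Run the script and return the image's position and size as a dict."""
    result = subprocess.run(
        build_command(url, output_file),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_attrs(result.stdout)
```

Reading only the last non-empty line is a small guard against stray console output appearing before the JSON record.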
I'd suggest spynner, since you are using Python =)
import spynner
browser = spynner.Browser()
browser.load("http://www.wordreference.com")
browser.snapshot( .... )
browser.close()