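I have a list of URLs and would like to scrape the location object for each of their pages. The data I mean is what you get by typing "window.location" into the browser console. For example, doing this in Chrome on www.github.com gives output similar to the following: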

Location {assign: function, replace: function, reload: function, ancestorOrigins: DOMStringList, origin: "https://github.com"…}

Expanding it shows more information:

Location {
    ancestorOrigins: DOMStringList 
    assign: function () { [native code] } 
    hash: "" 
    host: "github.com" 
    hostname: "github.com" 
    href: "https://github.com/" 
    origin: "https://github.com" 
    pathname: "/" 
    port: "" 
    protocol: "https:" 
    reload: function () { [native code] } 
    replace: function () { [native code] } 
    search: "" 
    toString: function toString() { [native code] } 
    valueOf: function valueOf() { [native code] } 
    __proto__: Location  
}

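I have scraped with Python and the Mechanize library in the past, but I have never needed this functionality until now and do not know how to proceed. Any suggestions would be welcome.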

1 Answer

As far as I understand, you want to run a JavaScript call on each target page. My suggestion is to use a headless browser. I have done similar things with the PyQt4 framework. You could also use another headless browser such as PhantomJS, or you may be interested in a tool called Selenium.
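
For the Selenium route, a minimal sketch might look like the following. It assumes the `selenium` package and a local Chrome install are available. The helper names (`make_headless_chrome`, `fetch_location`, `scrape_locations`) are made up for illustration, and only the plain-string properties of `window.location` are collected.

```python
# Plain-string properties of window.location. The functions and the
# DOMStringList shown in the console are skipped on purpose.
LOCATION_FIELDS = ("hash", "host", "hostname", "href", "origin",
                   "pathname", "port", "protocol", "search")

# JavaScript executed inside the page. Selenium exposes the extra
# arguments passed to execute_script as the `arguments` array.
_SCRIPT = """
var names = arguments[0], out = {};
for (var i = 0; i < names.length; i++) {
    out[names[i]] = window.location[names[i]];
}
return out;
"""


def make_headless_chrome():
    """Start headless Chrome (assumes selenium and Chrome are installed)."""
    from selenium import webdriver  # third-party; imported only when needed
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def fetch_location(driver, url):
    """Load `url` and return the selected window.location fields as a dict."""
    driver.get(url)
    return dict(driver.execute_script(_SCRIPT, list(LOCATION_FIELDS)))


def scrape_locations(driver, urls):
    """Map each URL in `urls` to the location dict of the page it loads."""
    return {url: fetch_location(driver, url) for url in urls}
```

Something like `scrape_locations(make_headless_chrome(), urls)` should then return one dict per URL; remember to call `driver.quit()` on the driver afterwards. Because the page is actually loaded in a browser, the values reflect any redirects that happened, which is something Mechanize alone would not show for JavaScript-driven redirects.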

answered 2013-06-25T00:33:16.517