javascript - Screen scraping a dynamic web page in Python with Ghost.py

Question

from ghost import Ghost

ghost = Ghost()
# Open the page; the URL has to be passed as a quoted string
page, rcs = ghost.open('https://soundcloud.com/passionpit/sets/favorites')
page, rcs = ghost.wait_for_page_loaded()
songs = ghost.evaluate("document.getElementsByClassName('soundTitle__title');")
print songs

I'm trying to use the code above to find every HTML element on that page with the class "soundTitle__title", but so far my output is:

QFont::setPixelSize: Pixel size <= 0 (0)
({PyQt4.QtCore.QString(u'length'): 0.0}, [])

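Can anyone help me see where my problem is? When I run document.getElementsByClassName('soundTitle__title') in the browser console, I get the output I expect, so why is the Python output different?

Alternatively, is there a way to use Ghost.py or a similar library to get the page source after the JavaScript has run (the source you see when inspecting elements with the browser's developer tools)?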

score 4 · Accepted Answer

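I got this working, and I'd suggest using Splinter, which basically just runs phantomjs and selenium under the hood.

You need to run pip install splinter on your machine and install phantomjs, either by downloading and unpacking it or with npm -g install phantomjs if you have npm. Overall, the installation and dependencies are minimal and simple.

The following code returns "Ryn Weaver - OctaHate", which I assume is what you're looking for, although I can't be completely sure without more context.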
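Before running the scraper, it may help to confirm that both prerequisites are actually visible to Python. The snippet below is only a sketch in Python 3 using the standard library; check_prerequisites is a hypothetical helper name, and it assumes the PhantomJS executable is called phantomjs and sits on your PATH.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report whether splinter is importable and phantomjs is on the PATH.

    Hypothetical helper for illustration; it only looks things up and
    does not start a browser.
    """
    return {
        # find_spec returns None when the top-level package is not installed
        "splinter": importlib.util.find_spec("splinter") is not None,
        # which returns None when no executable with that name is found
        "phantomjs": shutil.which("phantomjs") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(name, "found" if found else "missing")
```

If either entry is reported as missing, revisit the corresponding install step before trying the Splinter code.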

from splinter import Browser

browser = Browser('phantomjs')
browser.visit('https://soundcloud.com/passionpit/sets/favorites')
songs = browser.find_by_xpath("//a[contains(@class, 'soundTitle__title')]")
if songs:
    for song in songs:
        print song.text
else:
    print "there aren't any songs"

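You'll also notice I had to use an XPath contains() to match the class you're looking for, so you may run into problems trying to access the class with the notation you used. There is a span element and an anchor element that both contain "soundTitle__title", but as far as I can tell only the 'a' element has text, which I'm guessing is what you're after. If you want both, you can use browser.find_by_xpath("//*[contains(@class, 'soundTitle__title')]").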
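To make the substring matching described above more concrete, here is a rough Python 3 sketch that uses only the standard library. It is not Splinter or a real XPath engine; it just mimics what contains(@class, ...) does on a small made-up fragment. The markup, the extra class names, and the names ClassContainsFinder and find_by_class_contains are all hypothetical, and the real SoundCloud page may look different.

```python
from html.parser import HTMLParser

# Hypothetical markup loosely based on the description above:
# a span and an anchor share the class, but only the anchor has text.
SAMPLE_HTML = """
<div class="sound">
  <span class="soundTitle__title extra-a"></span>
  <a class="soundTitle__title extra-b" href="#">Ryn Weaver - OctaHate</a>
</div>
"""


class ClassContainsFinder(HTMLParser):
    """Collect [tag, text] for elements whose class attribute contains `needle`.

    Sketch only: assumes well-formed markup where every start tag has a
    matching end tag (no unclosed void elements such as <br>).
    """

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.matches = []  # [tag, text] pairs in document order
        self._stack = []   # one entry per open element: match index or None

    def handle_starttag(self, tag, attrs):
        class_value = dict(attrs).get("class") or ""
        if self.needle in class_value:  # substring test, like XPath contains()
            self.matches.append([tag, ""])
            self._stack.append(len(self.matches) - 1)
        else:
            self._stack.append(None)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text belongs to every matched element that is still open
        for index in self._stack:
            if index is not None:
                self.matches[index][1] += data


def find_by_class_contains(html, needle):
    """Return (tag, stripped text) for each element matching the class substring."""
    finder = ClassContainsFinder(needle)
    finder.feed(html)
    return [(tag, text.strip()) for tag, text in finder.matches]


matches = find_by_class_contains(SAMPLE_HTML, "soundTitle__title")
with_text = [text for tag, text in matches if text]

for tag, text in matches:
    print(tag, repr(text))
```

On this fragment both the span and the a element match, but only the a element carries text, which mirrors why the answer narrows the XPath to //a[...] instead of //*[...].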
