python - Getting information from a web page with Ghost.py

Question

I'm working on a project where I need to get information from a web page. I'm using Python and Ghost.py. I saw this code in the documentation:

links = gh.evaluate("""
                    var linksobj = document.querySelectorAll("a");
                    var links = [];
                    for (var i=0; i<linksobj.length; i++){
                        links.push(linksobj[i].value);
                    }
                    links;
                """)

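This code is definitely not Python. What language is it, and where can I learn how to write it? Also, how can I extract a string from a tag? For example, from:

title>this is title of the webpage

how can I get

this is title of the webpage

Thanks.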

score 1 · Accepted Answer

Use requests and BeautifulSoup:

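import requests
from bs4 import BeautifulSoup

# Fetch the page, then parse it. Naming the parser explicitly avoids
# bs4's "no parser was explicitly specified" warning.
r = requests.get("https://www.google.com/")
soup = BeautifulSoup(r.text, "html.parser")

In [3]: soup.title.string
Out[3]: u'Google'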
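
If installing third-party packages is not an option, roughly the same title lookup can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not what the answer above uses; the names TitleParser, extract_title and sample_html are made up for illustration, and the sample page is hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None  # stays None if the page has no <title>

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded; later ones are ignored.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the text of the first <title> element, or None."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


# Hypothetical sample page, used only for illustration.
sample_html = (
    "<html><head><title>this is title of the webpage</title></head>"
    "<body></body></html>"
)
print(extract_title(sample_html))
```

For real-world, messy HTML, BeautifulSoup is usually the more forgiving choice; this sketch only covers the simple case of a well-formed title element.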

score 1 · Accepted Answer

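ghost.py is a WebKit client. It lets you load a web page and interact with its DOM and JavaScript runtime.

This means that once you have everything installed and running, you can simply do this: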

from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://stackoverflow.com/')
if page.http_status == 200:
    result, extra = ghost.evaluate('document.title;')
    print('The title is: {}'.format(result))

score 0 · Accepted Answer

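Edit: After looking at Padraic Cunningham's answer, it seems I unfortunately misunderstood your question. Anyway, I'm leaving my answer here for future reference, or possibly for downvotes. :P

If the output you receive is a string, ordinary string operations in Python can produce the result you describe in your question.

You receive: title>this is title of the webpage

You want: this is title of the webpage

Assuming the output you receive always has the same format, you can use the following string operations to get the desired result. Using split: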

>>> s = 'title>this is title of the webpage'
>>> p = s.split('>')
>>> p
['title', 'this is title of the webpage']
>>> p[1]
'this is title of the webpage'

Here p is a list, so you have to access the element that contains the desired output.
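
One caveat with a plain split('>') is that a title which itself contains '>' gets cut into more than two pieces. A minimal sketch that splits only at the first marker, using str.partition, might look like this; the helper name text_after_marker is hypothetical.

```python
def text_after_marker(s, marker=">"):
    """Return the part of s after the first marker, or s unchanged if absent."""
    head, sep, tail = s.partition(marker)
    # sep is the empty string when the marker was not found.
    return tail if sep else s


s = "title>this is title of the webpage"
print(text_after_marker(s))
```

Because partition stops at the first occurrence, any later '>' characters stay inside the returned text.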

Or, more simply, take a substring:

>>> s = 'title>this is title of the webpage'
>>> p = s[6:]
>>> p
'this is title of the webpage'

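In the snippet above, p = s[6:] means you want the part of the string title>this is title of the webpage that runs from the 7th character (index 6) to the end. In other words, you skip the first 6 characters, which are exactly the prefix title>.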

If the output you receive does not always have the same format, you may prefer to use regular expressions.
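
As a rough sketch of the regular-expression route: the pattern below assumes the input looks like an optional '<', a tag name, '>', the text, and an optional matching closing tag. That format is an assumption for illustration, and the names TAG_TEXT and extract_text are hypothetical.

```python
import re

# Assumed input shape: [<]tagname>text[</tagname>], with optional
# surrounding whitespace. Anything else is treated as "no match".
TAG_TEXT = re.compile(
    r"^\s*<?(?P<tag>\w+)>(?P<text>.*?)(?:</(?P=tag)>)?\s*$",
    re.DOTALL,
)


def extract_text(s):
    """Return the text following the tag name, or None if s does not match."""
    m = TAG_TEXT.match(s)
    return m.group("text") if m else None


print(extract_text("title>this is title of the webpage"))
print(extract_text("<title>this is title of the webpage</title>"))
```

Regular expressions are fine for short, predictable strings like these; for whole HTML documents a real parser is the safer tool.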

Your second question has already been answered in the comments. I hope I understood your question correctly.
