python - Extracting data from the Web

Question

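A real beginner question. I'm working on a small Python script for home use that will collect data on specific airfares.

I want to extract the data from skyscanner (using BeautifulSoup and urllib). Example: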

http://www.skyscanner.net/flights/lond/rome/120922/120929/airfares-from-london-to-rome-in-september-2012.html

I'm interested in all the data stored in this kind of element, especially the price: http://shrani.si/f/1w/An/1caIzEzT/capture.png

Since they are not in the HTML, can I extract them?

score 3 · Accepted Answer

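I think the problem is that these values are rendered by JavaScript code that your browser runs and urllib does not. You should use a library that can execute JavaScript.

I just googled "crawler python javascript" and got some Stack Overflow questions and answers that suggest using selenium or webkit. You can use those libraries through scrapy. Here are two snippets:

Rendered/interactive JavaScript with gtk/webkit/jswebkit

Rendered JavaScript crawler with Scrapy and Selenium RC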
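To make the point about urllib concrete, here is a minimal offline sketch. The HTML string, the "price" id, and the value 123 are made up for illustration and are not Skyscanner's real markup. The code parses the markup the way a plain HTTP client would receive it, without executing the script, so the price element comes back empty.

```python
from html.parser import HTMLParser

# Hypothetical page: the price is filled in by JavaScript after the page loads.
RAW_HTML = """
<html><body>
  <span id="price"></span>
  <script>document.getElementById("price").textContent = "123";</script>
</body></html>
"""


class PriceFinder(HTMLParser):
    """Collects the text found inside <span id="price">."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price_text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("id") == "price":
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        # Script contents also arrive here, but only text inside the span is kept.
        if self.in_price:
            self.price_text += data


def extract_price(html):
    finder = PriceFinder()
    finder.feed(html)
    return finder.price_text.strip()


# The script is never run, so the span is still empty.
print(repr(extract_price(RAW_HTML)))
```

A real browser, or a browser driven by Selenium, would run the script first, and only then would the span contain the price.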

score 1 · Accepted Answer

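I have been working on this exact same issue. I was introduced to BeautifulSoup and later learned about Scrapy. BeautifulSoup is very easy to use, especially if you are new to this. Scrapy apparently has more "features", but I believe you can accomplish what you need with BeautifulSoup.

I ran into the same problem of not being able to access a website that loads its information through JavaScript, and thankfully Selenium was the savior.

A great introduction to Selenium can be found here.

Install: pip install selenium

Below is a simple class I put together. You can save it as a .py file and import it into your project. If you create an instance, call its method retrieve_source_code(self, domain), and pass the URL you are trying to parse, it returns the source code of the fully loaded page, which you can then feed into BeautifulSoup to find the information you are looking for!

Example: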

airfare_url = 'http://www.skyscanner.net/flights/lond/rome/120922/120929/airfares-from-london-to-rome-in-september-2012.html'

soup = BeautifulSoup(SeleniumWebScraper().retrieve_source_code(airfare_url))

Now you can parse soup with BeautifulSoup as you normally would.

Hope that helps!

from selenium import webdriver

class SeleniumWebScraper():

    def __init__(self):
        self.source_code = ''
        self.is_page_loaded = 0
        self.driver = webdriver.Firefox()
        self.is_browser_closed = 0
        # To ensure the page has fully loaded we will 'implicitly' wait 
        self.driver.implicitly_wait(10)  # Seconds

    def close(self):
        self.driver.quit()  # quit() ends the whole browser session, not just the current window
        self.clear_source_code()
        self.is_page_loaded = 0
        self.is_browser_closed = 1

    def clear_source_code(self):
        self.source_code = ''
        self.is_page_loaded = 0

    def retrieve_source_code(self, domain):
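        if self.is_browser_closed:
            # Reopen the browser and restore the state that close() changed
            self.driver = webdriver.Firefox()
            self.driver.implicitly_wait(10)  # Seconds
            self.is_browser_closed = 0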
        # The driver.get method will navigate to a page given by the URL.
        #  WebDriver will wait until the page has fully loaded (that is, the "onload" event has fired)
        #  before returning control to your test or script.
        # It's worth noting that if your page uses a lot of AJAX on load then
        #  WebDriver may not know when it has completely loaded.
        self.driver.get(domain)

        self.is_page_loaded = 1
        self.source_code = self.driver.page_source
        return self.source_code
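The comment in retrieve_source_code notes that WebDriver may not know when an AJAX-heavy page has finished loading. One way to handle that is to poll the page source until the content you need shows up. The helper below is a hypothetical sketch, not part of Selenium or of the class above. It works with any object that has a page_source attribute, and the marker string in the usage comment is an assumption about the page's markup. Selenium also ships its own explicit wait, WebDriverWait in selenium.webdriver.support.ui, which is usually the better choice in a real script.

```python
import time


def wait_for_source(driver, is_ready, timeout=10.0, poll_interval=0.5):
    """Poll driver.page_source until is_ready(source) is true.

    driver is any object with a page_source attribute, for example the
    Selenium driver held by SeleniumWebScraper. Raises TimeoutError if the
    condition is still false once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = driver.page_source
        if is_ready(source):
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "page content did not appear within %s seconds" % timeout
            )
        time.sleep(poll_interval)


# Usage with the class above (the marker 'price' is an assumed placeholder):
#
#   scraper = SeleniumWebScraper()
#   scraper.retrieve_source_code(airfare_url)
#   html = wait_for_source(scraper.driver, lambda src: "price" in src)
#   scraper.close()
```

The returned string is the page source at the moment the condition first held, so it can be passed straight to BeautifulSoup.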

score 0 · Accepted Answer

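You don't even need BeautifulSoup to extract the data.

Just do this and your response is converted into a dictionary, which is very easy to work with.

text = json.loads("the text of your main response content")

You can now print any key-value pair from the dictionary. Give it a try. It is very easy.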
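A minimal sketch of what that looks like, assuming the site returns its data as JSON somewhere (the answer does not say where). The response text, keys, and values below are invented for illustration and are not Skyscanner's real format.

```python
import json

# Hypothetical response body; the keys and values are made up for illustration.
response_text = (
    '{"origin": "LOND", "destination": "ROME", '
    '"quotes": [{"price": 123.0, "direct": true}, {"price": 98.5, "direct": false}]}'
)

# json.loads turns the JSON text into ordinary Python dicts and lists.
data = json.loads(response_text)

# Print every top-level key-value pair.
for key, value in data.items():
    print(key, "=>", value)

# Nested values are reached with normal indexing.
cheapest = min(quote["price"] for quote in data["quotes"])
print("cheapest:", cheapest)
```

This only works when the response body really is JSON. Passing an HTML page to json.loads raises json.JSONDecodeError.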
