
I'm trying to scrape a financial website to build an app that compares the accuracy of financial data from various other sites (Google/Yahoo Finance).

The URL I'm trying to scrape (specifically a stock's "Key Data", such as market cap, volume, etc.) is here:

https://www.marketwatch.com/investing/stock/sbux

I've figured out (with help from others) that a cookie must be built up and sent with each request so that the page displays the data (otherwise the page's HTML response comes back essentially empty).

I used the Opera/Firefox/Chrome browsers to inspect the HTTP headers and requests being sent. I concluded that 3 steps/requests need to be made to receive all of the cookie data and build it up piece by piece.

Step/Request 1

Simply visit the URL above.

GET /investing/stock/sbux HTTP/1.1
Host: www.marketwatch.com:443
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.44

HTTP/1.1 200 OK
Cache-Control: max-age=0, no-cache, no-store
Connection: keep-alive
Content-Length: 579
Content-Type: text/html; charset=utf-8
Date: Sun, 26 Aug 2018 05:12:16 GMT
Expires: Sun, 26 Aug 2018 05:12:16 GMT
Pragma: no-cache

Step/Request 2

I'm not sure where this "POST" URL comes from. However, using Firefox and looking at the network connections, this URL pops up in the "Stack Trace" tab. Again, I have no idea where to obtain this URL, or whether it's the same for everyone or randomly generated. I also don't know what POST data is being sent, or where the values of X-Hash-Result and X-Token-Value come from. However, this request returns a crucial value in its response headers, containing the line: 'Set-Cookie: ncg_g_id_zeta=701c19ee3f45d07b56b40fb8e313214d'. This piece of the cookie is essential for the next request in order to get back the full cookie and receive the page's data.

POST /149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint HTTP/1.1
Host: www.marketwatch.com:443
Accept: */*
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Content-Type: application/json; charset=UTF-8
Origin: https://www.marketwatch.com
Referer: https://www.marketwatch.com/investing/stock/sbux
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.44
X-Hash-Result: 701c19ee3f45d07b56b40fb8e313214d
X-Token-Value: 900c4055-ef7a-74a8-e9ec-f78f7edc363b

HTTP/1.1 200 OK
Cache-Control: max-age=0, no-cache, no-store
Connection: keep-alive
Content-Length: 17
Content-Type: application/json; charset=utf-8
Date: Sun, 26 Aug 2018 05:12:16 GMT
Expires: Sun, 26 Aug 2018 05:12:16 GMT
Pragma: no-cache
Set-Cookie: ncg_g_id_zeta=701c19ee3f45d07b56b40fb8e313214d; Path=/; HttpOnly

Step/Request 3

This request is sent to the original URL with the cookie acquired in step 2. The full cookie is then returned in the response, and it can be used in step 1 to avoid performing steps 2 and 3 again. The full page of data is also returned.

GET /investing/stock/sbux HTTP/1.1
Host: www.marketwatch.com:443
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cookie: ncg_g_id_zeta=701c19ee3f45d07b56b40fb8e313214d
Referer: https://www.marketwatch.com/investing/stock/sbux
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.44

HTTP/1.1 200 OK
Cache-Control: max-age=0, no-cache, no-store
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 62944
Content-Type: text/html; charset=utf-8
Date: Sun, 26 Aug 2018 05:12:17 GMT
Expires: Sun, 26 Aug 2018 05:12:17 GMT
Pragma: no-cache
Server: Kestrel
Set-Cookie: seenads=0; expires=Sun, 26 Aug 2018 23:59:59 GMT; domain=.marketwatch.com; path=/
Set-Cookie: mw_loc=%7B%22country%22%3A%22CA%22%2C%22region%22%3A%22ON%22%2C%22city%22%3A%22MARKHAM%22%2C%22county%22%3A%5B%22%22%5D%2C%22continent%22%3A%22NA%22%7D; expires=Sat, 01 Sep 2018 23:59:59 GMT; domain=.marketwatch.com; path=/
Vary: Accept-Encoding
x-frame-options: SAMEORIGIN
x-machine: 8cfa9f20bf3eb
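The three-step handshake above can be sketched with `requests`. Note this is only a sketch of the captured traffic: the fingerprint path, the X-Hash-Result value, and the X-Token-Value value are copied verbatim from the captures, and their provenance is exactly what's in question below, so a fresh session will almost certainly need different values and an unknown POST body.

```python
import requests

BASE = "https://www.marketwatch.com"
STOCK_PATH = "/investing/stock/sbux"
# Copied verbatim from the browser capture above; provenance unknown,
# so this path is a placeholder and may be session-specific.
FINGERPRINT_PATH = ("/149e9513-01fa-4fb0-aad4-566afd725d1b"
                    "/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint")


def browser_headers():
    """Headers mimicking the captured browser requests."""
    return {
        "Accept": ("text/html,application/xhtml+xml,application/xml;q=0.9,"
                   "image/webp,image/apng,*/*;q=0.8"),
        "Accept-Language": "en-US,en;q=0.9",
        "User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.44"),
    }


def fetch_key_data_page():
    # The Session's cookie jar accumulates every Set-Cookie it sees.
    session = requests.Session()
    # Step 1: initial GET -- returns a near-empty shell.
    session.get(BASE + STOCK_PATH, headers=browser_headers())
    # Step 2: POST to the fingerprint endpoint. The body and the two
    # X-* headers are the unknowns; values here are from the capture.
    session.post(BASE + FINGERPRINT_PATH, json={}, headers={
        "Referer": BASE + STOCK_PATH,
        "X-Hash-Result": "701c19ee3f45d07b56b40fb8e313214d",      # placeholder
        "X-Token-Value": "900c4055-ef7a-74a8-e9ec-f78f7edc363b",  # placeholder
    })
    # Step 3: repeat the GET; ncg_g_id_zeta (if granted) is sent automatically.
    return session.get(BASE + STOCK_PATH, headers=browser_headers())
```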

Summary

In summary, step 2 is the most important part for obtaining the rest of the cookie... but I can't figure out these 3 things:

1) Where the POST URL comes from (it is not embedded in the original page; is the URL the same for everyone, or is it randomly generated by the site?).

2) What data is sent in the POST request?

3) Where do X-Hash-Result and X-Token-Value come from? Do they need to be sent in the request headers?

1 Answer

I tried to get the appended cookie string to work. MarketWatch does a fairly good job of protecting their data. To build the whole cookie you would need a wsj API key (I believe that's their site's financial-data vendor) and some hidden variables that are likely only available to the client's server and strictly withheld depending on your web driver, or lack thereof.

For example, if you try this request: POST https://browser.pipe.aria.microsoft.com/Collector/3.0/?qsp=true&content-type=application/bond-compact-binary&client-id=NO_AUTH&sdk-version= ACT-Web-JS-2.7.1&x-apikey=c34cce5c21da4a91907bc59bce4784fb-42e261e9-5073-49df-a2e1-42415e012bc6-6954

You will receive a 400 unauthorized error.

Also keep in mind that there's a good chance the client host's server cluster master and the various APIs it talks to communicate in ways our browsers cannot pick up from the network traffic. This could be done through some kind of middleware, for example. I believe that may account for the missing X-Hash-Result and X-Token-Value values.

I'm not saying it's impossible to build this cookie string, just that it's an inefficient route in terms of development time and effort. I also now question this approach's scalability across tickers other than AAPL. Unless there is an explicit requirement not to use a web driver, and/or the script needs to be highly portable with no configuration allowed outside of pip install, I wouldn't choose this approach.

That basically leaves us with either a Scrapy Spider or a Selenium scraper (with some extra environment configuration, unfortunately, but these are very important skills to learn if you want to write and deploy web scrapers. In general, requests + bs4 is ideal for simple scrapes or unusual code-portability needs).

I went ahead and wrote a Selenium scraper ETL class for you using the PhantomJS web driver. It accepts a ticker string as a parameter and works on stocks other than AAPL. It was tricky, since marketwatch.com will not redirect traffic from a PhantomJS web driver (I can tell they have spent a lot of resources trying to deter web scrapers, by the way; much more so than yahoo.com).

Anyway, here is the final Selenium script; it runs on both Python 2 and 3:

# Market Watch Test Scraper ETL
# Tested on python 2.7 and 3.5
# IMPORTANT: Ensure PhantomJS Web Driver is configured and installed

import pip
import sys
import signal
import time


# Package installer function to handle missing packages
def install(package):
    print(package + ' package for Python not found, pip installing now....')
    pip.main(['install', package])
    print(package + ' package has been successfully installed for Python\n Continuing Process...')

# Ensure beautifulsoup4 is installed
try:
    from bs4 import BeautifulSoup
except ImportError:
    install('beautifulsoup4')
    from bs4 import BeautifulSoup

# Ensure selenium is installed
try:
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
except ImportError:
    install('selenium')
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


# Class to extract and transform raw marketwatch.com financial data
class MarketWatchETL:

    def __init__(self, ticker):
        self.ticker = ticker.upper()
        # Set up desired capabilities to spoof Firefox since marketwatch.com rejects any PhantomJS Request
        self._dcap = dict(DesiredCapabilities.PHANTOMJS)
        self._dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) "
                                                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                                                           "Chrome/29.0.1547.57 Safari/537.36")
        self._base_url = 'https://www.marketwatch.com/investing/stock/'
        self._retries = 10

    # Private Static Method to clean and organize Key Data Extract
    @staticmethod
    def _cleaned_key_data_object(raw_data):
        cleaned_data = {}
        raw_labels = raw_data['labels']
        raw_values = raw_data['values']
        for raw_label, raw_value in zip(raw_labels, raw_values):
            cleaned_data[str(raw_label.get_text())] = raw_value.get_text()
        return cleaned_data

    # Private Method to scrape data from MarketWatch's web page
    def _scrape_financial_key_data(self):
        raw_data_obj = {}
        try:
            driver = webdriver.PhantomJS(desired_capabilities=self._dcap)
        except Exception:
            print('***SETUP ERROR: The PhantomJS Web Driver is either not configured or incorrectly configured!***')
            sys.exit(1)
        driver.get(self._base_url + self.ticker)
        i = 0
        while i < self._retries:
            try:
                time.sleep(3)
                html = driver.page_source
                soup = BeautifulSoup(html, "html.parser")
                labels = soup.find_all('small', class_="kv__label")
                values = soup.find_all('span', class_="kv__primary")
                if labels and values:
                    raw_data_obj.update({'labels': labels})
                    raw_data_obj.update({'values': values})
                    break
                else:
                    i += 1
            except Exception:
                i += 1
                continue
        if i == self._retries:
            print('Please check your internet connection!\nUnable to connect...')
            sys.exit(1)
        driver.service.process.send_signal(signal.SIGTERM)
        driver.quit()
        return raw_data_obj

    # Public Method to return a Stock's Key Data Object
    def get_stock_key_data(self):
        raw_data = self._scrape_financial_key_data()
        return self._cleaned_key_data_object(raw_data)


# Script's Main Process to test MarketWatchETL('TICKER')
if __name__ == '__main__':

    # Run financial key data extracts for Microsoft, Apple, and Wells Fargo
    msft_key_data = MarketWatchETL('MSFT').get_stock_key_data()
    aapl_key_data = MarketWatchETL('AAPL').get_stock_key_data()
    wfc_key_data = MarketWatchETL('WFC').get_stock_key_data()

    # Print result dictionaries
    print(msft_key_data.items())
    print(aapl_key_data.items())
    print(wfc_key_data.items())

Which outputs:

dict_items([('Rev. per Employee', '$841.03K'), ('Short Interest', '44.63M'), ('Yield', '1.53%'), ('Market Cap', '$831.23B'), ('Open', '$109.27'), ('EPS', '$2.11'), ('Shares Outstanding', '7.68B'), ('Ex-Dividend Date', 'Aug 15, 2018'), ('Day Range', '108.51 - 109.64'), ('Average Volume', '25.43M'), ('Dividend', '$0.42'), ('Public Float', '7.56B'), ('P/E Ratio', '51.94'), ('% of Float Shorted', '0.59%'), ('52 Week Range', '72.05 - 111.15'), ('Beta', '1.21')])
dict_items([('Rev. per Employee', '$2.08M'), ('Short Interest', '42.16M'), ('Yield', '1.34%'), ('Market Cap', '$1.04T'), ('Open', '$217.15'), ('EPS', '$11.03'), ('Shares Outstanding', '4.83B'), ('Ex-Dividend Date', 'Aug 10, 2018'), ('Day Range', '216.33 - 218.74'), ('Average Volume', '24.13M'), ('Dividend', '$0.73'), ('Public Float', '4.82B'), ('P/E Ratio', '19.76'), ('% of Float Shorted', '0.87%'), ('52 Week Range', '149.16 - 219.18'), ('Beta', '1.02')])
dict_items([('Rev. per Employee', '$384.4K'), ('Short Interest', '27.44M'), ('Yield', '2.91%'), ('Market Cap', '$282.66B'), ('Open', '$58.87'), ('EPS', '$3.94'), ('Shares Outstanding', '4.82B'), ('Ex-Dividend Date', 'Aug 9, 2018'), ('Day Range', '58.76 - 59.48'), ('Average Volume', '18.45M'), ('Dividend', '$0.43'), ('Public Float', '4.81B'), ('P/E Ratio', '15.00'), ('% of Float Shorted', '0.57%'), ('52 Week Range', '49.27 - 66.31'), ('Beta', '1.13')])
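The label/value pairing done by `_cleaned_key_data_object` can be exercised offline against a static snippet of the page's markup (the class names are the same ones the scraper above selects on; the HTML itself is a made-up stand-in for the real page):

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li><small class="kv__label">Open</small><span class="kv__primary">$109.27</span></li>
  <li><small class="kv__label">Market Cap</small><span class="kv__primary">$831.23B</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# Same selectors as _scrape_financial_key_data; find_all preserves document order,
# so the nth label lines up with the nth value.
labels = soup.find_all("small", class_="kv__label")
values = soup.find_all("span", class_="kv__primary")
cleaned = {label.get_text(): value.get_text() for label, value in zip(labels, values)}
print(cleaned)  # {'Open': '$109.27', 'Market Cap': '$831.23B'}
```

This only holds as long as the page emits one kv__primary per kv__label in the same order, which is the same assumption the scraper class makes.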

The only extra step you need before running this is to install and configure the PhantomJS web driver in your deployment environment. If you need to automate the deployment of a web scraper like this, you can write a bash/PowerShell installer script to handle pre-configuring your environment's PhantomJS.

Some resources for installing and configuring PhantomJS:

Windows/Mac PhantomJS installation executables

Debian Linux PhantomJS installation guide

RHEL PhantomJS installation guide

I just doubt the practicality, or even the possibility, of assembling the cookie in the way I suggested in your previous post.

I think the other practical possibility here is writing a Scrapy Crawler.

Answered 2018-08-28T02:50:51.533