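I am using the Python incapsula module, together with some proxy rotation, to scrape data from a website for personal use. The module is used to fetch data from a web page, and I use it to create the variable my_header3, which is then sent as a request header in the code below.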

However, when I try the same approach to fetch data from an XHR request, an empty string is returned. Can anyone see from the code below what I need to change?

from incapsula import crack, IncapSession
import requests
from cookielib import LWPCookieJar
import json
import SelectProxy

SelectProxy.select_proxy()
local_proxy = SelectProxy.global_proxy

session = requests.Session()
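# requests expects a dict mapping scheme to proxy URL; {local_proxy} is a set.
# Assumes local_proxy is a proxy URL string such as 'http://host:port'.
session.proxies = {'http': local_proxy, 'https': local_proxy}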

url = 'http://www.whoscored.com/tournamentsfeed/12496/Fixtures/'

params = {
    'd': '201508',
    'isAggregate': 'false'
}

headers = {
    'authority': 'www.whoscored.com',
    'method': 'GET',
    'path': '/tournamentsfeed/12496/Fixtures/?d=2015W08&isAggregate=false',
    'scheme': 'https',
    'accept': 'text/plain, */*; q=0.01',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'en-GB,en-US;q=0.8,en;q=0.6',
    'model-last-mode': my_header3,
    'referer': 'https://www.whoscored.com/Regions/252/Tournaments/2/England-Premier-League',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest'
}

response = session.get(url, params=params, headers=headers)
print 'response = ', response
response = crack(session, response)
print 'response = ', response
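
A small check that may help narrow this down (a sketch, not a confirmed fix). Two things stand out in the code above: the `d` value in `params` is `201508`, while the `path` value copied from the browser shows `d=2015W08`; and `authority`, `method`, `path` and `scheme` look like the HTTP/2 pseudo-header fields that browser dev tools display, which are not ordinary request headers. The helpers `query_from_path`, `differing_keys` and `drop_pseudo_headers` are hypothetical names introduced here for illustration; only the standard library is used.

```python
try:
    from urllib.parse import urlsplit, parse_qsl  # Python 3
except ImportError:
    from urlparse import urlsplit, parse_qsl      # Python 2

# Values copied from the question.
params = {'d': '201508', 'isAggregate': 'false'}
copied_path = '/tournamentsfeed/12496/Fixtures/?d=2015W08&isAggregate=false'

# Names shown by browser dev tools for HTTP/2 pseudo-header fields.
PSEUDO_HEADERS = {'authority', 'method', 'path', 'scheme'}


def query_from_path(path):
    """Return the query string of a request path as a dict."""
    return dict(parse_qsl(urlsplit(path).query))


def differing_keys(sent, expected):
    """Return the sorted keys whose values differ between two dicts."""
    keys = set(sent) | set(expected)
    return sorted(k for k in keys if sent.get(k) != expected.get(k))


def drop_pseudo_headers(headers):
    """Return a copy of headers without the pseudo-header entries."""
    return {k: v for k, v in headers.items()
            if k.lower().lstrip(':') not in PSEUDO_HEADERS}


browser_params = query_from_path(copied_path)
mismatched = differing_keys(params, browser_params)
print(mismatched)
```

If the list printed is not empty, the request being sent differs from the one the browser made, which is worth ruling out before looking at the Incapsula handling itself.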

Thanks

Edit

my_header3 is obtained using the code below. It is the part of the request headers in the code above that I am not sure how to generate.

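import re  # needed for the regex search further down

session = requests.Session()
# requests expects a dict mapping scheme to proxy URL; {local_proxy} is a set.
# Assumes local_proxy is a proxy URL string such as 'http://host:port'.
session.proxies = {'http': local_proxy, 'https': local_proxy}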
cookies = LWPCookieJar('cookiejar')

url = 'http://www.whoscored.com/Regions/252/Tournaments/2/England-Premier-League'
headers = {
    'authority': 'www.whoscored.com',
    'method': 'GET',
    'path': '/',
    'scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'en-GB,en-US;q=0.8,en;q=0.6',
    'cache-control': 'max-age=0',
    'referer': 'https://www.google.co.uk/',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'
}

response = session.get(url, headers=headers)  # url is blocked by incapsula
response = crack(session, response)  # url is no longer blocked by incapsula

regex = re.compile("'Model-Last-Mode': '.*?'", re.S)
my_header = re.search(regex, response.text)
my_header2 = my_header.group()

my_header3 = my_header2.replace("'Model-Last-Mode': '", '').replace("'", "")
print my_header3
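
The extraction step above can also be written with a capture group, which avoids the two `replace` calls and makes the "no match" case explicit (as written, `my_header.group()` raises `AttributeError` when `re.search` returns `None`, for example if the page is still blocked). This is a minimal sketch: `extract_model_last_mode` is a hypothetical helper name, the pattern allows any whitespace after the colon rather than exactly one space, and the sample string is a made-up stand-in for the real page source.

```python
import re

# Capture only the value between the quotes after 'Model-Last-Mode':
MODEL_LAST_MODE_RE = re.compile(r"'Model-Last-Mode':\s*'([^']*)'")


def extract_model_last_mode(page_text):
    """Return the Model-Last-Mode value found in page_text, or None."""
    match = MODEL_LAST_MODE_RE.search(page_text)
    return match.group(1) if match else None


# Made-up sample text for illustration, not real page content.
sample = "$.ajaxSetup({ headers: { 'Model-Last-Mode': 'abc123==' } });"
print(extract_model_last_mode(sample))
print(extract_model_last_mode("<html>blocked</html>"))
```

Checking for `None` before building the XHR headers makes it easier to tell a failed extraction apart from a blocked request.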