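I also tried using a proxy such as BrowserMob Proxy to get a HAR file.
I did a lot of research, because the file I got back was always empty.
What I ended up doing was enabling the browser's performance log.
Note that this only works with ChromeDriver.
Here is my driver class (in Python):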
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium import webdriver
from lib.config import config


class Driver:
    # Enable Chrome's performance log. Note that this mutates the shared
    # DesiredCapabilities.CHROME dict; it is not passed to webdriver.Chrome explicitly.
    capabilities = DesiredCapabilities.CHROME
    capabilities['loggingPrefs'] = {'performance': 'ALL'}

    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument("--headless")

    # Emulate a mobile device only when the environment asks for it
    mobile_emulation = {"deviceName": "Nexus 5"}
    if config.Env().is_mobile():
        chrome_options.add_experimental_option(
            "mobileEmulation", mobile_emulation)

    chrome_options.add_experimental_option(
        'perfLoggingPrefs', {"enablePage": True})

    def __init__(self):
        self.performance_log = []
        self.instance = webdriver.Chrome(
            executable_path='/usr/local/bin/chromedriver', options=self.chrome_options)

    def navigate(self, url):
        if isinstance(url, str):
            self.instance.get(url)
            # Collect the performance log entries produced by this page load
            self.performance_log = self.instance.get_log('performance')
        else:
            raise TypeError("URL must be a string.")
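The output contains a huge amount of information, so you have to filter the raw data and keep only the network "response received" and "request sent" objects.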
import json
import secrets


def digest_log_data(performance_log):
    # write all raw data to a file
    with open('data.json', 'w', encoding='utf-8') as outfile:
        json.dump(performance_log, outfile)
    # open the file and read it back with encoding='utf-8'
    with open('data.json', encoding='utf-8') as data_file:
        data = json.loads(data_file.read())
    return data


def digest_raw_data(data, mongo_object=None):
    # avoid a mutable default argument; callers may pass their own dict to fill
    if mongo_object is None:
        mongo_object = {}
    wanted = ('Network.responseReceived', 'Network.requestWillBeSent')
    for entry in data:
        # each entry's 'message' field is itself a JSON string
        data_object = json.loads(entry['message'])
        if data_object['message']['method'] in wanted:
            mongo_object[secrets.token_hex(30)] = data_object
    return mongo_object
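If you need something closer to a HAR entry than a flat list of events, the filtered events can be paired up by their `requestId`. The following is only a minimal sketch, not part of my actual setup: the function name `summarize_requests` and the sample entries are hypothetical, and it assumes each log entry has the same shape as the ones handled above (a `message` field holding a JSON string with `method` and `params`).

```python
import json


def summarize_requests(performance_log):
    """Pair each request with its response (if any) by DevTools requestId."""
    summary = {}
    for entry in performance_log:
        # Each log entry wraps the DevTools event as a JSON string.
        event = json.loads(entry["message"])["message"]
        method = event.get("method")
        params = event.get("params", {})
        request_id = params.get("requestId")
        if request_id is None:
            # Not a per-request event (for example a Page.* event).
            continue
        if method == "Network.requestWillBeSent":
            record = summary.setdefault(request_id, {})
            # On redirects the same requestId is reused; the last URL wins here.
            record["url"] = params["request"]["url"]
            record["http_method"] = params["request"]["method"]
        elif method == "Network.responseReceived":
            record = summary.setdefault(request_id, {})
            record["status"] = params["response"]["status"]
            record["mime_type"] = params["response"]["mimeType"]
    return summary


def _entry(method, params):
    # Helper that builds a hypothetical log entry for the demo below.
    return {"message": json.dumps({"message": {"method": method, "params": params}})}


# Hypothetical sample data, shaped like the output of get_log('performance').
sample_log = [
    _entry("Network.requestWillBeSent",
           {"requestId": "1", "request": {"url": "https://example.com/", "method": "GET"}}),
    _entry("Page.loadEventFired", {"timestamp": 1.0}),
    _entry("Network.responseReceived",
           {"requestId": "1", "response": {"status": 200, "mimeType": "text/html"}}),
    _entry("Network.requestWillBeSent",
           {"requestId": "2", "request": {"url": "https://example.com/app.js", "method": "GET"}}),
]

if __name__ == "__main__":
    for request_id, record in summarize_requests(sample_log).items():
        print(request_id, record)
```

A request that never received a response (request "2" in the sample) simply ends up without `status` and `mime_type` keys, which makes failed or still-pending requests easy to spot.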
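We chose to push this data into MongoDB, where it is later analyzed by an ETL job and pushed into a Redshift database to build statistics.
I hope this is what you are looking for.
This is how I run the script: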
from lib import mongo_client
from lib.test_data import test_data as data
from lib.output_data import process_output_data as output_data
from lib.config import config
from lib import driver

browser = driver.Driver()

# get the list of urls which we need to navigate
urls = data.url_list()

for url in urls:
    browser.navigate(config.Env().base_url() + url)
    print('Visiting ' + url)
    # get performance log
    performance_log = browser.performance_log
    # digest the performance log
    raw_data = output_data.digest_log_data(performance_log)
    # initiate an empty dict
    mongo_object = {}
    # prepare the data for the mongo document
    output_data.digest_raw_data(raw_data, mongo_object)
    # load data into the mongo db
    mongo_client.populate_mongo(mongo_object)

browser.instance.quit()
My main source was this thread, which I adapted to my needs:
https://www.reddit.com/r/Python/comments/97m9iq/headless_browsers_export_to_har/
Thanks.