ORIGINAL POST 06-22-2021 @12:00 UTC
The error:
HTTP protocol error in client request: Server disconnected
is being thrown by mitmproxy. I pulled the following from the mitmproxy source code.
"clientconnect": "client_connected",
"clientdisconnect": "client_disconnected",
"serverconnect": "server_connect and server_connected",
"serverdisconnect": "server_disconnected",
class ServerDisconnectedHook(commands.StartHook):
    """
    A server connection has been closed (either by us or the server).
    """
    blocking = False
    data: ServerConnectionHookData
I recommend putting your code in a try/except block, which will let you suppress the error raised by mitmproxy.
from mitmproxy.exceptions import MitmproxyException
from mitmproxy.exceptions import HttpReadDisconnect

try:
    pass  # your driver code goes here
except HttpReadDisconnect as e:
    pass
except MitmproxyException as e:
    # Base class for all exceptions thrown by mitmproxy.
    pass
finally:
    driver.quit()
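If the exceptions I provided don't suppress your error, then I recommend trying some of the other exceptions in mitmproxy.
UPDATE 06-22-2021 @15:28 UTC
In my research I noted that seleniumwire has integration code for mitmproxy. Part of this integration captures the error messages thrown by mitmproxy.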
class SendToLogger:

    def log(self, entry):
        """Send a mitmproxy log message through our own logger."""
        getattr(logger, entry.level.replace('warn', 'warning'), logger.info)(entry.msg)
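Since that class forwards every mitmproxy log entry to a standard `logging` logger, one option that might quiet those messages is raising the threshold of the package-level logger. Here is a minimal sketch using only the standard library. The `Entry` class and the child logger name `seleniumwire.hypothetical_module` are stand-ins I made up for illustration; I have not confirmed which logger name selenium-wire actually uses internally.

```python
import io
import logging
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for a mitmproxy log entry."""
    level: str
    msg: str


# Parent logger for the package, plus a made-up child logger (an assumption).
package_logger = logging.getLogger("seleniumwire")
logger = logging.getLogger("seleniumwire.hypothetical_module")


def forward(entry):
    # Same dispatch as SendToLogger.log: 'warn' maps to logger.warning,
    # and unknown level names fall back to logger.info.
    getattr(logger, entry.level.replace("warn", "warning"), logger.info)(entry.msg)


# Capture output in memory so the effect is easy to inspect.
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False

package_logger.setLevel(logging.WARNING)
forward(Entry("warn", "first message"))     # passes the WARNING threshold

package_logger.setLevel(logging.CRITICAL)
forward(Entry("error", "second message"))   # filtered: ERROR is below CRITICAL

captured = stream.getvalue()
```

The child logger has no level of its own, so it inherits the effective level from the `seleniumwire` parent. That is why changing the parent is enough in this sketch.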
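In my testing it was difficult to suppress the error in question using mitmproxy.exceptions. Of the following exceptions, the only one that fired was HttpReadDisconnect, and it did not fire consistently.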
- Exception
- HttpReadDisconnect
- HttpProtocolException
- Http2ProtocolException
- MitmproxyException
- ServerException
- Exception
I noted that if I added a standard Exception handler:
except Exception as error:
    print('standard')
    print(''.join(traceback.format_tb(error.__traceback__)))
this line in your code consistently raised the error:
  File "/Users/user_name/Python_Projects/scratch_pad/seleniumwire_test.py", line 18, in <module>
    assert driver.requests[0].response.status_code == 200
When I looked at this error in more detail, I found that it was related to status_code.
<class 'AttributeError'>
'NoneType' object has no attribute 'status_code'
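That AttributeError means the `response` attribute of the first captured request was `None`, which is what selenium-wire reports when a request has not received a response. A minimal sketch of a guard follows. The helper name `first_status_code` and the fake request objects are mine, used here only so the example runs without a browser.

```python
from types import SimpleNamespace


def first_status_code(requests):
    """Return the status code of the first request that has a response, else None."""
    for request in requests:
        if request.response is not None:
            return request.response.status_code
    return None


# Hypothetical captured requests: the first one never received a response.
captured_requests = [
    SimpleNamespace(url="https://example.com/a", response=None),
    SimpleNamespace(url="https://example.com/b",
                    response=SimpleNamespace(status_code=200)),
]

status = first_status_code(captured_requests)
```

With a real driver you would pass `driver.requests` instead of the fake list, and treat a `None` result as "no response captured" rather than asserting on it directly.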
UPDATE 06-23-2021 @15:04 UTC
In my research I found that selenium has a service_log_path parameter that can be passed to webdriver.Chrome().
class WebDriver(ChromiumDriver):

    def __init__(self, executable_path="chromedriver", port=DEFAULT_PORT,
                 options: Options = None, service_args=None,
                 desired_capabilities=None, service_log_path=DEFAULT_SERVICE_LOG_PATH,
                 chrome_options=None, service: Service = None, keep_alive=DEFAULT_KEEP_ALIVE):
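According to the documentation, this parameter can be used like this: service_log_path=/dev/null
Unfortunately, the comments in the WebDriver(ChromiumDriver) class indicate that this parameter is deprecated. It also failed to suppress the error messages sent to sys.stdout.
service_log_path - Deprecated: Where to log information from the driver.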
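CURRENT STATUS
I reworked your code and removed the status_code line that was raising the error. I added some implicitly_wait() calls and some WebDriverWait statements to handle what you were trying to do with the status_code statement. I also added some error handling to catch specific error types, and some chrome_options to suppress certain things, such as loading the site's images, which aren't needed to scrape the target website.
Finally, I added a custom logging helper to suppress the error messages being sent to sys.stdout. I tested the code multiple times and so far I have not received any error messages on sys.stdout. More testing might be needed if you see the messages again.
Here is the actual code: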
import sys
import logging
import traceback
from seleniumwire import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from mitmproxy.exceptions import HttpReadDisconnect, TcpDisconnect, TlsException
class DisableLogger():
    def __enter__(self):
        logging.disable(logging.WARNING)

    def __exit__(self, exit_type, exit_value, exit_traceback):
        logging.disable(logging.NOTSET)
options = {
    "backend": "mitmproxy",
    'mitm_http2': False,
    'disable_capture': True,
    'verify_ssl': True,
    'connection_keep_alive': False,
    'max_threads': 3,
    'connection_timeout': None,
    'proxy': {
        'https': 'https://209.40.237.43:8080',
    }
}
chrome_options = Options()
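# The user agent must be passed with the "user-agent=" switch; a bare string is not a valid argument.
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36")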
chrome_options.add_argument("--start-maximized")
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-logging')
chrome_options.add_argument("--disable-application-cache")
chrome_options.add_argument("--ignore-certificate-errors")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
webdriver.DesiredCapabilities.CHROME['acceptSslCerts'] = True
prefs = {
    "profile.managed_default_content_settings.images": 2,
    "profile.default_content_settings.images": 2
}
capabilities = webdriver.DesiredCapabilities.CHROME
chrome_options.add_experimental_option("prefs", prefs)
capabilities.update(chrome_options.to_capabilities())
driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver',
                          options=chrome_options, seleniumwire_options=options)
with DisableLogger():
    driver.implicitly_wait(60)
    try:
        driver.get('https://www.zillow.com/Houston,-TX/houses/')
        wait = WebDriverWait(driver, 240)
        page_title = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="search-page-react-content"]')))
        if page_title:
            post_links = [i.get_attribute("href") for i in driver.find_elements_by_css_selector("article[role='presentation'] > .list-card-info > a.list-card-link")]
            for individual_link in post_links:
                driver.implicitly_wait(60)
                driver.get(individual_link)
                post_title = driver.find_element_by_css_selector("h1").text
                print(post_title)
    except HttpReadDisconnect as error:
        print('A HttpReadDisconnect Exception has occurred')
        exc_type, exc_value, exc_tb = sys.exc_info()
        print(exc_type)
        print(exc_value)
        print(''.join(traceback.format_tb(error.__traceback__)))
        driver.quit()
    except TimeoutException as error:
        print('A TimeOut Exception has occurred')
        exc_type, exc_value, exc_tb = sys.exc_info()
        print(exc_type)
        print(exc_value)
        print(''.join(traceback.format_tb(error.__traceback__)))
        driver.quit()
    except TcpDisconnect as error:
        print('A TCP Disconnect Exception has occurred')
        exc_type, exc_value, exc_tb = sys.exc_info()
        print(exc_type)
        print(exc_value)
        print(''.join(traceback.format_tb(error.__traceback__)))
        driver.quit()
    except TlsException as error:
        print('A TLS Exception has occurred')
        exc_type, exc_value, exc_tb = sys.exc_info()
        print(exc_type)
        print(exc_value)
        print(''.join(traceback.format_tb(error.__traceback__)))
        driver.quit()
    except Exception as error:
        print('An exception has occurred')
        print(''.join(traceback.format_tb(error.__traceback__)))
        pass
    finally:
        driver.quit()
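One thing worth knowing about the DisableLogger helper above: `logging.disable(logging.WARNING)` silences records at WARNING level and below, but ERROR and CRITICAL records still get through. The sketch below shows that behaviour with an in-memory handler. The logger name `disable_logger_demo` is just a placeholder for this demo.

```python
import io
import logging


class DisableLogger:
    """Same helper as in the answer: mute WARNING and below inside the block."""

    def __enter__(self):
        logging.disable(logging.WARNING)

    def __exit__(self, exit_type, exit_value, exit_traceback):
        logging.disable(logging.NOTSET)


# Placeholder logger writing to an in-memory stream so the result can be inspected.
stream = io.StringIO()
demo_logger = logging.getLogger("disable_logger_demo")
demo_logger.addHandler(logging.StreamHandler(stream))
demo_logger.propagate = False
demo_logger.setLevel(logging.DEBUG)

with DisableLogger():
    demo_logger.warning("suppressed warning")   # dropped inside the block
    demo_logger.error("error still shown")      # ERROR is above the disabled level

demo_logger.warning("warning after the block")  # logging is restored on exit
output = stream.getvalue()
```

If you also need to silence ERROR records inside the block, `logging.disable(logging.CRITICAL)` would mute every standard level, at the cost of hiding genuine failures as well.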
OBSERVATIONS
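I noted that you're using a free proxy rather than a paid proxy service. I found that the proxy in your code, hxxps://136.226.33.115:80, is a standard HTTP proxy, and it also has latency issues that cause timeouts when connecting to the target website.
Another observation is that your target website has a captcha, which is triggered when you send too many connection requests.
I also noted that your proxy server throws connection issues as well, which cause error messages to be sent to sys.stdout. That is likely what you are running into.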
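If you want a quick way to see whether a proxy is reachable and how long it takes to accept a connection before handing it to selenium-wire, a plain TCP check can help. This is only a rough sketch: the function name `proxy_latency` is mine, it measures the TCP connect time and nothing else, and the demo below runs against a local socket rather than a real proxy.

```python
import socket
import time


def proxy_latency(host, port, timeout=5.0):
    """Return the seconds taken to open a TCP connection, or None if it fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return None


# Demonstration against a local listening socket instead of a real proxy.
server = socket.socket()
server.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
server.listen(1)
host, port = server.getsockname()

latency = proxy_latency(host, port)                    # listener is up
server.close()
closed_latency = proxy_latency(host, port, timeout=1.0)  # listener is gone
```

A `None` result, or a latency close to your timeout, would suggest the proxy itself is the source of the disconnects rather than the target site.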
SIDE NOTE
The selenium session in your code will occasionally hit an "I am human" captcha from Zillow.
----------------------------------------
My system information
----------------------------------------
Platform: Mac
Python Version: 3.9
Seleniumwire: 4.3.1
Selenium: 3.141.0
mitmproxy: 6.0.2
browserVersion: 91.0.4472.114
chromedriverVersion: 90.0.4430.24
IDE: PyCharm 2021.1.2
----------------------------------------