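I'm trying to create a script using Python with seleniumwire that implements proxies. The script works sometimes, but most of the time it prints error log details even when the status_code is 200. I'd like to get rid of those log details. The IP address hardcoded in the script was taken from a free proxy site, so it may no longer work.

This is what I'm trying: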

from seleniumwire import webdriver

URL = 'https://www.zillow.com/Houston,-TX/houses/'

options = {
    'mitm_http2': False,
    'proxy': {'https': f'https://136.226.33.115:80'}
}

driver = webdriver.Chrome(seleniumwire_options=options)

driver.get(URL)
assert driver.requests[0].response.status_code==200
post_links = [i.get_attribute("href") for i in driver.find_elements_by_css_selector("article[role='presentation'] > .list-card-info > a.list-card-link")]
for individual_link in post_links:
    driver.get(individual_link)
    assert driver.requests[0].response.status_code==200
    post_title = driver.find_element_by_css_selector("h1").text
    print(post_title)
driver.quit()

This is the kind of error log output I see in the console:

127.0.0.1:55825: request
  -> HTTP protocol error in client request: Server disconnected
127.0.0.1:55967: request
  -> HTTP protocol error in client request: Server disconnected
127.0.0.1:64891: request
  -> HTTP protocol error in client request: Server disconnected
127.0.0.1:61466: request
  -> HTTP protocol error in client request: Server disconnected
127.0.0.1:51332: request
  -> HTTP protocol error in client request: Server disconnected
127.0.0.1:52783: request
  -> HTTP protocol error in client request: Server disconnected

How can I force the script not to print these log details?

3 Answers

Original post 06-22-2021 @12:00 UTC

The error:

HTTP protocol error in client request: Server disconnected

is thrown by mitmproxy. I pulled the following from the mitmproxy source code.

"clientconnect": "client_connected",
"clientdisconnect": "client_disconnected",
"serverconnect": "server_connect and server_connected",
"serverdisconnect": "server_disconnected",


class ServerDisconnectedHook(commands.StartHook):
    """
    A server connection has been closed (either by us or the server).
    """
    blocking = False
    data: ServerConnectionHookData

I suggest putting your code in a try/except block, which will let you suppress the errors raised by mitmproxy.

from mitmproxy.exceptions import MitmproxyException
from mitmproxy.exceptions import HttpReadDisconnect

try:
    ...  # your driver code goes here
except HttpReadDisconnect:
    pass
except MitmproxyException:
    # Base class for all exceptions thrown by mitmproxy.
    pass
finally:
    driver.quit()

If the exceptions I listed do not suppress your error, I suggest trying some of the other exceptions in mitmproxy.

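Update 06-22-2021 @15:28 UTC

In my research I noticed that seleniumwire has integration code for mitmproxy. Part of that integration captures the error messages thrown by mitmproxy and sends them through seleniumwire's own logger.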

class SendToLogger:

    def log(self, entry):
        """Send a mitmproxy log message through our own logger."""
        getattr(logger, entry.level.replace('warn', 'warning'), logger.info)(entry.msg)
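
Because those messages end up in an ordinary Python logger, one option is to raise that logger's threshold. The sketch below uses only the standard library and a stand-in logger. The name "seleniumwire" assumes that Selenium Wire's loggers are named after its package, and "seleniumwire.demo" and quiet_logger are hypothetical names used only for illustration.

```python
import io
import logging


def quiet_logger(name, level=logging.ERROR):
    """Raise the threshold of a named logger so lower-severity records are dropped."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


# Stand-in child logger writing to an in-memory stream (hypothetical name).
stream = io.StringIO()
demo = logging.getLogger("seleniumwire.demo")
demo.addHandler(logging.StreamHandler(stream))
demo.propagate = False

demo.warning("before")        # passes: the effective level is still WARNING
quiet_logger("seleniumwire")  # assumption: the package-level logger name
demo.warning("after")         # dropped: the child inherits ERROR from its parent

print(stream.getvalue())
```

Child loggers without a level of their own inherit the threshold of the nearest configured ancestor, which is why setting it once on the package name may be enough. Whether the messages in question are logged below ERROR is not confirmed here, so the level may need adjusting.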

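In my testing, suppressing the error in question with mitmproxy.exceptions was difficult. Of the exceptions below, the only one that was ever triggered was HttpReadDisconnect, and it did not fire consistently.

  • Exception
  • HttpReadDisconnect
  • HttpProtocolException
  • Http2ProtocolException
  • MitmproxyException
  • ServerException
  • Exception

I noticed that when I added a standard exception handler: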

except Exception as error:
    print('standard')
    print(''.join(traceback.format_tb(error.__traceback__)))

this line in your code consistently raised the error:

 File "/Users/user_name/Python_Projects/scratch_pad/seleniumwire_test.py", line 18, in <module>
    assert driver.requests[0].response.status_code == 200

When I looked at this error in more detail, I found that it is related to status_code.

<class 'AttributeError'>
'NoneType' object has no attribute 'status_code'
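
The AttributeError means that driver.requests[0].response was None, that is, the first captured request had no response yet. A minimal sketch of a guard is shown below, using stand-in objects instead of a live driver. first_status_code is a hypothetical helper, and unlike the original assert it looks for the first request that has a response rather than insisting on index 0.

```python
from types import SimpleNamespace


def first_status_code(requests):
    """Return the status code of the first request that has a response, else None."""
    for request in requests:
        if request.response is not None:
            return request.response.status_code
    return None


# Stand-ins that mimic captured requests with and without a response.
pending = SimpleNamespace(response=None)
done = SimpleNamespace(response=SimpleNamespace(status_code=200))

print(first_status_code([pending, done]))
print(first_status_code([pending]))
```

With a real driver the call would be first_status_code(driver.requests), and a None result can be treated as "no response captured" instead of crashing.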

Update 06-23-2021 @15:04 UTC

In my research I found that selenium has a service_log_path parameter that can be passed to webdriver.Chrome().

class WebDriver(ChromiumDriver):
   
    def __init__(self, executable_path="chromedriver", port=DEFAULT_PORT,
                 options: Options = None, service_args=None,
                 desired_capabilities=None, service_log_path=DEFAULT_SERVICE_LOG_PATH,
                 chrome_options=None, service: Service = None, keep_alive=DEFAULT_KEEP_ALIVE):

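According to the documentation, this parameter can be used like this: service_log_path='/dev/null'

Unfortunately, a comment in the WebDriver(ChromiumDriver) class indicates that the parameter is deprecated. It also failed to suppress the error messages sent to sys.stdout.

service_log_path - Deprecated: Where to log information from the driver.

Current status

I reworked your code and removed the status_code lines that raised the error. I added implicitly_wait() calls and some WebDriverWait statements to cover what you were trying to do with the status_code checks. I also added error handling to catch specific error types, and some chrome_options to suppress things that are not needed for scraping the target website, such as loading images.

Finally, I added a custom logging context manager to suppress the error messages sent to sys.stdout. I tested the code several times and so far have not received any error messages on sys.stdout. If you see the messages again, more testing may be needed.

Here is the actual code: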

import sys
import logging
import traceback
from seleniumwire import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from mitmproxy.exceptions import HttpReadDisconnect, TcpDisconnect, TlsException


class DisableLogger():
    def __enter__(self):
       logging.disable(logging.WARNING)
    def __exit__(self, exit_type, exit_value, exit_traceback):
       logging.disable(logging.NOTSET)


options = {
    "backend": "mitmproxy",
    'mitm_http2': False,
    'disable_capture': True,
    'verify_ssl': True,
    'connection_keep_alive': False,
    'max_threads': 3,
    'connection_timeout': None,
    'proxy': {
        'https': 'https://209.40.237.43:8080',
    }
}

chrome_options = Options()
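# Chrome only applies a custom user agent when it is passed via the --user-agent switch.
chrome_options.add_argument(
    "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36")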
chrome_options.add_argument("--start-maximized")
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-logging')
chrome_options.add_argument("--disable-application-cache")
chrome_options.add_argument("--ignore-certificate-errors")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)

webdriver.DesiredCapabilities.CHROME['acceptSslCerts'] = True

prefs = {
   "profile.managed_default_content_settings.images": 2,
   "profile.default_content_settings.images": 2
 }

capabilities = webdriver.DesiredCapabilities.CHROME
chrome_options.add_experimental_option("prefs", prefs)
capabilities.update(chrome_options.to_capabilities())

driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver',
                          options=chrome_options, seleniumwire_options=options)

with DisableLogger():
    driver.implicitly_wait(60)
    try:
        driver.get('https://www.zillow.com/Houston,-TX/houses/')
        wait = WebDriverWait(driver, 240)
        page_title = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="search-page-react-content"]')))
        if page_title:
            post_links = [i.get_attribute("href") for i in driver.find_elements_by_css_selector("article[role='presentation'] > .list-card-info > a.list-card-link")]
            for individual_link in post_links:
                driver.implicitly_wait(60)
                driver.get(individual_link)
                post_title = driver.find_element_by_css_selector("h1").text
                print(post_title)

    except HttpReadDisconnect as error:
        print('A HttpReadDisconnect Exception has occurred')
        exc_type, exc_value, exc_tb = sys.exc_info()
        print(exc_type)
        print(exc_value)
        print(''.join(traceback.format_tb(error.__traceback__)))
        driver.quit()

    except TimeoutException as error:
        print('A TimeOut Exception has occurred')
        exc_type, exc_value, exc_tb = sys.exc_info()
        print(exc_type)
        print(exc_value)
        print(''.join(traceback.format_tb(error.__traceback__)))
        driver.quit()

    except TcpDisconnect as error:
        print('A TCP Disconnect Exception has occurred')
        exc_type, exc_value, exc_tb = sys.exc_info()
        print(exc_type)
        print(exc_value)
        print(''.join(traceback.format_tb(error.__traceback__)))
        driver.quit()

    except TlsException as error:
        print('A TLS Exception has occurred')
        exc_type, exc_value, exc_tb = sys.exc_info()
        print(exc_type)
        print(exc_value)
        print(''.join(traceback.format_tb(error.__traceback__)))
        driver.quit()

    except Exception as error:
        print('An exception has occurred')
        print(''.join(traceback.format_tb(error.__traceback__)))
        pass

    finally:
        driver.quit()
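
The DisableLogger class above relies on logging.disable(), which switches off every record at or below the given level for the whole process. The sketch below shows that behaviour in isolation with a stand-in logger. logging_disabled and "demo.proxy" are hypothetical names, and restoring the previous threshold (rather than always resetting to NOTSET) is a variation on the original, not part of it.

```python
import io
import logging
from contextlib import contextmanager


@contextmanager
def logging_disabled(level=logging.CRITICAL):
    """Temporarily disable all records at or below `level`, then restore the old setting."""
    previous = logging.root.manager.disable
    logging.disable(level)
    try:
        yield
    finally:
        logging.disable(previous)


# Stand-in logger writing to an in-memory stream (hypothetical name).
stream = io.StringIO()
log = logging.getLogger("demo.proxy")
log.addHandler(logging.StreamHandler(stream))
log.propagate = False

with logging_disabled(logging.WARNING):
    log.warning("hidden")     # at the disabled level: dropped
    log.error("still shown")  # above the disabled level: emitted
log.warning("shown again")    # threshold restored on exit

print(stream.getvalue())
```

Note that with logging.WARNING as the threshold, records logged at ERROR still get through, so a higher level such as logging.CRITICAL may be needed if messages continue to appear.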

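Observations

I noticed that you are using a free proxy rather than a paid proxy service. I found that the proxy in your code, hxxps://136.226.33.115:80, is a standard HTTP proxy, and it also has latency problems that cause timeouts when connecting to the target website.

Another observation is that your target website has a CAPTCHA, which is triggered when you send too many connection requests.

I also noticed that your proxy server has connection problems of its own, which cause error messages to be sent to sys.stdout. That is probably what you are running into.

Side note

The selenium session in your code occasionally runs into Zillow's "I am human" CAPTCHA.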

----------------------------------------
My system information
----------------------------------------

Platform: Mac 
Python Version: 3.9
Seleniumwire: 4.3.1
Selenium: 3.141.0
mitmproxy: 6.0.2
browserVersion: 91.0.4472.114
chromedriverVersion: 90.0.4430.24
IDE: PyCharm 2021.1.2

----------------------------------------
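answered 2021-06-22T11:37:40.187

If you are working on a Linux distribution, you can redirect the error output. To do so, append 2>/dev/null to the end of the command. For example, you can run the script like this: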

python SCRIPT 2>/dev/null
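
A rough in-script counterpart, for cases where changing the command line is not an option, is contextlib.redirect_stderr. The sketch below is an illustration only: run_quietly and noisy_task are hypothetical names, and the redirect affects only Python code that writes through sys.stderr while the block is active. Handlers that already hold a reference to the original stream, and child processes, are not covered.

```python
import contextlib
import os
import sys


def run_quietly(func, *args, **kwargs):
    """Call func with sys.stderr pointed at the null device, then restore it."""
    with open(os.devnull, "w") as devnull:
        with contextlib.redirect_stderr(devnull):
            return func(*args, **kwargs)


def noisy_task():
    # Stand-in for code that writes diagnostics to stderr.
    print("proxy noise", file=sys.stderr)
    return "done"


result = run_quietly(noisy_task)
print(result)
```

The shell redirect remains the more thorough option because it applies to the whole process, including output that never passes through sys.stderr.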
answered 2021-06-14T12:24:54.827

You can define the desired log level with the following line of code:

options.add_argument('--log-level=3')

Add this to your options.

The log-level switch sets the minimum log level.
Valid values are 0 to 3:

INFO = 0,
WARNING = 1,
LOG_ERROR = 2,
LOG_FATAL = 3.

The default is 0.
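
To avoid a bare number in the script, the mapping above can be wrapped in a small helper. This is only a sketch: CHROME_LOG_LEVELS and chrome_log_level_arg are hypothetical names, and the short keys are shorthand for the four values listed above.

```python
# Level names mapped to the numeric values listed above (keys are shorthand).
CHROME_LOG_LEVELS = {"info": 0, "warning": 1, "error": 2, "fatal": 3}


def chrome_log_level_arg(level):
    """Build the --log-level switch for a named level."""
    try:
        value = CHROME_LOG_LEVELS[level.lower()]
    except KeyError:
        raise ValueError(f"unknown log level: {level!r}") from None
    return f"--log-level={value}"


print(chrome_log_level_arg("fatal"))
```

The result would be passed as options.add_argument(chrome_log_level_arg("fatal")). This switch controls the browser's own logging, so it may not affect messages printed by the proxy layer in the Python process.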

answered 2021-06-14T12:28:02.500