python - BadStatusLine: Python with http.client - works for some sites, but not others

Question

import http.client
import csv

def http_get(url, path, headers):
    try:
        conn = http.client.HTTPConnection(url)
        print ('Connecting to ' + url)
        conn.request(url, path, headers=headers)
        resp = conn.getresponse()
        if resp.status<=400:
            body = resp.read()
            print ('Reading Source...')
    except Exception as e:
        raise Exception('Connection Error: %s' % e)
        pass
    finally:
        conn.close()
        print ('Connection Closed')

    if resp.status >= 400:
        print (url)
        raise ValueError('Response Error: %s, %s, URL: %s' % (resp.status, resp.reason,url))
    return body


with open('domains.csv','r') as csvfile:
    urls = [row[0] for row in csv.reader(csvfile)]

L = ['Version 0.7','Version 1.2','Version 1.5','Version 2.0','Version 2.1','Version 2.3','Version 2.5','Version 2.6','Version 2.7','Version 2.8','Version 2.9','Version 2.9','Version 3.0','Version 3.1','Version 3.2','Version 3.3','Version 3.4','Version 3.5.1','Version 3.5.2']
PATH = '/'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
HEADERS = {'User-Agent': user_agent}

for url in urls:        
    HOST = url

    print ('Testing WordPress Installation on ' + url)
    http_get(HOST,PATH,HEADERS)

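I've been looking at this for a week or two now. I've found reports of similar errors, but I can't understand why it works for some sites in the csv file and not for others. I checked the server and found it was dropping ICMP packets by default, so I changed that, and now traceroute and ping both show 100% received instead of the previous 100% loss. I thought this was related because all the sites on that host have the same problem. However, my script still throws the exception: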

mud@alex-BBVM:~/Desktop/scripts$ python3 httpTest.py
Testing WordPress Installation on XXXXX.ie
Connecting to exsite.ie
Reading Source...
Connection Closed
Testing WordPress Installation on AAAAAA.com
Connecting to AAAAA.com
Reading Source...
Connection Closed
Testing WordPress Installation on YYYYY.ie
Connecting to YYYYY.ie
Reading Source...
Connection Closed
Testing WordPress Installation on CCCCC.ie
Connecting to CCCCCC.ie
Reading Source...
Connection Closed
Testing WordPress Installation on DDDDDDD.ie
Connecting to DDDDDDD.ie
Connection Closed
Traceback (most recent call last):
  File "httpTest.py", line 9, in http_get
    resp = conn.getresponse()
  File "/usr/lib/python3.2/http/client.py", line 1049, in getresponse
    response.begin()
  File "/usr/lib/python3.2/http/client.py", line 346, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.2/http/client.py", line 328, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine: <html>


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "httpTest.py", line 38, in <module>
    http_get(HOST,PATH,HEADERS)
  File "httpTest.py", line 14, in http_get
    raise Exception('Connection Error: %s' % e)
Exception: Connection Error: <html>

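I've obviously replaced the URLs with placeholders, as they are client addresses and I don't want to post them here.

Anyway, any insight or help is appreciated.

I've read the documentation for http.client and its related exceptions, but I can't seem to work out a solution from what I've gathered there.

Thanks!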

score 0 · Accepted Answer

First, I would suggest that you always read everything from the HTTPResponse object before calling conn.close(). Even a 404 response contains a document.
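A minimal sketch of that advice, not a drop-in replacement for the script: the body is read for every status before the connection is closed, and the caller decides what to do with an error status afterwards. The `port` and `timeout` parameters and the tuple return value are my own additions for illustration.

```python
import http.client


def http_get(host, path, headers, port=80, timeout=10):
    """Return (status, reason, body), reading the body before closing."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        # The first argument is the HTTP method; the host was already
        # given to HTTPConnection above.
        conn.request('GET', path, headers=headers)
        resp = conn.getresponse()
        # Read the whole body whatever the status is: error responses
        # such as 404 carry a document too.
        body = resp.read()
        return resp.status, resp.reason, body
    finally:
        # Close only after the response has been fully read.
        conn.close()
```

The caller can then test `status >= 400` on the returned tuple instead of touching `resp` after the connection has gone away.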

I am puzzled by your traceback: as far as I can tell, http.client.BadStatusLine is caught by your except Exception.

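In general, an except Exception clause is not a good idea: unless you re-raise the same exception (which you don't), you can mask the underlying problem. In any case, when code doesn't work as expected, it is the first thing that should go.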
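If you do want to wrap errors, one option is to catch only the failures you expect from the network layer and chain the original exception instead of replacing it. This is a sketch under my own assumptions: `FetchError` and `guarded` are hypothetical names, not part of http.client.

```python
import http.client


class FetchError(Exception):
    """Hypothetical wrapper; the original error stays on __cause__."""


def guarded(fetch, *args):
    """Call fetch(*args), wrapping only HTTP and socket-level failures."""
    try:
        return fetch(*args)
    except (http.client.HTTPException, OSError) as exc:
        # "raise ... from exc" keeps the original type and traceback,
        # so a BadStatusLine is still visible to whoever reads the log.
        raise FetchError('request failed: %r' % (exc,)) from exc
```

Anything outside those two families, for example a KeyError from a bug in your own code, propagates untouched rather than being relabelled as a connection error.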

Also, the output you have provided does not seem to match the code you have provided.

Specifically, according to the traceback:

Connection Closed
Traceback (most recent call last):
  File "httpTest.py", line 9, in http_get
    resp = conn.getresponse()

The code just before that line has a print ('Connecting to ' + url):

print ('Connecting to ' + url)
conn.request(url, path, headers=headers)
resp = conn.getresponse()

But the line before the traceback in the output is Connection Closed.

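Update

I overlooked the try / finally: the finally block runs, and prints Connection Closed, before the traceback is printed.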

http.client.BadStatusLine is raised when the first line of the response is not a status line such as HTTP/1.1 200 OK. In this particular case, it was <html> instead.
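This can be reproduced without any network by feeding canned bytes to HTTPResponse. The sketch below assumes a stand-in socket class (`CannedSocket`, my own helper) that only implements `makefile()`, which is what HTTPResponse reads from.

```python
import http.client
import io


class CannedSocket:
    """Stand-in socket (illustrative helper) that replays fixed bytes."""

    def __init__(self, payload):
        self._payload = payload

    def makefile(self, *args, **kwargs):
        # HTTPResponse calls sock.makefile('rb') and reads from the result.
        return io.BytesIO(self._payload)


def try_parse(raw):
    """Parse raw response bytes and describe the outcome as a string."""
    resp = http.client.HTTPResponse(CannedSocket(raw))
    try:
        resp.begin()  # reads the status line, then the headers
        return 'status %d' % resp.status
    except http.client.BadStatusLine as exc:
        # exc.line holds the offending first line as received.
        return 'BadStatusLine: %s' % exc.line.strip()
    finally:
        resp.close()
```

Passing a well-formed response such as `b'HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n'` parses normally, while a payload that starts with `<html>` triggers the same exception as in the question.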

Either the server is returning a document without HTTP headers, or this is unexpected behaviour in the code.
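One candidate for unexpected behaviour in the code, offered as a guess rather than a confirmed diagnosis: the first parameter of HTTPConnection.request() is the HTTP method, and the script passes the host name there. The sketch below records what http.client would put on the wire by pre-setting `conn.sock` to a stand-in object (`RecordingSocket`, my own helper; `example.invalid` is a placeholder host), so no real connection is opened.

```python
import http.client


class RecordingSocket:
    """Stand-in socket (illustrative helper) that stores outgoing bytes."""

    def __init__(self):
        self.sent = b''

    def sendall(self, data):
        self.sent += data

    def close(self):
        pass


def request_line(method, path, host='example.invalid'):
    """Return the first line http.client would send for this request."""
    conn = http.client.HTTPConnection(host)
    sock = RecordingSocket()
    conn.sock = sock  # already "connected", so connect() is never called
    conn.request(method, path)
    conn.close()
    return sock.sent.split(b'\r\n', 1)[0]
```

With `'GET'` as the method the request line is `GET / HTTP/1.1`; with the host name in that position it begins with the host name instead. Some servers may tolerate that and others may not, which would fit the "works for some sites" symptom, but only a capture of the real exchange can confirm it.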

I repeat what I have already said: always read everything from the HTTPResponse object.

A packet capture will confirm what this server is sending over the wire.
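If a packet capture is not convenient, a rough substitute is to talk to the server over a plain socket and look at the first line it sends back. This is only a sketch: `first_response_line` is a hypothetical helper, it sends its own well-formed GET rather than whatever the script sends, and it shows one direction of the exchange where a capture shows both.

```python
import socket


def first_response_line(host, port=80, path='/', timeout=10):
    """Send a minimal GET and return the first raw line of the reply."""
    request = (
        'GET %s HTTP/1.1\r\n'
        'Host: %s\r\n'
        'Connection: close\r\n'
        '\r\n' % (path, host)
    ).encode('ascii')
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        reader = sock.makefile('rb')
        try:
            # A healthy server answers with something like
            # b'HTTP/1.1 200 OK'; anything else would upset http.client.
            return reader.readline(65537).rstrip(b'\r\n')
        finally:
            reader.close()
```

If this returns a proper status line for the failing host, the server is probably fine with a normal request and the request the script sends deserves a closer look.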
