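I'm trying to port a script to Python 3. The script submits XML feeds and can be found here:
https://developers.google.com/search-appliance/documentation/files/pushfeed_client.py.txt
After running 2to3.py and making a few minor adjustments to remove the remaining syntax errors, the script fails with the following: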
(py33dev) d:\dev\workspace>python pushfeed_client.py --datasource="TEST1" --feedtype="full" --url="http://gsa:19900/xmlfeed" --xmlfilename="test.xml"
Traceback (most recent call last):
  File "pushfeed_client.py", line 108, in <module>
    main(sys.argv)
  File "pushfeed_client.py", line 56, in main
    result = urllib.request.urlopen(request_url)
  File "C:\Python33\Lib\urllib\request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\Lib\urllib\request.py", line 469, in open
    response = self._open(req, data)
  File "C:\Python33\Lib\urllib\request.py", line 487, in _open
    '_open', req)
  File "C:\Python33\Lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "C:\Python33\Lib\urllib\request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Python33\Lib\urllib\request.py", line 1253, in do_open
    r = h.getresponse()
  File "C:\Python33\Lib\http\client.py", line 1147, in getresponse
    response.begin()
  File "C:\Python33\Lib\http\client.py", line 358, in begin
    version, status, reason = self._read_status()
  File "C:\Python33\Lib\http\client.py", line 340, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine: <!DOCTYPE html>
Why does it raise that exception on the server's response? When I sniff the session, this is the full reply from the GSA:
<!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
<title>Error 400 (Bad Request)!!1</title>
<style>
*{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}
</style>
<a href=//www.google.com/><img src=//www.google.com/images/errors/logo_sm.gif alt=Google></a>
<p><b>400.</b> <ins>That’s an error.</ins>
<p>Your client has issued a malformed or illegal request. <ins>That’s all we know.</ins>
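It does return an HTTP 400. I can reliably trigger the problem whenever the XML payload contains non-ASCII (UTF-8 encoded) characters. When the payload is plain ASCII, it works perfectly. Here is the most stripped-down version of the code that reliably reproduces the problem: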
import http.client
http.client.HTTPConnection.debuglevel = 1
with open("GSA_full_Feed.xml", encoding='utf-8') as xdata:
    payload = xdata.read()
content_length = len(payload)
feed_path = "xmlfeed"
content_type = "multipart/form-data; boundary=----------boundary_of_feed_data$"
headers = {"Content-type": content_type, "Content-length": content_length}
conn = http.client.HTTPConnection("gsa", 19900)
conn.request("POST", feed_path, body=payload.encode("utf-8"), headers=headers)
res = conn.getresponse()
print(res.read())
conn.close()
Here is a sample XML payload that triggers the exception:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "gsafeed.dtd">
<gsafeed>
  <header>
    <datasource>TEST1</datasource>
    <feedtype>full</feedtype>
  </header>
  <group>
    <record action="add" mimetype="text/html" url="https://myschweetassurl.com">
      <metadata>
        <meta content="shit happens, then you die" name="description"/>
      </metadata>
      <content>wacky Umläut test of non utf-8 characters</content>
    </record>
  </group>
</gsafeed>
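The only difference I can find between the Python 2 and Python 3 versions is the Content-Length header of each request. The Python 3 value is consistently smaller than the Python 2 value: 870 vs. 873.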
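
A quick check that may be relevant to that difference (an assumption, not a confirmed diagnosis): in Python 3, len() on a str counts characters, while the request body that actually goes over the wire is the UTF-8 encoded bytes, so any non-ASCII character makes the two counts diverge. The snippet below uses a made-up sample string and a hypothetical helper, length_delta, purely for illustration; it does not contact the GSA.

```python
def length_delta(text, encoding="utf-8"):
    """Return (character count, encoded byte count) for text.

    Hypothetical helper for illustration only; not part of pushfeed_client.py.
    """
    char_count = len(text)                   # what len(payload) reports for a str
    byte_count = len(text.encode(encoding))  # size of the body actually sent
    return char_count, byte_count


# Made-up sample containing one non-ASCII character ("ä" is 2 bytes in UTF-8).
sample = "wacky Umläut test"
chars, nbytes = length_delta(sample)
print(chars, nbytes)  # 17 18
```

If that is what is happening here, computing the header from len(payload.encode("utf-8")) instead of len(payload) would make the declared and actual lengths agree. This is untested against the GSA.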