python - A good way to get the charset/encoding of an HTTP response in Python

Question

I'm looking for a simple way to get the charset/encoding information of an HTTP response using Python urllib2, or any other Python library.

>>> url = 'http://some.url.value'
>>> request = urllib2.Request(url)
>>> conn = urllib2.urlopen(request)
>>> response_encoding = ?

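I know it is sometimes present in the 'Content-Type' header, but that header carries other information too, and the charset is embedded in a string that I would need to parse. For example, the Content-Type header returned by Google is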

>>> conn.headers.getheader('content-type')
'text/html; charset=utf-8'

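I could work with that, but I'm not sure how consistent the format will be. I'm pretty sure the charset can be missing entirely, so I'd have to handle that edge case. Some kind of string split operation to get 'utf-8' out of it seems like it has to be the wrong way to do this kind of thing.

>>> content_type_header = conn.headers.getheader('content-type')
>>> if '=' in content_type_header:
...     charset = content_type_header.split('=')[1]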

That feels like code that is doing too much work. I'm also not sure it will work in every case. Does anyone have a better way to do this?

score 28 · Accepted Answer

To parse an HTTP header, you can use cgi.parse_header():

import cgi

_, params = cgi.parse_header('text/html; charset=utf-8')
print(params['charset'])  # -> utf-8

Or using the response object:

import urllib2

response = urllib2.urlopen('http://example.com')
response_encoding = response.headers.getparam('charset')
# or in Python 3: response.headers.get_content_charset(default)
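
The cgi module was deprecated in Python 3.11 and removed in 3.13, so cgi.parse_header() is not available on newer interpreters. A minimal sketch of one stdlib alternative, assuming Python 3: let email.message.Message parse the header value. The helper name charset_from_content_type is made up for this example.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type value, or `default`."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns `default`
    # when the header carries no charset parameter.
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html", default="unknown"))
```

The first call should print utf-8 and the second unknown, so a missing charset needs no special string handling.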

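In general, the server may lie about the encoding or not report it at all (the default depends on the content type), or the encoding may be specified inside the response body, e.g., in a <meta> element in html documents or in the xml declaration for xml documents. As a last resort, the encoding can be guessed from the content itself.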
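
Because the declared encoding can be wrong or missing, it may help to treat it as a hint and not as a fact. A minimal sketch, assuming Python 3 and an already downloaded body; decode_body and its fallbacks parameter are hypothetical names, and the choice of fallbacks is an assumption, not a rule:

```python
def decode_body(body, declared=None, fallbacks=("utf-8",)):
    """Try the declared encoding first, then each fallback in turn.

    Returns a (text, encoding_used) tuple.
    """
    for encoding in [declared, *fallbacks]:
        if not encoding:
            continue  # the server did not report a charset
        try:
            return body.decode(encoding), encoding
        except (LookupError, UnicodeDecodeError):
            # LookupError: Python does not know this codec name.
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            continue
    # latin-1 maps every byte to a character, so this step cannot fail.
    return body.decode("latin-1"), "latin-1"


# The header claims ASCII, but the body is really UTF-8.
print(decode_body(b"Sacr\xc3\xa9 bleu!", declared="ascii"))
```

A decode that succeeds is not proof that the encoding is right, which is why the answer mentions content-based guessing as a separate last resort.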

You can use requests to get Unicode text:

import requests # pip install requests

r = requests.get(url)
unicode_str = r.text # may use `chardet` to auto-detect encoding

Or BeautifulSoup to parse html (and convert to Unicode as a side effect):

from bs4 import BeautifulSoup # pip install beautifulsoup4

soup = BeautifulSoup(urllib2.urlopen(url)) # may use `cchardet` for speed
# ...

Or bs4.UnicodeDammit directly for arbitrary content (not necessarily html):

from bs4 import UnicodeDammit

dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
# -> Sacré bleu!
print(dammit.original_encoding)
# -> utf-8

score 7 · Accepted Answer

If you happen to be familiar with the Flask/Werkzeug web development stack, you will be happy to know that the Werkzeug library has an answer for exactly this kind of HTTP header parsing, and it accounts for the case where the content type is not specified at all, just as you wanted.

 >>> from werkzeug.http import parse_options_header
 >>> import requests
 >>> url = 'http://some.url.value'
 >>> resp = requests.get(url)
 >>> if resp.status_code == requests.codes.ok:
 ...     content_type_header = resp.headers.get('content-type')
 ...     print(content_type_header)
 text/html; charset=utf-8
 >>> parse_options_header(content_type_header)
 ('text/html', {'charset': 'utf-8'})

Then you can do:

 >>> parse_options_header(content_type_header)[1].get('charset')
 'utf-8'

Note that if charset is not supplied, this produces instead:

 >>> parse_options_header('text/html')
 ('text/html', {})

It even works if you supply nothing but an empty string or dict:

 >>> parse_options_header({})
 ('', {})
 >>> parse_options_header('')
 ('', {})

So it seems to be exactly what you were looking for! If you look at the source code, you'll see they had your purpose in mind: https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/http.py#L320-329

def parse_options_header(value):
    """Parse a ``Content-Type`` like header into a tuple with the content
    type and the options:
    >>> parse_options_header('text/html; charset=utf8')
    ('text/html', {'charset': 'utf8'})
    This should not be used to parse ``Cache-Control`` like headers that use
    a slightly different format.  For these headers use the
    :func:`parse_dict_header` function.
    ...

Hope this helps someone someday! :)

score 5 · Accepted Answer

The requests library makes this easy:

>>> import requests
>>> r = requests.get('http://some.url.value')
>>> r.encoding
'utf-8' # e.g.

score 3 · Accepted Answer

The charset can be specified in several ways, but it is often done in the headers.

>>> from urllib.request import urlopen
>>> urlopen('http://www.python.org/').info().get_content_charset()
'utf-8'
>>> urlopen('http://www.google.com/').info().get_content_charset()
'iso-8859-1'
>>> urlopen('http://www.python.com/').info().get_content_charset()
>>>

The last one didn't specify a charset anywhere, so get_content_charset() returned None.
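
Since get_content_charset() can return None, code that decodes the body needs a default. A minimal sketch, assuming Python 3; read_text is a hypothetical helper, the utf-8 default is an assumption, and FakeResponse only stands in for the object urlopen() returns so the example runs without network access:

```python
import io
from http.client import HTTPMessage


def read_text(response, default="utf-8"):
    """Read a response body and decode it with the header charset, if any."""
    charset = response.headers.get_content_charset() or default
    # errors="replace" avoids an exception when the header is wrong.
    return response.read().decode(charset, errors="replace")


class FakeResponse(io.BytesIO):
    """Stand-in for an HTTP response: a readable body plus .headers."""

    def __init__(self, body, content_type):
        super().__init__(body)
        self.headers = HTTPMessage()
        self.headers["Content-Type"] = content_type


print(read_text(FakeResponse("café".encode("utf-8"), "text/html")))
print(read_text(FakeResponse(b"caf\xe9", "text/html; charset=iso-8859-1")))
```

With a real response the call would look like read_text(urlopen(url)).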

score 1 · Accepted Answer

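To properly decode html (i.e. in a browser-like way; we can't do better than that) you need to take into account:

the Content-Type HTTP header value;
BOM marks;
<meta> tags in the page body;
differences between the encoding names defined for the web and the encoding names available in the Python stdlib;
as a last resort, if everything else fails, guessing based on statistics is an option.

All of the above is implemented in the w3lib.encoding.html_to_unicode function: it has the signature html_to_unicode(content_type_header, html_body_str, default_encoding='utf8', auto_detect_fun=None) and returns a (detected_encoding, unicode_html_content) tuple.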
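
To make the first few points concrete, here is a simplified stdlib-only sketch, assuming Python 3. It is not w3lib's implementation: sniff_encoding is a hypothetical name, the checks simply run in the order the list gives them (browsers and w3lib differ in the details), the <meta> regex is deliberately naive, and encoding names are only validated with codecs.lookup() without translating web aliases. Statistical guessing is left out.

```python
import codecs
import re
from email.message import Message

# (BOM, codec) pairs; UTF-32 first so it is not mistaken for UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]
# Naive pattern covering <meta charset=...> and the http-equiv form.
_META_RE = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.I)


def _known(name):
    """Return Python's canonical codec name, or None if Python lacks it."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def sniff_encoding(content_type, body, default="utf-8"):
    # 1. Content-Type header, if it names a codec Python knows.
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        declared = msg.get_content_charset()
        if declared and _known(declared):
            return _known(declared)
    # 2. Byte order mark at the start of the body.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    # 3. <meta> tag near the top of the document.
    match = _META_RE.search(body[:4096])
    if match:
        name = _known(match.group(1).decode("ascii"))
        if name:
            return name
    # 4. Nothing usable was found.
    return default


print(sniff_encoding("text/html", b'<meta charset="utf-8"><p>hi</p>'))
```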

requests, BeautifulSoup, UnicodeDammit, chardet and flask's parse_options_header are not correct solutions, as they all fail at some of these points.

score 0 · Accepted Answer

This works perfectly for me. I am using Python 2.7 and 3.4

print (text.encode('cp850','replace'))
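
Note that this line does not detect the response encoding; it makes already decoded text safe to print on a console that uses code page 850. A small sketch of what the 'replace' error handler does, assuming Python 3 (the sample string is made up):

```python
text = "Sacré bleu! 5 €"

# cp850 has a byte for "é" but none for "€"; with the "replace" handler
# an unencodable character becomes "?" and no exception is raised.
encoded = text.encode("cp850", "replace")
print(encoded)  # -> b'Sacr\x82 bleu! 5 ?'
```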
