python - 从未知字符编码的字符串中转储 JSON

Question

我正在尝试将 HTML 从网站转储到 JSON，我需要一种方法来处理不同的字符编码。

我读过如果不是utf-8，它可能是ISO-8859-1，所以我现在正在做的是：

for possible_encoding in ["utf-8", "ISO-8859-1"]:
   try:
      # post_dict contains, among other things, website html retrieved
      # with urllib2
      json = simplejson.dumps(post_dict, encoding=possible_encoding)
      break
   except UnicodeDecodeError:
      pass
if json is None:
      raise UnicodeDecodeError

如果我遇到任何其他编码，这当然会失败，所以我想知道在一般情况下是否有办法解决这个问题。

我首先尝试序列化 HTML 的原因是因为我需要在 POST 请求中将它发送到我们的 NodeJS 服务器。因此，如果有人有不同的解决方案允许我这样做（可能根本不需要序列化为 JSON），我也很高兴听到这个消息。

score 1 · Accepted Answer

无论您用于发送 POST 请求的媒体类型如何，您都应该知道字符编码（除非您想发送二进制 blob）。要获取 html 内容的字符编码，请参阅 A good way to get the charset/encoding of an HTTP response in Python 。

要post_dict以 json 格式发送，请确保其中的所有字符串都是 Unicode（只需在收到 html 后立即将其转换为 Unicode）并且不要使用encoding参数进行json.dumps()调用。如果不同的网站（您获取 html 字符串的位置）使用不同的编码，则该参数无论如何都不会帮助您。

python - 从未知字符编码的字符串中转储 JSON

1 回答 1

Related

Reference