python - Problem parsing JSON read from a URL

Question

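I've run into a problem that I believe has a simple solution.

I'm writing a Python script that reads a JSON string from a URL and parses it. For this I use urllib2 and simplejson.

The problem has to do with encoding. The URL I'm reading from does not explicitly state which encoding it uses (as far as I can tell), and it returns some Icelandic characters. I can't share the URL I'm actually reading from, but I've put a sample JSON data file on my own server, and I have the same problem reading it. Here is the file: http://haukurhaf.net/json.txt

This is my code: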

#!/usr/bin/env python
# coding: utf-8
import urllib2, re, os
from BeautifulSoup import BeautifulSoup
import simplejson as json

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3'

def fetchPage(url):
    req = urllib2.Request(url)
    req.add_header('User-Agent', user_agent)
    response = urllib2.urlopen(req)
    html = response.read()
    response.close()
    return html

html = fetchPage("http://haukurhaf.net/json.txt")
jsonData = json.JSONDecoder().decode(html)

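The JSON parser crashes with the following error message: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 35: invalid continuation byte

Since I have no control over the server that holds the JSON data, I can't control which encoding headers it sends. I'm hoping I can work around this somehow.

Any ideas?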

score 2 · Accepted Answer

The file is encoded in Latin-1, not UTF-8, so you have to specify the encoding:

jsonData = json.JSONDecoder('latin1').decode(html)

By the way: html is a bad name for a JSON document...
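
To see why the UTF-8 decoder rejects byte 0xe1, here is a minimal sketch. It is written for Python 3 with the standard-library json module rather than the urllib2/simplejson setup in the question, and the sample bytes are made up for illustration; they are not the contents of json.txt.

```python
import json

# Hypothetical sample payload: "á" stored as the single Latin-1 byte 0xE1.
raw = b'{"name": "\xe1"}'

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    # In UTF-8, 0xE1 opens a three-byte sequence, but the next byte here
    # is an ASCII quote, so the decoder reports an invalid continuation byte.
    utf8_error = exc

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode("latin-1")
data = json.loads(text)
```

Decoding the bytes to a Unicode string first and then parsing avoids relying on any encoding argument of the JSON decoder itself.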

score 1 · Accepted Answer

http://haukurhaf.net/json.txt

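This resource is encoded as ISO-8859-1, or, more likely, the Windows variant, code page 1252. It is not UTF-8.

You can read it with response.read().decode('cp1252') to get a Unicode string that [simple]json should also be able to parse.

However, in byte form, JSON must be encoded in one of the UTF encodings. So this is not valid JSON, and it will also fail if you try to load it from a browser.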
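
One way to act on both points is to try UTF-8 first and fall back to cp1252 only when that fails. The sketch below assumes Python 3 and the standard-library json module; parse_json_bytes is a hypothetical helper name, not part of simplejson. The last two lines show where Latin-1 and cp1252 differ: they agree on 0xE1 but not on the 0x80-0x9F range.

```python
import json

def parse_json_bytes(raw, fallback="cp1252"):
    """Decode raw bytes as UTF-8 if possible, otherwise with `fallback`, then parse."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the legacy single-byte encoding instead.
        text = raw.decode(fallback)
    return json.loads(text)

# Latin-1 and cp1252 map byte 0x80 to different characters.
as_cp1252 = b"\x80".decode("cp1252")   # the euro sign, U+20AC
as_latin1 = b"\x80".decode("latin-1")  # the C1 control character U+0080
```

The fallback is a guess, not a detection: bytes in a legacy encoding that happen to form valid UTF-8 would be decoded as UTF-8.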

score -1 · Accepted Answer

You first need to turn the string into Unicode (right now it is Latin-1):

uhtml = html.decode("latin-1")
jdata = json.loads(uhtml)

Or, if simplejson doesn't have loads:

json.JSONDecoder().decode(uhtml)
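
For reference, a rough Python 3 equivalent of the fetch-decode-parse steps might look like the sketch below. The names parse_body and fetch_json, the User-Agent string, and the "latin-1" default are assumptions for illustration. The charset declared in the Content-Type header is used when present, and the default applies only when the server declares none, as in the question.

```python
import json
from urllib.request import Request, urlopen

def parse_body(raw, declared_charset=None, default_charset="latin-1"):
    """Decode the response bytes to text, then parse the text as JSON."""
    charset = declared_charset or default_charset
    return json.loads(raw.decode(charset))

def fetch_json(url, default_charset="latin-1"):
    """Fetch `url` and parse its body, honouring a declared charset if any."""
    # Hypothetical User-Agent value; use whatever the server expects.
    req = Request(url, headers={"User-Agent": "example-client/0.1"})
    with urlopen(req) as response:
        raw = response.read()
        # Returns None when the Content-Type header carries no charset.
        declared = response.headers.get_content_charset()
    return parse_body(raw, declared, default_charset)
```

Keeping the decoding in parse_body separate from the network call makes the encoding logic easy to check without hitting a server.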
