python - json.loads fails with "No JSON object could be decoded"

Question

I am using Scrapy for web scraping. The site embeds JSON between <code> tags, for example:

<code id="content" style="display:none;"><!--{"content": "text1",...,..., "compute": "text2"}--></code>

Using XPath, I was able to extract the comment inside the <code> tag. I used:

hxs.select("//code[@id='content']/comment()").extract()

After stripping the comment characters, the content is content = "{"content": "text1",...,..., "compute": "text2"}"

When I build the JSON with json.loads(content), I get the error "ValueError: No JSON object could be decoded".

In addition, str(content) throws:

"UnicodeEncodeError: 'ascii' codec can't encode characters in position 106512-106513: ordinal not in range(128)"

The value at position 106512 is '\xa7'.

Thanks in advance.

score 2 · Accepted Answer

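str(content) failing on non-ASCII characters is expected and is not in itself a problem. Use content.encode('utf-8') if what you want is a byte string (although printing that to the console is a separate matter; see PrintFails). If you just want to show us what is in the variable, print repr(content) to get the Python-syntax representation.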
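To illustrate the distinction, here is a minimal sketch. The sample string is made up for illustration and is not the asker's actual data. It shows that encoding explicitly as UTF-8 succeeds where the ASCII codec fails, and that repr() gives an unambiguous view of the value:

```python
# Hypothetical sample; the real content comes from the scraped page.
content = u'{"content": "text1 \xa7", "compute": "text2"}'

# Explicit encoding to a byte string works for any Unicode text.
encoded = content.encode('utf-8')

# The ASCII codec cannot represent U+00A7. On Python 2, str(content)
# attempts this same conversion implicitly, which is why it fails.
try:
    content.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# repr() shows the escaped form, which is safer to paste when debugging.
print(repr(content))
print(repr(encoded))
print(ascii_ok)
```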

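"No JSON object could be decoded" means that json.loads could not even begin to find something that looks like JSON at the start of the string, so look at the repr() of that string to see whether there are any stray characters or control codes at the front, before the {.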
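One way to act on this, sketched under assumptions: the helper name parse_embedded_json and the sample input (a byte order mark plus leftover comment markers) are hypothetical, and the approach only suits payloads that consist of a single top-level JSON object.

```python
import json

def parse_embedded_json(text):
    """Parse a JSON object after dropping anything before the first '{'
    and after the last '}'. Hypothetical helper for illustration."""
    start = text.find('{')
    end = text.rfind('}')
    if start == -1 or end < start:
        raise ValueError('no JSON object found in %r' % text[:40])
    return json.loads(text[start:end + 1])

# Hypothetical input: a BOM and comment markers left around the object.
raw = u'\ufeff<!--{"content": "text1", "compute": "text2"}-->'

# Inspect what actually precedes the '{' before trying to parse.
print(repr(raw[:10]))

data = parse_embedded_json(raw)
print(data["compute"])
```

Printing the repr() of the first few characters is the diagnostic step; the slicing is only a workaround once you know what the stray prefix is.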

score 0 · Answer

The string in the JSON appears to be ISO 8859 or Windows-1252. \xa7 is § in those encodings, while \xc2\xa7 is § in UTF-8.
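The byte values mentioned here can be checked directly. This is only a small sketch using Python's built-in codecs, not a diagnosis of the asker's page:

```python
section = u'\xa7'  # U+00A7 SECTION SIGN

# Single-byte encodings store the character as one byte, 0xA7.
latin1_bytes = section.encode('iso-8859-1')
cp1252_bytes = section.encode('cp1252')

# UTF-8 needs two bytes for the same character.
utf8_bytes = section.encode('utf-8')

print(repr(latin1_bytes), repr(cp1252_bytes), repr(utf8_bytes))

# Decoding with the matching codec recovers the same character.
same = latin1_bytes.decode('iso-8859-1') == utf8_bytes.decode('utf-8') == section
print(same)
```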
