python - Inconsistent parsing of a 2.2GB JSON file

Question

I'm trying to decode a large UTF-8 JSON file (2.2 GB). I load the file like this:

import codecs
f = codecs.open('output.json', encoding='utf-8')
data = f.read()

If I try any of json.load, json.loads, or json.JSONDecoder().raw_decode, I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-fc2255017b19> in <module>()
----> 1 j = jd.decode(data)

/usr/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
    367         end = _w(s, end).end()
    368         if end != len(s):
--> 369             raise ValueError(errmsg("Extra data", s, end, len(s)))
    370         return obj
    371

ValueError: Extra data: line 1 column -2065998994 - line 1 column 2228968302
    (char -2065998994 - 2228968302)

uname -m shows x86_64, and

> python -c 'import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)'
('7fffffffffffffff', True)

So I should be on a 64-bit build, and integer size shouldn't be a problem.

However, if I run:

jd = json.JSONDecoder()
len(data) # 2228968302
j = jd.raw_decode(data)
j[1] # 2228968302

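The second value in the tuple returned by raw_decode is the end of the string, so raw_decode seems to parse the whole file, with no garbage at the end.

So, is there something I should be doing differently with the JSON? Is raw_decode actually decoding the entire file? Why does json.load(s) fail?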

score 10 · Accepted Answer

I'd add this as a comment, but the formatting capabilities in comments are too limited.

Staring at the source code,

raise ValueError(errmsg("Extra data", s, end, len(s)))

calls this function:

def errmsg(msg, doc, pos, end=None):
    ...
    fmt = '{0}: line {1} column {2} - line {3} column {4} (char {5} - {6})'
    return fmt.format(msg, lineno, colno, endlineno, endcolno, pos, end)

The (char {5} - {6}) part of the format is this part of the error message you showed:

(char -2065998994 - 2228968302)

So, in errmsg(), pos is -2065998994 and end is 2228968302. Behold! ;-):

>>> pos = -2065998994
>>> end = 2228968302
>>> 2**32 + pos
2228968302L
>>> 2**32 + pos == end
True

That is, pos and end are "really" the same. Back from where errmsg() was called, that means end and len(s) are really the same too - but end is being viewed as a 32-bit signed integer. end in turn comes from a regular expression match object's end() method.
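The wraparound can be reproduced with plain arithmetic. This is only a sketch: to_int32 is a hypothetical helper written for illustration, not something from the json or re modules, and it assumes the offset is truncated to a signed 32-bit value somewhere along the way.

```python
def to_int32(n):
    """Reinterpret the low 32 bits of n as a signed 32-bit integer.

    Hypothetical helper: it only mimics what storing n in a
    32-bit signed C integer would do.
    """
    return (n + 2**31) % 2**32 - 2**31


length = 2228968302          # len(data) reported in the question
wrapped = to_int32(length)   # the same offset seen through 32 signed bits

print(wrapped)                     # -2065998994
print(2**32 + wrapped == length)   # True
```

Any offset of 2**31 or more (about 2.1 billion characters) would come out negative under this assumption, which matches a 2.2 GB string tripping the check.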

So the real problem here appears to be a 32-bit limitation/assumption in the regexp engine. I encourage you to open a bug report!

Later: to answer your questions, yes, raw_decode() is decoding the entire file. The other methods call raw_decode(), but add the (failing!) sanity checks afterwards.
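Given that, one possible workaround is to call raw_decode() yourself and do the trailing-data check without the regexp. The sketch below is an assumption-laden illustration, not the library's own code: decode_whole is a hypothetical name, it expects the text to start directly with the JSON value (no leading whitespace), and it uses str.strip(), which is slightly more permissive than JSON's definition of whitespace.

```python
import json


def decode_whole(text):
    """Decode one JSON document and reject trailing non-whitespace data.

    Hypothetical helper: it relies on raw_decode() and a plain slice
    instead of the regexp-based check that decode() performs.
    """
    decoder = json.JSONDecoder()
    obj, end = decoder.raw_decode(text)   # end is the index just past the value
    if text[end:].strip():                # anything left besides whitespace?
        raise ValueError("Extra data starting at char %d" % end)
    return obj


print(decode_whole('{"a": [1, 2]}  \n'))
```

With the data from the question this would be called as decode_whole(data). Whether it avoids the problem on your build depends on raw_decode() itself returning a correct end index, which your output above suggests it does.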
