I'm pulling tweets via the Twitter streaming API and tokenizing the text. I want to store everything. My current code involves
with open("tweets.json", "a") as f:
    f.write(cjson.encode(tweets))
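However, when I try to decode the same file, I get errors! It's Twitter, so the actual content is all over the place: URLs, unicode, etc. Is there something like re.escape for JSON? I don't know enough about JSON to write something that escapes every potential fly in the ointment, and I'd rather not spend the time. I read about the strict parameter, but I'm not sure that would be enough.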
ETA: Here's the sample data everybody's been asking for. Sorry for being vague:
[["Just", "hanging", "with", "my", "cousins", "#tbt", "#adorable", "#grandmashouse", "@jess_lufrano", "@gabalvarezxo", "@robbybacs", "http://t.co/wgDntda7WB"], ["going", "to", "do", "things.", "Horrible", "things.", "Things", "done", "only", "in", "nightmares.", ">:>", "#muhahaha"], ["#truelove", "http://t.co/fEfT797Xit"], ["IMG_5667:", "Savini", "Francesco", "has", "added", "a", "photo", "to", "the", "pool:", "", "http://t.co/XYFsFIHG3M", "#national", "#pics"], ["I", "would", "rather", "11", "million", "Romanians", "and", "Bulgarians", "in", "Bromsgrove", "than", "one", "Sajid", "Javid", "#bbcqt"], ["lol", "Fuck", "around", "been", "the", "midgets!", "#OH", "#NO"], ["TODAY's", "SHOW:", "@markMGgeyer", "&", "@GusWorland's", "trip", "to", "Gallipoli", "on", "#anzacday", "+", "Sad", "revelations", "about", "Jon", "Mannah", "+", "Ray", "Martin."], ["Using", "valued", "objects", "for", "currency", "is", "fascinating.", "I", "want", "to", "see", "that", "really", "explored.", "#doctorwho"], ["@KevinMallonTri", "ya", "buddy!", "You", "know", "I'm", "ready..I", "leave", "tomorrow.#ready2Race"], ["My", "mom", "has", "two", "different", "lights", "with", "two", "different", "colour", "temps", "and", "it", "bugs", "me.", "I", "think", "there", "is", "something", "wrong", "with", "me.", "#filmkidproblems"], ["#Golf", "#PGA", "Quail", "Hollow", "bullish", "despite", "greens,", "no", "Tiger", "Woods", "-", "Charlotte", "Business", "Journal...", "http://t.co/UWn98AwpGT", "#MustFollow", "TWNews"], ["So", "what's", "the", "next", "#jam", "theme?"], ["#Me", "&", "my", "#homegirl", "solange", "#throwback", "#tbt", "#picoftheday", "#photo", "#instapic", "#instabomb", "#years", "#ago", "#boat\u2026", "http://t.co/86X0A2xRDa"],...
(Note: I truncated the sample, but I double-checked the file and it ends with ]], as I'm pretty sure it should. Then again, I'm not exactly Cap'n Json.)
And the error:
decoder.decode(open("tweets.json").read())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 1 column 3243 (char 3243)
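Actually, while we're here: is there any difference between the various Python JSON libraries (simplejson/json, cjson, ujson, etc.) w/r/t this kind of thing? Are any of them more "escape-y" on the encoding side or more forgiving on the decoding side? I don't care much about speed, only about it being hassle-free.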
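For comparison, here is a minimal sketch of the round trip I'm after, written against the standard-library json module on both sides instead of cjson (that swap is an assumption, not what I'm currently running). The helper names append_tweets and load_tweets and the sample tokens are hypothetical. It writes one JSON array per line, since appending whole arrays to the same file would otherwise leave several documents back to back.

```python
import json
import os
import tempfile


def append_tweets(path, tweets):
    # Write one JSON document per line so repeated appends
    # never produce "[...][...]" in a single unparseable blob.
    with open(path, "a") as f:
        f.write(json.dumps(tweets) + "\n")


def load_tweets(path):
    # Decode each non-empty line on its own and merge the batches.
    tweets = []
    with open(path) as f:
        for line in f:
            if line.strip():
                tweets.extend(json.loads(line))
    return tweets


# Hypothetical sample with the kinds of tokens that worry me:
# a unicode ellipsis, a URL, a backslash, and a double quote.
sample = [
    ["#boat\u2026", "http://t.co/86X0A2xRDa"],
    [">:>", "back\\slash", 'quote"d'],
]

path = os.path.join(tempfile.mkdtemp(), "tweets.json")
append_tweets(path, sample)
append_tweets(path, sample)
restored = load_tweets(path)
```

If restored comes back equal to the two batches concatenated, the encoder and decoder agree on escaping for these inputs. It says nothing about whether cjson's output is readable by json.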