I'm using ijson to parse through large JSONs. I have this code, which should give me a dict of values corresponding to the relevant JSON fields:
def parse_kvitems(kv_gen, key_list):
    results = {}
    for key in key_list:
        results[key] = (v for k, v in kv_gen if k == key)
    return results
import zipfile

import ijson

with zipfile.ZipFile(fr'{directory}\{file}', 'r') as zipObj:
    # Get a list of all archived file names from the zip
    listOfFileNames = zipObj.namelist()
    # Iterate over the file names
    for fileName in listOfFileNames:
        # Only process .json files. Don't extract: ijson wants bytes input, and
        # json.loads can run into memory issues with smash JSONs. No documentation available
        if fileName.endswith('.json'):
            # Open a single file from the zip without extracting it
            with zipObj.open(fileName) as f:
                # HERE:
                records = ijson.kvitems(f, 'records.item')
                data_list = ['id', 'features', 'modules', 'dbxrefs', 'description']
                # Expected: a dict of the values that fall under the JSON headings in data_list
                parsed_records = parse_kvitems(records, data_list)
I think the kvitems object is acting like a generator and only making it through one run-through (I get the expected values for 'id', but the other data_list keys in parsed_records are empty).
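To illustrate the single-pass behaviour described above, here is a minimal sketch that uses a plain Python iterator as a stand-in for the kvitems object (the sample pairs are made up, and ijson itself is not involved). Once the first filter has read every pair, a second filter over the same iterator finds nothing left:

```python
# Stand-in for ijson.kvitems(...): a one-shot iterator of (key, value) pairs.
# The sample data is invented for illustration.
pairs = iter([("id", 1), ("features", ["a"]), ("id", 2), ("features", ["b"])])

# The first pass reads every pair, keeping only the 'id' values.
ids = [v for k, v in pairs if k == "id"]

# The second pass starts where the first one stopped: at the end.
features = [v for k, v in pairs if k == "features"]

print(ids)       # [1, 2]
print(features)  # []
```

This would match the symptom of one populated key and empty results for the rest, assuming the values for the first key were fully consumed before the others were inspected.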
To get around this I tried to make a list of duplicate kv_gen's:
def parse_kvitems(kv_gen, key_list):
    kv_list = [kv_gen] * len(key_list)  # this bit
    results = {}
    for key, kv_gen in zip(key_list, kv_list):
        results[key] = (v for k, v in kv_gen if k == key)
    return results
This gave me the same result. I think mutability may be a culprit here, but I can't use copy on the kvitems object to see if this fixes it.
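A short sketch of what list multiplication does here, again with a plain iterator standing in for the kvitems object (sample data invented). `[obj] * n` repeats a reference to the same object rather than copying it, so every entry shares one read position. `itertools.tee` from the standard library is one way to get independent iterators over a single source; whether it is a good fit for very large inputs is an open question, since it buffers items that one branch has read and another has not:

```python
import itertools

# Stand-in for the kvitems object; the pairs are invented for illustration.
pairs = iter([("id", 1), ("features", ["a"]), ("id", 2)])

# List multiplication repeats the reference, it does not duplicate the iterator.
aliases = [pairs] * 3
same_object = all(a is pairs for a in aliases)
print(same_object)  # True

# itertools.tee returns independent iterators backed by an internal buffer.
first, second = itertools.tee(pairs, 2)
ids = [v for k, v in first if k == "id"]
features = [v for k, v in second if k == "features"]
print(ids)       # [1, 2]
print(features)  # [['a']]
```

Because `first` is drained completely before `second` is touched, `tee` has to hold every pair in memory at that point, which may defeat the purpose of streaming a large JSON.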
I then tried to use itertools.cycle(), but this seems to be working in a way I don't understand:
def parse_kvitems(kv_gen, key_list):
    infinite_kvitems = itertools.cycle(kv_gen)
    results = {}
    for key in key_list:
        results[key] = (v for k, v in infinite_kvitems if k == key)
    return results
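For reference, a small sketch of how `itertools.cycle` behaves, using a plain iterator as a stand-in for the kvitems object (sample data invented). `cycle` stores each item during the first pass and then replays the stored items forever, so a filter over it never finishes on its own; `itertools.islice` is used here only to cut the output off:

```python
import itertools

# Stand-in for the kvitems object; the pairs are invented for illustration.
pairs = iter([("id", 1), ("features", ["a"]), ("id", 2)])

endless = itertools.cycle(pairs)
ids = (v for k, v in endless if k == "id")

# Without islice, list(ids) would never return.
first_five = list(itertools.islice(ids, 5))
print(first_five)  # [1, 2, 1, 2, 1]
```

Two consequences follow from this: calling `list()` on any of the generators would loop indefinitely, and `cycle` keeps a copy of every pair in memory, which works against streaming a large file.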
Also, the below works (in the sense it gives me values that match what I see when I load a JSON with json.load()):
records = ijson.kvitems(f, 'records.item')
ids = (v for k, v in records if k == 'id')
features = (v for k, v in records if k == 'features')
modules = (v for k, v in records if k == 'modules')
I'm just interested in why my function doesn't, especially when the records object is being run through multiple times above...
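One difference between the function and the standalone expressions that may be worth checking, separately from exhaustion: a generator expression inside a function looks up `key` only when it is consumed, by which time the loop has finished and `key` holds the last entry of the key list. The standalone version compares against literals such as 'id', so it is not affected. A minimal sketch with a plain list standing in for the kvitems data (the sample pairs and the helper `values_for` are hypothetical):

```python
def parse_lazy(kv_gen, key_list):
    # Same shape as the function in the question.
    results = {}
    for key in key_list:
        results[key] = (v for k, v in kv_gen if k == key)
    return results


def values_for(kv_gen, key):
    # Hypothetical helper: 'key' is a parameter, so it is fixed per call.
    return (v for k, v in kv_gen if k == key)


sample = [("id", 1), ("features", "f1"), ("id", 2), ("features", "f2")]

lazy = parse_lazy(iter(sample), ["id", "features"])
# By the time this runs, 'key' inside parse_lazy is already "features".
from_id_slot = list(lazy["id"])
print(from_id_slot)  # ['f1', 'f2']

# Each key gets a fresh iterator here only because 'sample' is a list;
# a real kvitems object cannot be restarted this way.
bound = {key: values_for(iter(sample), key) for key in ["id", "features"]}
bound_ids = list(bound["id"])
print(bound_ids)  # [1, 2]
```

Binding the key fixes the comparison but not the shared read position, so it does not by itself make several lazy filters over one kvitems object safe.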
Edit for Rodrigo
You are not showing however how you find that your final dictionary has values for id but not for the other keys. I'm assuming it's only because you are iterating over the values under the parse_records['id'] values first. As you do so, the generator expression is then evaluated and the underlying kvitems generator is exhausted.
Yup, this is correct - I was converting each val to a list to check each key had a generator containing the same number of items, as I was worried a downstream zip operation might truncate some values if they had more objects than the smallest generator.
I didn't convert to a list in the function as I thought a generator would be a better object to return (less memory intensive etc.), which I could then convert to a list if I needed to outside the function.
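If lists per key are acceptable after all, one possible alternative is to read the pairs once and sort them into buckets as they arrive. This is only a sketch: `collect_kvitems` is a hypothetical name, the sample pairs are invented, and it gives up laziness by holding the selected values in memory:

```python
def collect_kvitems(kv_iter, key_list):
    """Read (key, value) pairs once and group the wanted values by key."""
    wanted = set(key_list)
    results = {key: [] for key in key_list}
    for k, v in kv_iter:
        if k in wanted:
            results[k].append(v)
    return results


# Stand-in for ijson.kvitems(f, 'records.item'); the data is invented.
pairs = iter([("id", 1), ("features", "f1"), ("other", 0),
              ("id", 2), ("features", "f2")])

collected = collect_kvitems(pairs, ["id", "features", "modules"])
print(collected)  # {'id': [1, 2], 'features': ['f1', 'f2'], 'modules': []}

# The per-key counts can then be compared before any zip.
lengths = {key: len(values) for key, values in collected.items()}
```

A single pass avoids the question of which generator is consumed first, at the cost of the memory that returning generators was meant to save.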
You say that your final piece of code works as expected. This is the only bit that surprises me, especially if you really, really inspected (i.e., evaluated) all three of the generator expressions after you created them. If you could clarify if that's the case it would be interesting; otherwise if you created all three generator expressions, but then evaluated one or the other, then there are no surprises here (because of the "About result collection" explanation).
Basically, it gave me the values I was expecting when I ran through the items as a zipped collection of generators and appended the items to a list. But this might need some more investigation: the JSONs are quite convoluted, so I might have missed something.
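A possible explanation for the zipped case, sketched with a plain iterator standing in for the kvitems object (sample data invented): `zip` asks each generator for one value in turn, so the shared iterator is consumed in an interleaved way. If every record lists its keys in the same order as the generators are zipped, each value is picked up as it passes; if the order differs, pairs are skipped silently:

```python
def zipped_rows(pair_list):
    # One shared one-shot iterator, three lazy filters over it.
    pairs = iter(pair_list)
    ids = (v for k, v in pairs if k == "id")
    features = (v for k, v in pairs if k == "features")
    modules = (v for k, v in pairs if k == "modules")
    return list(zip(ids, features, modules))


# Keys appear in the same order as the generators are zipped.
in_order = zipped_rows([
    ("id", 1), ("features", "f1"), ("modules", "m1"),
    ("id", 2), ("features", "f2"), ("modules", "m2"),
])
print(in_order)  # [(1, 'f1', 'm1'), (2, 'f2', 'm2')]

# 'features' comes before 'id': it is skipped while searching for 'id'.
out_of_order = zipped_rows([
    ("features", "f1"), ("id", 1), ("modules", "m1"),
])
print(out_of_order)  # []
```

So the zipped result can look right when the key order is consistent across records, while still depending on that order in a way that is easy to miss.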