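I am trying to modify one of my existing scripts, which uses uproot to read data from a ROOT file into a pandas DataFrame using uproot.pandas.iterate. Currently it only reads branches containing simple data types (floats, ints, bools), but I would like to add the ability to read some branches that store 3x3 matrices. From reading the README, I understand that in this case it is recommended to flatten the structure by passing flatten=True as an argument to the iterate function. However, when I do this, it crashes: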
Traceback (most recent call last):
File "genPreselTuples.py", line 338, in <module>
data = read_events(args.decaymode, args.tag, args.year, args.polarity, chunk=args.chunk, numchunks=args.numchunks, verbose=args.verbose, testing=args.testing)
File "genPreselTuples.py", line 180, in read_events
for df in uproot.pandas.iterate(filename_list, treename, branches=list(branchdict.keys()), entrysteps=100000, namedecode='utf-8', flatten=True):
File "/afs/cern.ch/work/d/djwhite/miniconda3/envs/D02HHHHml/lib/python3.8/site-packages/uproot/tree.py", line 117, in iterate
for start, stop, arrays in tree.iterate(branches=branchesinterp, entrysteps=entrysteps, outputtype=outputtype, namedecode=namedecode, reportentries=True, entrystart=0, entrystop=tree.numentries, flatten=flatten, flatname=flatname, awkwardlib=awkward, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=blocking):
File "/afs/cern.ch/work/d/djwhite/miniconda3/envs/D02HHHHml/lib/python3.8/site-packages/uproot/tree.py", line 721, in iterate
out = out()
File "/afs/cern.ch/work/d/djwhite/miniconda3/envs/D02HHHHml/lib/python3.8/site-packages/uproot/tree.py", line 678, in <lambda>
return lambda: uproot._connect._pandas.futures2df([(branch.name, interpretation, wrap_again(branch, interpretation, future)) for branch, interpretation, future, past, cachekey in futures], outputtype, start, stop, flatten, flatname, awkward)
File "/afs/cern.ch/work/d/djwhite/miniconda3/envs/D02HHHHml/lib/python3.8/site-packages/uproot/_connect/_pandas.py", line 162, in futures2df
array = array.view(awkward.numpy.dtype([(str(i), array.dtype) for i in range(functools.reduce(operator.mul, array.shape[1:]))])).reshape(array.shape[0])
ValueError: When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array.
My code is as follows:
# prepare for file reading
data = pd.DataFrame() # create empty dataframe to hold final output data
file_counter = 0 # count how many files have been processed
event_counter = 0 # count how many events were in input files that have been processed
# loop over files in filename_list & add contents to dataframe
for df in uproot.pandas.iterate(filename_list, treename, branches=list(branchdict.keys()), entrysteps=100000, namedecode='utf-8', flatten=True):
    df.rename(branchdict, axis='columns', inplace=True) # rename branches to custom names (defined in dictionary)
    file_counter += 1 # manage file counting
    event_counter += df.shape[0] # manage event counting
    print(df.head(10)) # debugging
    # apply all cuts
    for cut in cutlist:
        df.query(cut, inplace=True)
    # append events to dataframe of data
    data = data.append(df, ignore_index=True)
    # terminal output
    print('Processed '+format(file_counter,',')+' chunks (kept '+format(data.shape[0],',')+' of '+format(event_counter,',')+' events ({0:.2f}%))'.format(100*data.shape[0]/event_counter), end='\r')
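Since speed matters here, one aside on the accumulation step: DataFrame.append copies the whole accumulated frame on every call and was removed in pandas 2.0. A minimal sketch of the same loop, collecting the filtered chunks in a list and concatenating once, might look like the following. The helper name accumulate and the toy columns pt and q are hypothetical stand-ins, and the synthetic chunks replace the frames that uproot.pandas.iterate would yield.

```python
import pandas as pd


def accumulate(chunks, cuts):
    """Apply every cut to each chunk, then concatenate the survivors once.

    `chunks` is any iterable of DataFrames (hypothetical stand-in for the
    frames produced by the iterate call); `cuts` is a list of query strings.
    """
    kept = []    # filtered chunks, concatenated once at the end
    n_seen = 0   # number of events read before any cut
    for df in chunks:
        n_seen += len(df)
        for cut in cuts:
            df = df.query(cut)  # returns a new frame; the input is not modified
        kept.append(df)
    data = pd.concat(kept, ignore_index=True) if kept else pd.DataFrame()
    return data, n_seen


# Toy chunks with made-up columns, purely for illustration.
chunks = [
    pd.DataFrame({"pt": [1.0, 5.0, 9.0], "q": [1, -1, 1]}),
    pd.DataFrame({"pt": [7.0, 2.0], "q": [-1, -1]}),
]
data, n_seen = accumulate(chunks, ["pt > 3", "q < 0"])
print(f"kept {len(data):,} of {n_seen:,} events")
```

This does not address the ValueError itself; it only avoids the repeated copying in the loop.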
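I have been able to get it working with flatten=False (when printing the DataFrame, it breaks the values out into columns similar to what is shown here: https://github.com/scikit-hep/uproot#multiple-values-per-event-fixed-size-arrays).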
eventNumber runNumber totCandidates nCandidate ... D0_SubVtx_234_COV_[1][2] D0_SubVtx_234_COV_[2][0] D0_SubVtx_234_COV_[2][1] D0_SubVtx_234_COV_[2][2]
0 13769776 177132 3 0 ... -0.016343 0.032616 -0.016343 0.470791
1 13769776 177132 3 1 ... -0.016343 0.032616 -0.016343 0.470791
2 13769776 177132 3 2 ... -0.016343 0.032616 -0.016343 0.470791
3 36250092 177132 2 0 ... 0.004726 -0.017212 0.004726 0.193447
4 36250092 177132 2 1 ... 0.004726 -0.017212 0.004726 0.193447
[5 rows x 296 columns]
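But I understand from the README that not flattening these structures is not recommended, at least for speed reasons, and since I have O(10^8) rows to get through, speed is somewhat of a concern. I am interested in what is causing this, so that I can find the best way to handle these objects (and eventually write them out to a new file). Thanks!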
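EDIT: I have narrowed the problem down to the branches option. If I manually specify a few branches (e.g. branches=['eventNumber, D0_SubVtx_234_COV_']), then it works fine with both flatten=True and flatten=False. But when I use list(branchdict.keys()), it gives the ValueError shown at the top of the original question.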
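I have checked this list, and all of its elements are real branch names (otherwise it would give a KeyError). It contains 206 regular branches, some holding standard data types and others holding length-1 lists of a single data type, plus 10 branches containing 3x3-matrix-like objects.
If I remove the branches containing matrices from this list, it works as expected. The same is true if I remove only the length-1 lists. The crash occurs whenever I try to read both the (separate) branches containing these length-1 lists and the branches containing these 3x3 matrices.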
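For reference, the same ValueError can be reproduced with NumPy alone. The sketch below assumes that a 3x3 branch arrives as a float64 array of shape (N, 3, 3), which I have not verified inside uproot, and applies the same view-as-record pattern as the failing line in the traceback. The names cov, record_dtype and flat are mine, not uproot's.

```python
import numpy as np

n_events = 5
# Assumed stand-in for one 3x3 matrix branch: shape (N, 3, 3), float64.
cov = np.zeros((n_events, 3, 3), dtype=np.float64)

# One record field per matrix element, built the same way as in the traceback.
n_fields = int(np.prod(cov.shape[1:]))  # 9 fields for a 3x3 matrix
record_dtype = np.dtype([(str(i), cov.dtype) for i in range(n_fields)])

# The record is 72 bytes wide, but the last axis holds only 3 * 8 = 24 bytes,
# so NumPy refuses to reinterpret the array.
try:
    cov.view(record_dtype)
    view_error = None
except ValueError as err:
    view_error = str(err)
print(view_error)

# Collapsing the trailing axes to (N, 9) first makes the last axis exactly
# one record wide, and the same view then succeeds.
flat = cov.reshape(n_events, n_fields).view(record_dtype).reshape(n_events)
print(flat.shape, flat.dtype.names)
```

If that assumption holds, it would suggest the record view is being applied to an array with more than one trailing axis, but it does not explain why the error only appears when the length-1 list branches are read at the same time.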