
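I am using Python 3.6.5 and have a jagged TTree with a multidimensional structure. The data is spread across more than 1000 files, all of which have the same TTree structure.

Say I have two files, which I will call fname1.root and fname2.root.

The following code has no problem opening either one of them on its own: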

import uproot as upr
import awkward
import boost_histogram as bh
import math
import matplotlib.pyplot as plt
#
# define a plotting program
# def plotter(h)
#
# preparing the file location for files
pth = '/fullpathName/'
fname1 = 'File755.root'
fname2 = 'File756.root'
fileList = [pth+fname1, pth+fname2]
#
# print out the path and filename that I've made to show the user
for file in fileList:
    print(file)
print('\n')
#
# Let's make a histogram This one has 50 bins, starts at zero and ends at 1000.0
# It will be a histogram of Jet pT's. 
jhist = bh.histogram(bh.axis.regular(50,0.0,1000.0))
#
#show what you've just done
print(jhist)
#
# does not work, only fills first file!
for chunk in upr.iterate(fileList,"bTag_AntiKt4EMTopoJets",["jet_pt"]):
    jhist.fill(chunk[b"jet_pt"][:, :2].flatten()*0.001)
#
#
# what does my histogram look like?
ptHist = plt.bar(jhist.axes[0].centers, jhist.view(), width=jhist.axes[0].widths)
plt.show()

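As I said, the code above works if I put only one file in "fileList".

The naive approach does not work. If I create the "list" of files using

files = [pth+fname1 , pth+fname2]

and rerun that code, I get the following error, which is very similar to the errors I have been getting all along.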

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 48, in <module>
  File "/home/huffman/.local/lib/python3.6/site-packages/uproot/tree.py", line 116, in iterate
    for tree, branchesinterp, globalentrystart, thispath, thisfile in _iterate(path, treepath, branches, awkward, localsource, xrootdsource, httpsource, **options):
  File "/home/huffman/.local/lib/python3.6/site-packages/uproot/tree.py", line 163, in _iterate
    file = uproot.rootio.open(path, localsource=localsource, xrootdsource=xrootdsource, httpsource=httpsource, **options)
  File "/home/huffman/.local/lib/python3.6/site-packages/uproot/rootio.py", line 54, in open
    return ROOTDirectory.read(openfcn(path), **options)
  File "/home/huffman/.local/lib/python3.6/site-packages/uproot/rootio.py", line 51, in <lambda>
    openfcn = lambda path: MemmapSource(path, **kwargs)
  File "/home/huffman/.local/lib/python3.6/site-packages/uproot/source/memmap.py", line 21, in __init__
    self._source = numpy.memmap(self.path, dtype=numpy.uint8, mode="r")
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_94python3/x86_64-slc6-gcc8-opt/lib/python3.6/site-packages/numpy/core/memmap.py", line 264, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
OSError: [Errno 12] Cannot allocate memory

1 Answer


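Lazy arrays are just a convenient interface: you can transform one with a single function call instead of iterating in an explicit loop over chunks. Internally, a lazy array contains an implicit loop over chunks, so if you run out of memory one way, you will run out the other way too.

Your problem is not closing files (they are memory-mapped, so "closed" has no clear meaning; at all times they are a view into memory that the operating system allocates for itself anyway). Your problem is deleting arrays. Arrays are the only thing that can use up all the memory on your computer.

There are a few things you can do here. One is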

for chunk in uproot.iterate(files, "bTag_AntiKt4EMTopoJets", ["jet_pt", "jet_eta"]):
    # fill with chunk[b"jet_pt"] and chunk[b"jet_eta"], which correspond
    # to the same sets of events, one-to-one.
    pass  # placeholder so the loop is valid Python; replace with the fill calls

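which loops over the chunks explicitly ("explicit" because you see and control the loop here, and because you have to specify which branches to load into the dict chunk). You can control the size of the chunks with entrysteps. The other is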

cache = uproot.ArrayCache("1 GB")
events = uproot.lazyarrays(files, "bTag_AntiKt4EMTopoJets", cache=cache)

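which keeps the loop implicit. When the ArrayCache reaches its 1 GB limit, it evicts array chunks, so they have to be loaded again the next time they are needed. If you set the limit too small, the cache will not be able to hold even one chunk; if you set it too large, you will run out of memory.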
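To make the eviction trade-off concrete, here is a minimal pure-Python sketch of a byte-bounded, least-recently-used cache. It is not uproot's ArrayCache and makes no claim about how that class works internally; ToyByteCache, load_chunk and the 400-byte chunk size are all hypothetical names and numbers chosen for illustration.

```python
from collections import OrderedDict


class ToyByteCache:
    """Hypothetical least-recently-used cache bounded by total bytes.

    This is NOT uproot's ArrayCache; it only mimics the behaviour
    described above: old chunks are evicted once the limit is reached.
    """

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # key -> (payload, size), oldest first

    def get(self, key, loader):
        """Return the cached payload, calling loader(key) on a miss."""
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key][0]
        payload = loader(key)
        size = len(payload)
        if size > self.limit_bytes:
            # A chunk larger than the whole cache can never be kept.
            return payload
        # Evict the least recently used chunks until the new one fits.
        while self.used_bytes + size > self.limit_bytes:
            _, (_, old_size) = self._items.popitem(last=False)
            self.used_bytes -= old_size
        self._items[key] = (payload, size)
        self.used_bytes += size
        return payload


loads = []  # records every (re)load so evictions become visible


def load_chunk(key):
    loads.append(key)
    return bytes(400)  # pretend each chunk occupies 400 bytes


cache = ToyByteCache(limit_bytes=1000)  # room for two 400-byte chunks
for key in ["a", "b", "c", "a"]:
    cache.get(key, load_chunk)

print(loads)
```

In this toy run, chunk "a" is loaded twice because loading "c" pushed it out. That is the cost of a limit that is too small, while memory use never exceeds the limit.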

By the way, although you reported a memory problem, your code has another major performance issue: you are looping over events in Python. Instead of

events.jet_pt[i][:2]*0.001

to get the jet pT of event i, do

events.jet_pt[:, :2]*0.001

to get the jet pT of all events as a single array. You may then need to .flatten() that array to fit the histogram's fill method.
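For readers unsure what the jagged slice followed by .flatten() and the 0.001 factor actually computes, the sketch below spells out the same result on made-up data, one chunk at a time. It uses plain Python loops only to make the semantics visible; in real code the array expression does this without a Python loop. The helper names, the chunk contents and the five-bin histogram are assumptions for illustration, and pT is assumed to be stored in MeV.

```python
def leading_two_gev(jet_pt_mev):
    """Flattened pT of the first two jets in every event, in GeV."""
    return [pt * 0.001 for event in jet_pt_mev for pt in event[:2]]


def fill_counts(counts, values, low, high):
    """Add values to equal-width bins covering [low, high)."""
    width = (high - low) / len(counts)
    for value in values:
        if low <= value < high:
            counts[int((value - low) / width)] += 1


# Made-up chunks, standing in for what an iteration over files would yield.
# Each inner list is one event; events may hold any number of jets.
chunks = [
    [[250000.0, 120000.0, 30000.0], [], [900000.0]],
    [[450000.0, 60000.0], [75000.0, 20000.0, 10000.0]],
]

counts = [0] * 5  # five bins of 200 GeV from 0 to 1000 GeV
for jet_pt in chunks:
    # Only one chunk is held at a time; the histogram keeps just the counts.
    fill_counts(counts, leading_two_gev(jet_pt), 0.0, 1000.0)

print(counts)
```

Events with fewer than two jets simply contribute fewer values, which is why the result has to be flattened before it can be passed to a one-dimensional fill.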

answered 2019-11-01T19:30:30.427