
I have a single-level nested array, and I'd like to calculate the running sum at the deepest level:

<JaggedArray [[0.8143442176354316 0.18565578236456845] [1.0] [0.8029232081440607 0.1970767918559393] ... [0.42036116755776154 0.5796388324422386] [0.18512572262194366 0.31914669745950724 0.13598232751162054 0.3597452524069286] [0.34350475143310905 0.19023361856972956 0.4662616299971615]] at 0x7f8969e32af0>

After doing something like numpy.cumsum(jagged_array), I'd like to have:

[[0.8143442176354316 1.0] [1.0] [0.8029232081440607 1.0] ...

In short - the running sum at the deepest level (which is restarted with each new "event").

I'm using awkward0, and the documentation says that broadcasting is done at the deepest level. However, I get an error when I try handing a JaggedArray directly to numpy.cumsum: operands could not be broadcast together with shapes (2,) (3,)

The dataset is large, so I'd like to stay within the awkward system and avoid Python loops when processing it.


2 Answers


I think you just want to call np.cumsum on each list within the larger list. If I've misunderstood your intent, please let me know.

In that case:

result = [np.cumsum(one_list) for one_list in jagged_array]
answered 2020-12-06T23:05:01.270

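There isn't a "high-level" way to do this, one that is independent of knowledge of the array's layout, but I can walk you through it.

Awkward 0.x (obsolete)

Suppose you have a simple jagged array,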

>>> import awkward0
>>> import numpy as np
>>> array = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> array.layout
 layout 
[    ()] JaggedArray(starts=layout[0], stops=layout[1], content=layout[2])
[     0]   ndarray(shape=3, dtype=dtype('int64'))
[     1]   ndarray(shape=3, dtype=dtype('int64'))
[     2]   ndarray(shape=5, dtype=dtype('float64'))

You can apply the cumulative sum to the content:

>>> np.cumsum(array.content)
array([ 1.1,  3.3,  6.6, 11. , 16.5])

and wrap it up as a new jagged array:

>>> scan = awkward0.JaggedArray.fromoffsets(array.offsets, np.cumsum(array.content))
>>> scan
<JaggedArray [[1.1 3.3000000000000003 6.6] [] [11.0 16.5]] at 0x7f0621a826a0>
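
As a rough illustration of what the offsets and content pair encodes, here is a minimal pure-Python sketch that does not use Awkward at all. The helper name unflatten is made up for this example and is not part of any library. It assumes list i is simply content[offsets[i]:offsets[i+1]], which matches the layout printed above.

```python
from itertools import accumulate

# Flat data and list boundaries for [[1.1, 2.2, 3.3], [], [4.4, 5.5]].
content = [1.1, 2.2, 3.3, 4.4, 5.5]
offsets = [0, 3, 3, 5]

def unflatten(offsets, content):
    """Hypothetical helper: regroup flat content into lists using offsets."""
    return [content[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]

# The original nested lists, recovered from the flat representation.
lists = unflatten(offsets, content)

# A running sum over the flat content, regrouped with the same offsets.
# The total carries across list boundaries, as in the wrapped result above.
flat_scan = list(accumulate(content))
scan_lists = unflatten(offsets, flat_scan)
```

Because the offsets are reused unchanged, the regrouped scan has the same list lengths as the input, including the empty list.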

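Awkward 1.x

The offsets and content structure that we manipulated directly in Awkward 0.x is now hidden inside a "layout," to distinguish high-level operations (which don't need to know the exact layout) from low-level operations (which do). There is no high-level solution to this problem, and the low-level way is the same as above, but it involves extra wrapping and unwrapping.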

>>> import awkward as ak
>>> import numpy as np
>>> array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> array
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
>>> layout = array.layout
>>> layout
<ListOffsetArray64>
    <offsets><Index64 i="[0 3 3 5]" offset="0" length="4" at="0x55737ef6f880"/></offsets>
    <content><NumpyArray format="d" shape="5" data="1.1 2.2 3.3 4.4 5.5" at="0x55737ef71890"/></content>
</ListOffsetArray64>

As before, you can take the cumulative sum of the content:

>>> np.cumsum(layout.content)
array([ 1.1,  3.3,  6.6, 11. , 16.5])

and this is how it gets wrapped up as a structure:

>>> scan = ak.Array(
...     ak.layout.ListOffsetArray64(
...         layout.offsets,
...         ak.layout.NumpyArray(
...             np.cumsum(layout.content)
...         )
...     )
... )
...
>>> scan
<Array [[1.1, 3.3, 6.6], [], [11, 16.5]] type='3 * var * float64'>

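What if you want a per-list scan?

If you want a solution like Frank Yellin's, in which the scan starts anew in each list, then the fact that we did the np.cumsum on the content is a problem. Specifically, the third list starts with 11 rather than 4.4.

A vectorized way to do this is to subtract the first element of each list in scan from the whole list, then add the first element of each list in array back in. In both Awkward 0.x and 1.x, this can be done with a slice like array[:, 0] and broadcasting, but empty lists (if you have any) would be a problem. Awkward 1.x has enough alternatives to get around that: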

>>> ak.firsts(scan)
<Array [1.1, None, 11] type='3 * ?float64'>

>>> scan - ak.firsts(scan)
<Array [[0, 2.2, 5.5], None, [0, 5.5]] type='3 * option[var * float64]'>

>>> scan - ak.firsts(scan) + ak.firsts(array)
<Array [[1.1, 3.3, 6.6], None, [4.4, 9.9]] type='3 * option[var * float64]'>

>>> ak.fill_none(scan - ak.firsts(scan) + ak.firsts(array), [])
<Array [[1.1, 3.3, 6.6], [], [4.4, 9.9]] type='3 * var * float64'>

Most of these have no equivalent in Awkward 0.x.
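
If you are on Awkward 0.x, one possible workaround is to do the same subtraction with plain NumPy on the offsets and content arrays. The sketch below is only that, a sketch: per_list_cumsum is a hypothetical name, and it assumes offsets starts at 0 and ends at len(content), as in the example above. Instead of subtracting each list's first scanned element, it subtracts the running total accumulated before each list begins, so empty lists need no special handling.

```python
import numpy as np

def per_list_cumsum(offsets, content):
    """Hypothetical helper: cumulative sum that restarts at each list boundary.

    Assumes offsets[0] == 0 and offsets[-1] == len(content).
    """
    offsets = np.asarray(offsets)
    content = np.asarray(content, dtype=float)

    # Running sum over the flat content (carries across list boundaries).
    flat = np.cumsum(content)

    # Running total reached just before each list starts (0 for the first list).
    padded = np.concatenate(([0.0], flat))
    before = padded[offsets[:-1]]

    # Repeat each list's starting total once per element; empty lists repeat 0 times.
    counts = np.diff(offsets)
    return flat - np.repeat(before, counts)

offsets = [0, 3, 3, 5]
content = [1.1, 2.2, 3.3, 4.4, 5.5]
result = per_list_cumsum(offsets, content)
```

With the example data, the offsets would come from array.offsets and the flat values from array.content, and the result could then be wrapped with awkward0.JaggedArray.fromoffsets as shown earlier. One caveat: subtracting large running totals can lose floating-point precision on very long arrays.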

answered 2020-12-07T16:11:58.207