I'm trying to write a function similar to pandas' groupby().ngroup(). The difference is that I want the count to restart at 0 for each subgroup. So given the following data:
| EVENT_1 | EVENT_2 |
| ------- | ------- |
| 0 | 3 |
| 0 | 3 |
| 0 | 3 |
| 0 | 5 |
| 0 | 5 |
| 0 | 5 |
| 0 | 9 |
| 0 | 9 |
| 1 | 6 |
| 1 | 6 |
I want:
| EVENT_1 | EVENT_2 | EVENT_2A |
| ------- | ------- | -------- |
| 0 | 3 | 0 |
| 0 | 3 | 0 |
| 0 | 3 | 0 |
| 0 | 5 | 1 |
| 0 | 5 | 1 |
| 0 | 5 | 1 |
| 0 | 9 | 2 |
| 0 | 9 | 2 |
| 1 | 6 | 0 |
| 1 | 6 | 0 |
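The best way I can think of to do this is to run a groupby() on EVENT_1, get the unique values of EVENT_2 within each group, and then set EVENT_2A to the index of each row's value in that list of unique values. For example, in the EVENT_1 == 0 group the unique values are [3, 5, 9], so every row with EVENT_2 == 3 gets EVENT_2A = 0, every row with EVENT_2 == 5 gets 1, and every row with EVENT_2 == 9 gets 2.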
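To make the target behavior concrete, here is a plain-Python reference sketch of what I'm after, with no GPU involved. The helper name restart_group_ids is made up for illustration, and it assumes the rows are already sorted so that equal EVENT_2 values are adjacent within each EVENT_1 group (which holds for my data):

```python
def restart_group_ids(event_1, event_2):
    """Number the distinct EVENT_2 runs inside each EVENT_1 group, restarting at 0.

    Hypothetical helper for illustration only. Assumes equal EVENT_2 values
    are adjacent within each EVENT_1 group.
    """
    out = []
    counter = 0
    for i, (a, b) in enumerate(zip(event_1, event_2)):
        if i == 0 or a != event_1[i - 1]:
            # A new EVENT_1 group starts: restart the numbering.
            counter = 0
        elif b != event_2[i - 1]:
            # Same EVENT_1 group, but a new EVENT_2 value: next index.
            counter += 1
        out.append(counter)
    return out


event_1 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
event_2 = [3, 3, 3, 5, 5, 5, 9, 9, 6, 6]
print(restart_group_ids(event_1, event_2))
# [0, 0, 0, 1, 1, 1, 2, 2, 0, 0]
```

This is only meant to pin down the expected output; it is not the GPU version I'm trying to get working.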
The code I wrote is below. Note that EVENT_2 is always sorted within each EVENT_1 group, so finding the unique values like this in O(n) should work.
import cudf
from numba import cuda
import numpy as np

def count(EVENT_2, EVENT_2A):
    # Get unique values of EVENT_2
    uq = [EVENT_2[0]] + [x for i, x in enumerate(EVENT_2) if i > 0 and EVENT_2[i-1] != x]
    for i in range(cuda.threadIdx.x, len(EVENT_2), cuda.blockDim.x):
        # Get corresponding index for each value. This can probably be sped up by mapping
        # values to indices
        for j, v in enumerate(uq):
            if v == EVENT_2[i]:
                EVENT_2A[i] = j
                break

if __name__ == "__main__":
    data = {
        "EVENT_1": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
        "EVENT_2": [3, 3, 3, 5, 5, 5, 9, 9, 6, 6]
    }
    df = cudf.DataFrame(data)
    results = df.groupby(["EVENT_1"], method="cudf").apply_grouped(
        count,
        incols=["EVENT_2"],
        outcols={"EVENT_2A": np.int64}
    )
    print(results.sort_index())
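The problem is that using a list inside the user-defined function count() seems to cause an error. Numba says its JIT nopython compiler can handle list comprehensions, and indeed when I use the function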
from numba import jit

@jit(nopython=True)
def uq_sorted(my_list):
    return [my_list[0]] + [x for i, x in enumerate(my_list) if i > 0 and my_list[i-1] != x]
it works, albeit with a deprecation warning.
The error I get with cudf is
No implementation of function Function(<numba.cuda.compiler.DeviceFunctionTemplate object at 0x7f782a179fa0>) found for signature:
>>> count <CUDA device function>(array(int64, 1d, C), array(int64, 1d, C))
There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload in function 'count <CUDA device function>': File: ../../../../test.py: Line 11.
With argument(s): '(array(int64, 1d, C), array(int64, 1d, C))':
Rejected as the implementation raised a specific error:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Unknown attribute 'append' of type list(undefined)<iv=None>
File "test.py", line 12:
def count(EVENT_2, EVENT_2A):
uq = [EVENT_2[0]] + [x for i, x in enumerate(EVENT_2) if i > 0 and EVENT_2[i-1] != x]
^
During: typing of get attribute at test.py (12)
File "test.py", line 12:
def count(EVENT_2, EVENT_2A):
uq = [EVENT_2[0]] + [x for i, x in enumerate(EVENT_2) if i > 0 and EVENT_2[i-1] != x]
^
raised from /project/conda_env/lib/python3.8/site-packages/numba/core/typeinfer.py:1071
During: resolving callee type: Function(<numba.cuda.compiler.DeviceFunctionTemplate object at 0x7f782a179fa0>)
During: typing of call at <string> (10)
File "<string>", line 10:
<source missing, REPL/exec in use?>
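Is this related to numba's deprecation warning? Even when I set uq to a static list, I still get an error. Any solution to the list comprehension issue, or to my overall problem, is welcome. Thanks.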