
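I need to specify the dtype (data type) for sklearn's kernel density function inside a def block used with NVIDIA's RAPIDS cudf library. In Python 3.7 I can find the type information, but for some reason it is not recognized as a data type accepted by the RAPIDS def block. I have included my code and the error message below so that anyone can reproduce the error.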

Below is a typical implementation of the kernel density function:

from sklearn.neighbors import KernelDensity
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(X)
kde.score_samples(X)
     array([-0.41075698, -0.41075698, -0.41076071, -0.41075698, -0.41075698,
    -0.41076071])

type(kde)
     <class 'sklearn.neighbors.kde.KernelDensity'>

Here is the NVIDIA RAPIDS def block that I am using with sklearn's kernel density function:

import cudf, math
import numpy as np

df = cudf.DataFrame()
nelem = 10
df['in1'] = np.arange(nelem) * 1.5
df['in2'] = np.arange(nelem) * 1.45


#Define input columns for the kernel

in1 = df['in1']
in2 = df['in2']

def kernel(in1, in2, out1, out2, out3, out4, kwarg1, kwarg2):
    for i, (x, y) in enumerate(zip(in1, in2)):
        out1[i] = [math.tan(i) for i in x]
        out2[i] = np.array(out1[i].to_pandas())
        out3[i] = ((KernelDensity(kernel='gaussian', bandwidth=kwarg1).fit(out2[i])).score_samples(out2[i]))
        out4[i] = [i >= kwarg2 for i in out3[i]]

Results = cudf.DataFrame()
Results = df.apply_rows(kernel, incols=['in1','in2'], outcols=dict(out1='float', out2='float64', out3='float64', out4='float'), kwargs=dict(kwarg1=0.1, kwarg2=0.33))

Here is the error message (perhaps getting the correct dtype for x and out3 would resolve all of the errors):

 Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/cudf/dataframe/dataframe.py", line 2707, in apply_rows
self, func, incols, outcols, kwargs, cache_key=cache_key
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/cudf/utils/applyutils.py", line 64, in apply_rows return applyrows.run(df)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/cudf/utils/applyutils.py", line 128, in run self.launch_kernel(df, bound.args, **launch_params)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/cudf/utils/applyutils.py", line 152, in launch_kernel self.kernel[blkct, blksz](*args)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/cuda/compiler.py", line 806, in __call__ kernel = self.specialize(*args)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/cuda/compiler.py", line 817, in specialize kernel = self.compile(argtypes)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/cuda/compiler.py", line 833, in compile **self.targetoptions)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock return func(*args, **kwargs)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/cuda/compiler.py", line 62, in compile_kernel
cres = compile_cuda(pyfunc, types.void, args, debug=debug, inline=inline)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock, return func(*args, **kwargs)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/cuda/compiler.py", line 51, in compile_cuda, locals={})
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler.py", line 972, in compile_extra, return pipeline.compile_extra(func)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler.py", line 390, in compile_extra, return self._compile_bytecode()
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler.py", line 903, in _compile_bytecode, return self._compile_core()
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler.py", line 890, in _compile_core, res = pm.run(self.status)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock, return func(*args, **kwargs)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler.py", line 266, in run
raise patched_exception
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler.py", line 257, in run
stage()
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler.py", line 515, in stage_nopython_frontend self.locals)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler.py", line 1124, in type_inference_stage, infer.propagate()
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/typeinfer.py", line 927, in propagate, raise errors[0]
numba.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of Function(<numba.cuda.compiler.DeviceFunctionTemplate object at 0x7f2679e6f9e8>) with argument(s) of type(s): (array(float64, 1d, A), array(float64, 1d, A), array(float64, 1d, A), array(float64, 1d, A), array(float64, 1d, A), array(float64, 1d, A), float64, float64) * parameterized

In definition 0:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'x': cannot determine Numba type of <class 'numba.ir.UndefinedType'>

File "<stdin>", line 2:
<source missing, REPL/exec in use?>

raised from /anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/typeinfer.py:1254

In definition 1:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'x': cannot determine Numba type of <class 'numba.ir.UndefinedType'>

File "<stdin>", line 2:
<source missing, REPL/exec in use?>

raised from /anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/typeinfer.py:1254
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<numba.cuda.compiler.DeviceFunctionTemplate object at 0x7f2679e6f9e8>)
[2] During: typing of call at <string> (11)


 File "<string>", line 11:
 <source missing, REPL/exec in use?>

1 Answer


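The working code is below. Some of your lines are not compatible with cudf:

  1. Using i by itself instead of as an index does not work. It is always zero, so out1 is also all zeros.
  2. The classes in sklearn are not compatible with numba's nopython mode. The same applies to any library that numba does not specifically support. I am not aware of any library that offers kernel density estimation with numba support. NumPy is supported, but it has no kernel density estimation.
  3. df.apply_rows() does not allow a function to be applied across multiple rows, which is what computing a kernel density requires. You may need to use df.apply_chunks() instead.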

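To implement kernel density estimation, you need to:

  1. Use df.apply_chunks()
  2. Create a custom function that computes the kernel density. You can use parts of this code to write your function: KernelDensity source code
  3. Make the custom function able to apply the kernel to an np.array so that it computes the value for each window
  4. Set up the apply_chunks() call so that the chunks are rolling windows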
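
As a rough, CPU-only sketch of what the custom function in step 2 might compute, the block below evaluates a Gaussian kernel density with plain loops and the math module, which is the style of code numba can usually compile. The names gaussian_kde_log_density and kde_per_chunk are hypothetical and are not part of cudf, numba, or sklearn. It assumes non-overlapping chunks described by their starting row offsets, and it does not show the port to a numba kernel passed to df.apply_chunks().

```python
import math


def gaussian_kde_log_density(points, bandwidth):
    """Log-density of every point under a Gaussian KDE fitted to `points`.

    Hypothetical helper: `points` is a list of equal-length coordinate lists.
    """
    n = len(points)
    dim = len(points[0])
    # Log of the normalisation constant n * (2*pi*h^2)^(dim/2).
    log_norm = math.log(n) + 0.5 * dim * math.log(2.0 * math.pi * bandwidth ** 2)
    scores = []
    for p in points:
        total = 0.0
        for q in points:
            # Squared Euclidean distance between p and q.
            sq_dist = 0.0
            for a, b in zip(p, q):
                sq_dist += (a - b) ** 2
            total += math.exp(-0.5 * sq_dist / bandwidth ** 2)
        scores.append(math.log(total) - log_norm)
    return scores


def kde_per_chunk(points, chunk_starts, bandwidth):
    """Run the KDE independently on each chunk.

    Assumption: `chunk_starts` lists the first row index of every chunk,
    and a chunk ends where the next one starts.
    """
    bounds = list(chunk_starts) + [len(points)]
    out = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        out.extend(gaussian_kde_log_density(points[start:stop], bandwidth))
    return out


# Same sample data and bandwidth as the sklearn example in the question.
X = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
scores = gaussian_kde_log_density(X, 0.2)
chunked = kde_per_chunk(X, [0, 3], 0.2)
```

With the question's data and bandwidth=0.2, scores should agree with the score_samples output shown in the question to roughly six decimal places. The chunked variant fits a separate density per chunk, so its values differ from the whole-dataset scores.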

Code:

import cudf, math
import numpy as np

df = cudf.DataFrame()
nelem = 10
df['in1'] = np.arange(nelem) * 1.5
df['in2'] = np.arange(nelem) * 1.45


#Define input columns for the kernel

in1 = df['in1']
in2 = df['in2']

def kernel(in1, in2, out1, out2, out3, out4, kwarg1, kwarg2):
    for i, (x, y) in enumerate(zip(in1, in2)):
        out1[i] = math.tan(float(i)) 
        out2[i] = out1[i]
        out3[i] = 1 #((KernelDensity(kernel='gaussian', bandwidth=kwarg1).fit(out2[i])).score_samples(out2[i]))
        out4[i] = out3[i] >= kwarg2 

Results = cudf.DataFrame()
Results = df.apply_rows(kernel, incols=['in1','in2'], outcols=dict(out1=np.float64, out2=np.float64, out3=np.float64, out4=np.float64), kwargs=dict(kwarg1=0.1, kwarg2=0.33))
answered 2019-12-06T20:21:18.357