I'm trying to use a CUDA `DevicePtr` (known as a `CUdeviceptr` in CUDA-land) returned from foreign code as an accelerate-llvm-ptx `Array`.

The code I've written below somewhat works:
import Data.Array.Accelerate
  (Acc, Array, DIM1, Z(Z), (:.)((:.)), use)
import qualified Data.Array.Accelerate as Acc
import Data.Array.Accelerate.Array.Data
  (GArrayData(AD_Float), unsafeIndexArrayData)
import Data.Array.Accelerate.Array.Sugar
  (Array(Array), fromElt, toElt)
import Data.Array.Accelerate.Array.Unique
  (UniqueArray, newUniqueArray)
import Data.Array.Accelerate.LLVM.PTX (run)
import Foreign.C.Types (CULLong(CULLong))
import Foreign.CUDA.Driver (DevicePtr(DevicePtr))
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Ptr (intPtrToPtr)

-- A foreign function that uses cuMemAlloc() and cuMemCpyHtoD() to
-- create data on the GPU. The CUdeviceptr (initialized by cuMemAlloc)
-- is returned from this function. It is a CULLong in Haskell.
--
-- The data on the GPU is just a list of the 10 floats
-- [0.0, 1.0, 2.0, ..., 8.0, 9.0]
foreign import ccall "mytest.h mytestcuda"
  cmyTestCuda :: IO CULLong

-- | Convert a 'CULLong' to a 'DevicePtr'.
--
-- A 'CULLong' is the type of a CUDA @CUdeviceptr@. This function
-- converts a raw 'CULLong' into a proper 'DevicePtr' that can be
-- used with the cuda Haskell package.
cullongToDevicePtr :: CULLong -> DevicePtr a
cullongToDevicePtr = DevicePtr . intPtrToPtr . fromIntegral

-- | This function calls 'cmyTestCuda' to get the 'DevicePtr', and
-- wraps that up in an accelerate 'Array'. It then uses this 'Array'
-- in an accelerate computation.
accelerateWithDataFromC :: IO ()
accelerateWithDataFromC = do
  res <- cmyTestCuda
  let DevicePtr ptrToXs = cullongToDevicePtr res
  foreignPtrToXs <- newForeignPtr_ ptrToXs
  uniqueArrayXs <- newUniqueArray foreignPtrToXs :: IO (UniqueArray Float)
  let arrayDataXs = AD_Float uniqueArrayXs :: GArrayData UniqueArray Float
  let shape = Z :. 10 :: DIM1
      xs = Array (fromElt shape) arrayDataXs :: Array DIM1 Float
      ys = Acc.fromList shape [0,2..18] :: Array DIM1 Float
      usedXs = use xs :: Acc (Array DIM1 Float)
      usedYs = use ys :: Acc (Array DIM1 Float)
      computation = Acc.zipWith (+) usedXs usedYs
      zs = run computation
  putStrLn $ "zs: " <> show zs
When I compile and run this program, it correctly prints the result:
zs: Vector (Z :. 10) [0.0,3.0,6.0,9.0,12.0,15.0,18.0,21.0,24.0,27.0]
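As an aside, the `cullongToDevicePtr` conversion above is a pure reinterpretation of the integer as an address, so no information is lost. A minimal, base-only sketch of that round trip (the address value is made up purely for illustration and is never dereferenced):

```haskell
import Foreign.C.Types (CULLong)
import Foreign.Ptr (Ptr, intPtrToPtr, ptrToIntPtr)

-- Reinterpret a raw device address (a CUdeviceptr on the C side) as a
-- Haskell 'Ptr'. This is the same conversion 'cullongToDevicePtr' does,
-- minus the 'DevicePtr' wrapper.
addrToPtr :: CULLong -> Ptr a
addrToPtr = intPtrToPtr . fromIntegral

main :: IO ()
main = do
  let rawAddr = 0x1234 :: CULLong   -- made-up address, for illustration only
      ptr     = addrToPtr rawAddr :: Ptr ()
  -- Round-tripping back to an integer recovers the original address,
  -- showing that the conversion itself is lossless.
  print (fromIntegral (ptrToIntPtr ptr) == rawAddr)
```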
However, from reading the accelerate and accelerate-llvm-ptx source code, it doesn't seem like this should work. In most cases, an accelerate `Array` carries a pointer to the array data in host memory, along with a `Unique` value that uniquely identifies the `Array`. When performing an `Acc` computation, accelerate loads the array data from host memory into GPU memory as needed, keeping track of which arrays are already on the GPU in a `HashMap` keyed by the `Unique`.

In the code above, I create an `Array` directly with a pointer to GPU data. This doesn't seem like it should work, yet it appears to in the code above.
However, some things do not work. For example, trying to print `xs` (my `Array` with a pointer to GPU data) fails with a segfault. This makes sense, because the `Show` instance for `Array` just tries to `peek` the data through the host pointer. That fails here, since it is not a host pointer but a GPU pointer:
-- Trying to print xs causes a segfault.
putStrLn $ "xs: " <> show xs
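One way to inspect such an array without segfaulting is to explicitly copy the device data back to the host first, using the cuda package's marshalling functions instead of the `Show` instance. A sketch, assuming a valid CUDA context and a pointer to `n` floats as in the code above (the helper name is mine, and this of course needs a GPU to run):

```haskell
import Foreign.CUDA.Driver (DevicePtr)
import qualified Foreign.CUDA.Driver as CUDA

-- Print the contents of a device array by copying its @n@ elements back
-- to host memory first (a device-to-host memcpy under the hood), rather
-- than letting 'show' peek through a device pointer.
printDeviceFloats :: Int -> DevicePtr Float -> IO ()
printDeviceFloats n dptr = do
  hostXs <- CUDA.peekListArray n dptr  -- device-to-host copy
  putStrLn $ "xs (copied to host): " <> show hostXs
```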
Is there a proper way to take a CUDA `DevicePtr` and use it directly as an accelerate `Array`?