
I have some questions about Just-In-Time (JIT) compilation with CUDA.

I have implemented a library based on Expression Templates according to the paper

J.M. Cohen, "Processing Device Arrays with C++ Metaprogramming", GPU Computing Gems - Jade Edition

It seems to work fairly well. If I compare the computing time of the element-wise matrix operation

D_D=A_D*B_D-sin(C_D)+3.;

with that of a purposely developed CUDA kernel, I have the following results (in parentheses, the matrix size):

time [ms] hand-written kernel: 2.05 (1024x1024), 8.16 (2048x2048), 57.4 (4096x4096)

time [ms] LIBRARY: 2.07 (1024x1024), 8.17 (2048x2048), 57.4 (4096x4096)

The library seems to need approximately the same computing time as the hand-written kernel. I'm also using the C++11 keyword auto to evaluate expressions only when they are actually needed, as suggested in Expression templates: improving performance in evaluating expressions?. My first question is:

1. What further benefit (in terms of code optimization) would JIT provide to the library? Would JIT introduce any additional overhead due to runtime compilation?

It is known that a library based on Expression Templates cannot be packaged in a .dll; see for example http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/00edbe1d-4906-4d91-b710-825b503787e2. My second question is:

2. Would JIT help in hiding the implementation to a third-party user? If yes, how?

The CUDA SDK includes the ptxjit example, in which the PTX code is not loaded at runtime but defined at compile time. My third question is:

3. How should I implement JIT in my case? Are there examples of JIT using PTX loaded at runtime?
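What I have in mind is something like the following sketch using the CUDA driver API (the file name kernel.ptx and the kernel name "eval" are placeholders, error checking is omitted, and this needs a GPU and linking against the driver library, so it is only an outline of the idea):

```cpp
#include <cstdio>
#include <cuda.h>  // CUDA driver API; link with -lcuda

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // Read PTX produced offline with `nvcc --ptx` (file name is a placeholder).
    FILE* f = std::fopen("kernel.ptx", "rb");
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::rewind(f);
    char* ptx = new char[size + 1];
    std::fread(ptx, 1, size, f);
    ptx[size] = '\0';
    std::fclose(f);

    // The JIT step: the driver compiles the PTX for the device
    // that is actually present.
    CUmodule module;
    cuModuleLoadData(&module, ptx);
    CUfunction kernel;
    cuModuleGetFunction(&kernel, module, "eval");  // "eval" is a placeholder

    // Set up arguments and launch through the driver API.
    int n = 256;
    CUdeviceptr d_out;
    cuMemAlloc(&d_out, n * sizeof(float));
    void* args[] = { &d_out, &n };
    cuLaunchKernel(kernel, (n + 255) / 256, 1, 1, 256, 1, 1, 0, nullptr, args, nullptr);
    cuCtxSynchronize();

    cuMemFree(d_out);
    cuModuleUnload(module);
    cuCtxDestroy(ctx);
    delete[] ptx;
    return 0;
}
```

Is this the intended pattern, or is there a more complete SDK example for PTX loaded at runtime?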

Thank you very much for any help.

EDIT following talonmies' comment

The post Cuda kernel just-in-time (jit) compilation possible? states that

cuda code can be compiled to an intermediate format ptx code, which will then be jit-compiled to the actual device architecture machine code at runtime

A doubt I have is whether the above can be applied to an Expression Templates library. I know that, due to instantiation problems, CUDA/C++ template code cannot be compiled to PTX. But perhaps if I instantiate all the possible combinations of Type/Operators for Unary and Binary Expressions, at least part of the implementation can be compiled to PTX (and thus hidden from third-party users), which can in turn be JIT-compiled for the architecture at hand.
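The instantiation of all combinations could look like this sketch (Add, Mul, and binary_eval are hypothetical stand-ins for my library's expression types, not its actual names):

```cuda
// sketch.cu -- compiled offline with `nvcc --ptx sketch.cu`
// Hypothetical functor types standing in for the library's expression nodes.
struct Add { __device__ static float apply(float a, float b) { return a + b; } };
struct Mul { __device__ static float apply(float a, float b) { return a * b; } };

template <class Op>
__global__ void binary_eval(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = Op::apply(a[i], b[i]);
}

// Explicit instantiations: without these, no device code for the templates
// would appear in the PTX. One line per Type/Operator combination.
template __global__ void binary_eval<Add>(const float*, const float*, float*, int);
template __global__ void binary_eval<Mul>(const float*, const float*, float*, int);
```

The runtime side would then pick the right pre-instantiated kernel by its mangled name and JIT-compile the PTX as in the ptxjit example.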


2 Answers


I think you should look into OpenCL. It provides a JIT-like programming model for creating, compiling, and executing compute kernels on GPUs (all at run-time).

I took a similar expression-template-based approach in Boost.Compute, which allows the library to support C++ templates and generic algorithms by translating compile-time C++ expressions into OpenCL kernel code (a dialect of C).

answered 2013-04-14T00:37:58

VexCL started out as an expression templates library for OpenCL, but since v1.0 it also supports CUDA. What it does for CUDA is exactly JIT compilation of CUDA source code: the nvcc compiler is invoked behind the scenes, and the compiled PTX is stored in an offline cache and loaded on subsequent launches of the program. See the CUDA backend source code for how this is done; compiler.hpp should be of most interest to you.

answered 2014-01-25T10:05:06