c++ - cuModuleGetFunction returns not found

Question

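I want to compile CUDA kernels with the nvrtc JIT compiler to improve the performance of my application (this costs more instruction fetches, but saves several array accesses).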

The functions look like this and are produced by my function generator (the generator itself is not important here):

extern "C" __device__ void GetSumOfBranches(double* branches, double* outSum)
{
    double sum = (branches[38])+(-branches[334])+(-branches[398])+(-branches[411]);
    *outSum = sum;
}

I compile the code above with the following function:

CUfunction* FunctionGenerator::CreateFunction(const char* programText)
{
        // When I comment this statement out, the PTX output changes.
        // What is the reason?!
        // Bug?
        std::string savedString = std::string(programText);


        nvrtcProgram prog;
        nvrtcCreateProgram(&prog, programText, "GetSumOfBranches.cu", 0, NULL, NULL);

        const char *opts[] = {"--gpu-architecture=compute_52", "--fmad=false"};
        nvrtcCompileProgram(prog, 2, opts);

        // Obtain compilation log from the program.
        size_t logSize;
        nvrtcGetProgramLogSize(prog, &logSize);
        char *log = new char[logSize];
        nvrtcGetProgramLog(prog, log);
        // Obtain PTX from the program.
        size_t ptxSize;
        nvrtcGetPTXSize(prog, &ptxSize);
        char *ptx = new char[ptxSize];
        nvrtcGetPTX(prog, ptx);

        printf("%s", ptx);

        CUdevice cuDevice;
        CUcontext context;
        CUmodule module;
        CUfunction* kernel;
        kernel = (CUfunction*)malloc(sizeof(CUfunction));
        cuInit(0);
        cuDeviceGet(&cuDevice, 0);
        cuCtxCreate(&context, 0, cuDevice);
        auto resultLoad = cuModuleLoadDataEx(&module, ptx, 0, 0, 0);
        auto resultGetF = cuModuleGetFunction(kernel, module, "GetSumOfBranches");
        return kernel;
}

Everything works, except that cuModuleGetFunction returns CUDA_ERROR_NOT_FOUND. That error means GetSumOfBranches could not be found in the PTX file.

However, the output of printf("%s", ptx); is this:

// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-19856038
// Cuda compilation tools, release 7.5, V7.5.17
// Based on LLVM 3.4svn
//

.version 4.3
.target sm_52
.address_size 64

    // .globl   GetSumOfBranches

.visible .func GetSumOfBranches(
    .param .b64 GetSumOfBranches_param_0,
    .param .b64 GetSumOfBranches_param_1
)
{
    .reg .f64   %fd<8>;
    .reg .b64   %rd<3>;


    ld.param.u64    %rd1, [GetSumOfBranches_param_0];
    ld.param.u64    %rd2, [GetSumOfBranches_param_1];
    ld.f64  %fd1, [%rd1+304];
    ld.f64  %fd2, [%rd1+2672];
    sub.rn.f64  %fd3, %fd1, %fd2;
    ld.f64  %fd4, [%rd1+3184];
    sub.rn.f64  %fd5, %fd3, %fd4;
    ld.f64  %fd6, [%rd1+3288];
    sub.rn.f64  %fd7, %fd5, %fd6;
    st.f64  [%rd2], %fd7;
    ret;
}

As far as I can tell, everything is fine and cuModuleGetFunction should be able to find GetSumOfBranches. Can you explain why it does not?

Second question

When I comment out std::string savedString = std::string(programText);, the PTX output is only:

// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-19856038
// Cuda compilation tools, release 7.5, V7.5.17
// Based on LLVM 3.4svn
//

.version 4.3
.target sm_52
.address_size 64

This is strange, because savedString is not used at all...

score 2 · Accepted Answer

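What you are trying to do is not supported. The host-side module management APIs and the device ELF format do not expose __device__ functions, only __global__ functions, which can be called through the kernel launch APIs.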
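
This distinction is visible in the PTX from the question: GetSumOfBranches is emitted as `.visible .func`, whereas a __global__ function is emitted with the `.entry` directive. As a rough diagnostic, the sketch below scans PTX text for an `.entry` declaration of a given name. The helper `HasKernelEntry` is hypothetical and not part of nvrtc or the driver API; it is a plain text search that does not skip PTX comments, so treat it as a debugging aid only.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption, not a CUDA API): returns true if the
// PTX text declares `name` with the .entry directive, i.e. as a kernel.
// Functions declared with .func (compiled from __device__) do not match.
bool HasKernelEntry(const std::string& ptx, const std::string& name)
{
    const std::string directive = ".entry";
    std::size_t pos = 0;
    while ((pos = ptx.find(directive, pos)) != std::string::npos) {
        pos += directive.size();
        // Skip the whitespace between the directive and the symbol name.
        std::size_t i = pos;
        while (i < ptx.size() &&
               std::isspace(static_cast<unsigned char>(ptx[i]))) {
            ++i;
        }
        if (i > pos && ptx.compare(i, name.size(), name) == 0) {
            // Require the name to end here so that a longer symbol with the
            // same prefix is not reported as a match.
            const std::size_t end = i + name.size();
            if (end == ptx.size() || ptx[end] == '(' ||
                std::isspace(static_cast<unsigned char>(ptx[end]))) {
                return true;
            }
        }
    }
    return false;
}
```

Called on the PTX printed in the question with the name "GetSumOfBranches", this returns false, which matches the CUDA_ERROR_NOT_FOUND result.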

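You can compile device functions ahead of time or at runtime and JIT-link them with kernels, and you can then retrieve those kernels and launch them. But that is all you can do.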
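
One simple way to get a retrievable kernel, assuming the generated source is available as text, is to append a small __global__ wrapper to it before handing it to nvrtc, so the device function and its entry point are compiled in the same program. The sketch below only builds the source string. The helper `AppendKernelWrapper` and the kernel name `GetSumOfBranchesKernel` are hypothetical names chosen for illustration; they do not appear in the original post.

```cpp
#include <string>

// Hypothetical helper (an assumption, not from the original post): appends a
// __global__ entry point that forwards to the generated __device__ function.
// The wrapper is declared extern "C" so its name is not mangled in the PTX.
std::string AppendKernelWrapper(const std::string& deviceSource)
{
    static const char* const wrapper =
        "\n"
        "extern \"C\" __global__ void GetSumOfBranchesKernel("
        "double* branches, double* outSum)\n"
        "{\n"
        "    GetSumOfBranches(branches, outSum);\n"
        "}\n";
    return deviceSource + wrapper;
}
```

The returned string would be passed to nvrtcCreateProgram in place of programText, and cuModuleGetFunction would then look up "GetSumOfBranchesKernel" instead of "GetSumOfBranches". Keep the std::string alive until nvrtcCreateProgram has returned, since c_str() points into it.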
