java - Is it possible to send data from JCuda to GPU memory that is defined as a union?

Question

I have defined a new data type on the GPU side (CUDA) like this:

typedef union {
    int i;
    double d;
    long l;
    char s[16];
} data_unit;

data_unit *d_array;

In Java, we have an array whose element type is one of the types available in the union. Normally, if we have an array of type int, we can do the following in Java (JCuda):

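import static jcuda.driver.JCudaDriver.*;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUdeviceptr;

int data_size;
// The device pointer object must exist before cuMemAlloc can fill it in
CUdeviceptr d_array = new CUdeviceptr();
int[] h_array = new int[data_size];

cuMemAlloc(d_array, data_size * Sizeof.INT);
cuMemcpyHtoD(d_array, Pointer.to(h_array), data_size * Sizeof.INT);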

But what should we do if the array on the device has our union as its element type? (Assume h_array is still of type int.)

int data_size;
CUdeviceptr d_array;
int[] h_array = new int[data_size];

cuMemAlloc(d_array, data_size * Sizeof.?);
// Here we should have some type of alignment (?)
cuMemcpyHtoD(d_array, Pointer.to(h_array), data_size * Sizeof.?);

score 5 · Accepted Answer

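I think there is a fundamental misunderstanding of what a union is.

Let's think about it. What makes a union different from a struct? It can store different types of data at different times.

How does it accomplish this feat? Well, one could use some kind of separate variable to dynamically record the type or how much memory it takes up, but a union doesn't do that. It relies on the programmer knowing exactly which type they want to retrieve and when. So, if only the programmer knows the type at any given point in time, the only option left is to make sure that enough space is allocated for the union variable that it can always hold any of its types.

In fact, this is exactly what a union does (yes, I know the reference is for C/C++, but it applies to CUDA as well). What does this mean for you? It means that the size of your union array should be the size of its largest member times the number of elements, because the size of a union is the size of its largest member (rounded up to the union's alignment, which makes no difference here).

Let's look at your union and see how to figure this out.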

typedef union {
    int i;
    double d;
    long l;
    char s[16];
} data_unit;

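Your union has:

int i, which we will assume is 4 bytes
double d, which is 8 bytes
long l, which is confusing because it can be 4 or 8 bytes depending on the compiler/platform; we will assume 8 bytes for now
char s[16], which is easy: 16 bytes

So the largest number of bytes taken up by any member is your char s[16] variable, at 16 bytes. This means you need to change your code to: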
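The sizing rule above can be checked with a small host-side sketch. The class `UnionLayout`, the method `unionSize`, and the member alignments are assumptions made for illustration (typical 64-bit values, not something JCuda provides); the authoritative number is whatever `sizeof(data_unit)` reports in the CUDA code.

```java
// Hypothetical helper, not part of JCuda: estimates the size of a C union
// from the sizes and alignments of its members.
final class UnionLayout {

    /** sizes[i] and alignments[i] describe member i, both in bytes. */
    static int unionSize(int[] sizes, int[] alignments) {
        int maxSize = 0;
        int maxAlign = 1;
        for (int s : sizes) {
            maxSize = Math.max(maxSize, s);
        }
        for (int a : alignments) {
            maxAlign = Math.max(maxAlign, a);
        }
        // Round the largest member up to a multiple of the strictest alignment.
        return ((maxSize + maxAlign - 1) / maxAlign) * maxAlign;
    }

    public static void main(String[] args) {
        // int, double, long (assumed to be 8 bytes), char[16]
        int[] sizes = {4, 8, 8, 16};
        int[] alignments = {4, 8, 8, 1}; // assumed typical alignments
        System.out.println(unionSize(sizes, alignments)); // prints 16
    }
}
```

For this union the rounding step changes nothing, since 16 is already a multiple of 8. It would matter for something like a union of a double and a char[9], which would come out as 16 bytes rather than 9.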

int data_size;
int union_size = 16;
// The device pointer object must exist before cuMemAlloc can fill it in
CUdeviceptr d_array = new CUdeviceptr();
// Copying a plain int array to the device will not give the result you expect without over-allocating.
// If you just copy over integers, which occupy 4 bytes each, they will fill less space than the same number of unions.
// We need a "stride" here if we want the host data to line up with the unions on the device.
// union_size / Sizeof.INT = 4, so there will be 4 times as many ints: 4 for each union.
int[] h_array = new int[data_size * (union_size / Sizeof.INT)];


// Here we are not allocating by the size of an int, but by the size of our union.
cuMemAlloc(d_array, data_size * union_size);
// Again, we are copying data_size * union_size bytes
cuMemcpyHtoD(d_array, Pointer.to(h_array), data_size * union_size);

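Note

If you want to copy ints, this basically means that only every 4th int in h_array carries the actual value for its index; the three ints after it are padding.

int 0 is h_array[0], int 1 is h_array[4], int 2 is h_array[8], int n is h_array[n * 4], and so on.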
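A minimal sketch of that indexing, assuming a 16-byte union and a hypothetical helper named `spread` (not a JCuda function). It only builds the host-side array; the result would then be passed to cuMemcpyHtoD with a byte count of data_size * union_size.

```java
// Hypothetical helper, not part of JCuda: places each int at the start of
// its own union-sized slot and leaves the rest of the slot zeroed.
final class StridedInts {

    static int[] spread(int[] values, int unionSize) {
        int stride = unionSize / Integer.BYTES; // 16 / 4 = 4 ints per union
        int[] out = new int[values.length * stride]; // zero-initialized by Java
        for (int n = 0; n < values.length; n++) {
            out[n * stride] = values[n]; // int n lives at index n * 4
        }
        return out;
    }

    public static void main(String[] args) {
        int[] strided = spread(new int[]{7, 8, 9}, 16);
        System.out.println(strided.length); // prints 12
        System.out.println(strided[4]);     // prints 8
    }
}
```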

score 1 · Accepted Answer

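I did the alignment and padding with some dirty code. Also, it is important to watch out for byte-order differences between the Java side and the device. Java appears to store bytes in BIG_ENDIAN format (ByteBuffer defaults to big-endian), so I had to change it to LITTLE_ENDIAN to make this work. It took me 2 hours of debugging. Here is what it looks like now: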

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

int data_size;
int union_size = 16;
// Device Array
CUdeviceptr d_array = new CUdeviceptr();
// Host Array
int[] h_array = new int[data_size];
byte[] h_array_bytes = new byte[data_size * union_size];

// Data allocation on GPU memory
cuMemAlloc(d_array, data_size * union_size);

// Alignment and padding
byte[] tempBytes;

for(int i = 0; i < data_size; i++){
    // Serialize one int as 4 little-endian bytes (the method is putInt, not putInteger)
    tempBytes = ByteBuffer.allocate(Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN)
                .putInt(h_array[i]).array();
    int start = i * union_size;
    // Copy the 4 value bytes, then zero-fill the remaining 12 bytes of the union
    for(int j = start, k = 0; k < union_size; k++, j++){
        if(k < tempBytes.length){
            h_array_bytes[j] = tempBytes[k];
        } else {
            h_array_bytes[j] = 0;
        }
    }
}
// And then simply do the copy
cuMemcpyHtoD(d_array, Pointer.to(h_array_bytes), data_size * union_size);
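The same padding can probably be written more compactly with ByteBuffer's absolute putInt(index, value), which writes straight into the right slot of one large buffer. This is only a sketch of the host-side packing: the class `UnionPacker` and the method `packInts` are hypothetical names, and the resulting byte[] would be handed to cuMemcpyHtoD exactly as h_array_bytes is above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of JCuda: packs ints into union-sized,
// zero-padded, little-endian slots.
final class UnionPacker {

    static byte[] packInts(int[] values, int unionSize) {
        // A heap buffer from allocate() starts out zero-filled, so the
        // padding bytes of every slot are already 0.
        ByteBuffer buf = ByteBuffer.allocate(values.length * unionSize)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < values.length; i++) {
            // Absolute put: writes 4 bytes at the start of slot i
            buf.putInt(i * unionSize, values[i]);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] packed = packInts(new int[]{1, 0x01020304}, 16);
        System.out.println(packed.length); // prints 32
        System.out.println(packed[16]);    // prints 4 (lowest byte comes first)
    }
}
```

The other union members could be packed the same way with putDouble or putLong at the same offsets, assuming long is 8 bytes on the device.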
