Can CUDA compute the value of such a function fast enough?
Many cost functionals are expressed in the form of a sum of a certain number of terms. Examples are the:
Sphere function
Rosenbrock function
Styblinski-Tang function
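For reference, these have the standard forms (with n the dimension of x):
$f_{\mathrm{Sphere}}(\mathbf{x}) = \sum_{i=1}^{n} x_i^2$
$f_{\mathrm{Rosenbrock}}(\mathbf{x}) = \sum_{i=1}^{n-1} \left[ 100\,(x_{i+1} - x_i^2)^2 + (x_i - 1)^2 \right]$
$f_{\mathrm{Styblinski\text{-}Tang}}(\mathbf{x}) = \tfrac{1}{2} \sum_{i=1}^{n} \left( x_i^4 - 16\,x_i^2 + 5\,x_i \right)$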
In all those cases, the evaluation of the cost function can be performed by a reduction, or better, by a transformation followed by a reduction.
CUDA Thrust has thrust::transform_reduce, which can surely serve the scope, but of course you can set up your own transformation + reduction routines.
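As a minimal, self-contained sketch of the Thrust route, here is the Sphere function (chosen because each term depends on a single element; the functor name square is my own illustration, not part of the code further below):

#include <stdio.h>
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>

// --- Unary transformation: squares one element; summing the squares gives the Sphere function
struct square { __host__ __device__ float operator()(const float x) const { return x * x; } };

int main() {
    thrust::device_vector<float> d_x(1 << 20, 3.f);
    // --- Applies 'square' to each element and sums the results on the device
    float f = thrust::transform_reduce(d_x.begin(), d_x.end(), square(), 0.f, thrust::plus<float>());
    printf("Sphere function value = %f\n", f);
    return 0;
}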
Below, I'm providing an example of how the Rosenbrock function can be computed either using CUDA Thrust or a customized version of the reduction routine offered by the CUDA samples. In the latter case, a pointer to the __device__ transformation function is passed to the customized transform_reduce function if the EXTERNAL keyword is defined, or the transformation function is defined and compiled in the compilation unit of the customized routine.
Some performance results on a Kepler K20c card for the non-EXTERNAL case:
N = 90000 Thrust = 0.055ms Customized = 0.059ms
N = 900000 Thrust = 0.67ms Customized = 0.14ms
N = 9000000 Thrust = 0.85ms Customized = 0.87ms
Here is the code. For the timing functions, please refer to the OrangeOwlSolutions/Timing and OrangeOwlSolutions/CUDA_Utilities github projects.
Please note that separate compilation is required.
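A possible build line for the two translation units below (the output name test is arbitrary, and you will also need the sources from the Timing and CUDA_Utilities projects mentioned above; -rdc=true enables relocatable device code, i.e., separate compilation):

nvcc -rdc=true kernel.cu transform_reduce.cu -o test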
kernel.cu
// --- Requires separate compilation
#include <stdio.h>
#include <thrust/device_vector.h>
#include "transform_reduce.cuh"
#include "Utilities.cuh"
#include "TimingCPU.h"
#include "TimingGPU.cuh"
/***************************************/
/* COST FUNCTION - GPU/CUSTOMIZED CASE */
/***************************************/
// --- Transformation function
__device__ __forceinline__ float transformation(const float * __restrict__ x, const int i) { return (100.f * (x[i+1] - x[i] * x[i]) * (x[i+1] - x[i] * x[i]) + (x[i] - 1.f) * (x[i] - 1.f)) ; }
// --- Device-side function pointer
__device__ pointFunction_t dev_pfunc = transformation;
/***********************************/
/* COST FUNCTION - GPU/THRUST CASE */
/***********************************/
// --- Functor evaluating one Rosenbrock term on a tuple (x[i], x[i+1]) delivered by a zip iterator
struct CostFunctionStructGPU{
template <typename Tuple>
__host__ __device__ float operator()(Tuple a) {
float temp1 = (thrust::get<1>(a) - thrust::get<0>(a) * thrust::get<0>(a)); // --- x[i+1] - x[i]^2
float temp2 = (thrust::get<0>(a) - 1.f); // --- x[i] - 1
return 100.f * temp1 * temp1 + temp2 * temp2;
}
};
/********/
/* MAIN */
/********/
int main()
{
const int N = 90000000;
float *x = (float *)malloc(N * sizeof(float));
for (int i=0; i<N; i++) x[i] = 3.f;
float *d_x; gpuErrchk(cudaMalloc((void**)&d_x, N * sizeof(float)));
gpuErrchk(cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice));
/************************************************/
/* "CUSTOMIZED" DEVICE-SIDE TRANSFORM REDUCTION */
/************************************************/
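// --- First call is a warm-up; its result is overwritten by the timed call below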
float customizedDeviceResult = transform_reduce(d_x, N - 1, &dev_pfunc);
TimingGPU timerGPU;
timerGPU.StartCounter();
customizedDeviceResult = transform_reduce(d_x, N - 1, &dev_pfunc);
printf("Timing for 'customized', device-side transform reduction = %f\n", timerGPU.GetCounter());
printf("Result for 'customized', device-side transform reduction = %f\n", customizedDeviceResult);
printf("\n\n");
/************************************************/
/* THRUST-BASED DEVICE-SIDE TRANSFORM REDUCTION */
/************************************************/
thrust::device_vector<float> d_vec(N,3.f);
timerGPU.StartCounter();
float ThrustResult = thrust::transform_reduce(thrust::make_zip_iterator(thrust::make_tuple(d_vec.begin(), d_vec.begin() + 1)), thrust::make_zip_iterator(thrust::make_tuple(d_vec.begin() + N - 1, d_vec.begin() + N)), CostFunctionStructGPU(), 0.f, thrust::plus<float>());
printf("Timing for Thrust-based, device-side transform reduction = %f\n", timerGPU.GetCounter());
printf("Result for Thrust-based, device-side transform reduction = %f\n", ThrustResult);
printf("\n\n");
/*********************************/
/* HOST-SIDE TRANSFORM REDUCTION */
/*********************************/
// thrust::host_vector<float> h_vec(d_vec);
//sum_host = sum_host + transformation(thrust::raw_pointer_cast(h_vec.data()), i);
TimingCPU timerCPU;
timerCPU.StartCounter();
float sum_host = 0.f;
for (int i=0; i<N-1; i++) {
float temp = (100.f * (x[i+1] - x[i] * x[i]) * (x[i+1] - x[i] * x[i]) + (x[i] - 1.f) * (x[i] - 1.f));
sum_host = sum_host + temp;
//printf("%i %f %f\n", i, temp, sum_host);
}
printf("Timing for host-side transform reduction = %f\n", timerCPU.GetCounter());
printf("Result for host-side transform reduction = %f\n", sum_host);
printf("\n\n");
sum_host = 0.f;
float c = 0.f;
for (int i=0; i<N-1; i++) {
float temp = (100.f * (x[i+1] - x[i] * x[i]) * (x[i+1] - x[i] * x[i]) + (x[i] - 1.f) * (x[i] - 1.f)) - c;
float t = sum_host + temp;
c = (t - sum_host) - temp;
sum_host = t;
}
printf("Result for host-side transform reduction = %f\n", sum_host);
// cudaDeviceReset();
}
transform_reduce.cuh
#ifndef TRANSFORM_REDUCE_CUH
#define TRANSFORM_REDUCE_CUH
// --- Function pointer type
// --- Complete with your own favourite instantiations
typedef float(*pointFunction_t)(const float * __restrict__, const int);
template <class T> T transform_reduce(T *, unsigned int, pointFunction_t *);
#endif
transform_reduce.cu
#include <stdio.h>
#include "Utilities.cuh"
#include "transform_reduce.cuh"
#define BLOCKSIZE 512
#define warpSize 32
// --- Host-side function pointer
pointFunction_t h_pfunc;
// --- Uncomment if you want to apply CUDA error checking to the kernel launches
//#define DEBUG
//#define EXTERNAL
/*******************************************************/
/* CALCULATING THE NEXT POWER OF 2 OF A CERTAIN NUMBER */
/*******************************************************/
unsigned int nextPow2(unsigned int x)
{
--x;
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
return ++x;
}
/*************************************/
/* CHECK IF A NUMBER IS A POWER OF 2 */
/*************************************/
// --- Note: although x = 1 is a power of 2 (1 = 2^0), this routine returns 0 for x == 1.
bool isPow2(unsigned int x) {
if (x == 1) return 0;
else return ((x&(x-1))==0);
}
/***************************/
/* TRANSFORMATION FUNCTION */
/***************************/
template <class T>
__host__ __device__ __forceinline__ T transformation(const T * __restrict__ x, const int i) { return ((T)100 * (x[i+1] - x[i] * x[i]) * (x[i+1] - x[i] * x[i]) + (x[i] - (T)1) * (x[i] - (T)1)) ; }
/********************/
/* REDUCTION KERNEL */
/********************/
/*
This version adds multiple elements per thread sequentially. This reduces the overall
cost of the algorithm while keeping the work complexity O(n) and the step complexity O(log n).
(Brent's Theorem optimization)
Note, this kernel needs a minimum of 64*sizeof(T) bytes of shared memory.
In other words if blockSize <= 32, allocate 64*sizeof(T) bytes.
If blockSize > 32, allocate blockSize*sizeof(T) bytes.
*/
template <class T, unsigned int blockSize, bool nIsPow2>
__global__ void reductionKernel(T *g_idata, T *g_odata, unsigned int N, pointFunction_t pPointTransformation)
{
extern __shared__ T sdata[];
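// --- Note: extern __shared__ with a template-dependent type works here only because a single type (float)
// is instantiated; instantiating several types would require a SharedMemory<T> wrapper as in the CUDA samples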
unsigned int tid = threadIdx.x; // Local thread index
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x; // Global thread index - Fictitiously double the block dimension
unsigned int gridSize = blockSize*2*gridDim.x;
// --- Performs the first level of reduction in registers when reading from global memory on multiple elements per thread.
// More blocks will result in a larger gridSize and therefore fewer elements per thread
T mySum = 0;
while (i < N) {
#ifdef EXTERNAL
mySum += (*pPointTransformation)(g_idata, i);
#else
mySum += transformation(g_idata, i);
#endif
// --- Ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
if (nIsPow2 || i + blockSize < N)
#ifdef EXTERNAL
mySum += (*pPointTransformation)(g_idata, i+blockSize);
#else
mySum += transformation(g_idata, i+blockSize);
#endif
i += gridSize;
}
// --- Each thread puts its local sum into shared memory
sdata[tid] = mySum;
__syncthreads();
// --- Reduction in shared memory. Fully unrolled loop.
if ((blockSize >= 512) && (tid < 256)) sdata[tid] = mySum = mySum + sdata[tid + 256];
__syncthreads();
if ((blockSize >= 256) && (tid < 128)) sdata[tid] = mySum = mySum + sdata[tid + 128];
__syncthreads();
if ((blockSize >= 128) && (tid < 64)) sdata[tid] = mySum = mySum + sdata[tid + 64];
__syncthreads();
#if (__CUDA_ARCH__ >= 300 )
// --- Single warp reduction by shuffle operations
if ( tid < 32 )
{
// --- Last iteration removed from the for loop, but needed for shuffle reduction
mySum += sdata[tid + 32];
// --- Reduce final warp using shuffle
for (int offset = warpSize/2; offset > 0; offset /= 2) mySum += __shfl_down(mySum, offset);
// --- On CUDA 9.0 and later, __shfl_down is deprecated in favor of __shfl_down_sync(0xffffffff, mySum, offset)
//for (int offset=1; offset < warpSize; offset *= 2) mySum += __shfl_xor(mySum, offset);
}
#else
// --- Reduction within a single warp. Fully unrolled loop.
if ((blockSize >= 64) && (tid < 32)) sdata[tid] = mySum = mySum + sdata[tid + 32];
__syncthreads();
if ((blockSize >= 32) && (tid < 16)) sdata[tid] = mySum = mySum + sdata[tid + 16];
__syncthreads();
if ((blockSize >= 16) && (tid < 8)) sdata[tid] = mySum = mySum + sdata[tid + 8];
__syncthreads();
if ((blockSize >= 8) && (tid < 4)) sdata[tid] = mySum = mySum + sdata[tid + 4];
__syncthreads();
if ((blockSize >= 4) && (tid < 2)) sdata[tid] = mySum = mySum + sdata[tid + 2];
__syncthreads();
if ((blockSize >= 2) && ( tid < 1)) sdata[tid] = mySum = mySum + sdata[tid + 1];
__syncthreads();
#endif
// --- Write result for this block to global memory. At the end of the kernel, global memory will contain the results for the summations of
// individual blocks
if (tid == 0) g_odata[blockIdx.x] = mySum;
}
/******************************/
/* REDUCTION WRAPPER FUNCTION */
/******************************/
template <class T>
T transform_reduce_inner(T *g_idata, unsigned int N, pointFunction_t h_pfunc) {
// --- Reduction parameters
const int NumThreads = (N < BLOCKSIZE) ? nextPow2(N) : BLOCKSIZE;
const int NumBlocks = (N + NumThreads - 1) / NumThreads;
const int smemSize = (NumThreads <= 32) ? 2 * NumThreads * sizeof(T) : NumThreads * sizeof(T);
// --- Device memory space where storing the partial reduction results
T *g_odata; gpuErrchk(cudaMalloc((void**)&g_odata, NumBlocks * sizeof(T)));
if (isPow2(N)) {
switch (NumThreads) {
case 512: reductionKernel<T, 512, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 256: reductionKernel<T, 256, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 128: reductionKernel<T, 128, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 64: reductionKernel<T, 64, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 32: reductionKernel<T, 32, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 16: reductionKernel<T, 16, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 8: reductionKernel<T, 8, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 4: reductionKernel<T, 4, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 2: reductionKernel<T, 2, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 1: reductionKernel<T, 1, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
}
#ifdef DEBUG
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
#endif
}
else {
switch (NumThreads) {
case 512: reductionKernel<T, 512, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 256: reductionKernel<T, 256, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 128: reductionKernel<T, 128, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 64: reductionKernel<T, 64, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 32: reductionKernel<T, 32, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 16: reductionKernel<T, 16, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 8: reductionKernel<T, 8, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 4: reductionKernel<T, 4, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 2: reductionKernel<T, 2, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 1: reductionKernel<T, 1, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
}
#ifdef DEBUG
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
#endif
}
// --- The last part of the reduction, which would be expensive to perform on the device, is executed on the host
T *host_vector = (T *)malloc(NumBlocks * sizeof(T));
gpuErrchk(cudaMemcpy(host_vector, g_odata, NumBlocks * sizeof(T), cudaMemcpyDeviceToHost));
T sum_transformReduce = (T)0;
for (int i=0; i<NumBlocks; i++) sum_transformReduce = sum_transformReduce + host_vector[i];
return sum_transformReduce;
}
template <class T>
T transform_reduce(T *g_idata, unsigned int N, pointFunction_t *dev_pfunc) {
#ifdef EXTERNAL
gpuErrchk(cudaMemcpyFromSymbol(&h_pfunc, *dev_pfunc, sizeof(pointFunction_t)));
#endif
T customizedDeviceResult = transform_reduce_inner(g_idata, N, h_pfunc);
return customizedDeviceResult;
}
// --- Complete with your own favourite instantiations
template float transform_reduce(float *, unsigned int, pointFunction_t *);