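You're on the right track, but in-place manipulation of arrays on a GPU is tricky because you cannot guarantee the order in which different elements are updated.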
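Here is a very similar example. The ApplyColorSimplifierTiledHelper method contains an AMP-restricted parallel_for_each that calls SimplifyIndexTiled for each index in the 2D array. SimplifyIndexTiled computes a new value for each pixel in destFrame from the values of the pixels surrounding the corresponding pixel in srcFrame. Because every thread reads only from srcFrame and writes only to destFrame, no thread can read a pixel that another thread has already overwritten, which solves the race condition present in your code.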
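This code comes from the Codeplex site for the C++ AMP book. The Cartoonizer case study includes several examples of this kind of image processing problem implemented in C++ AMP: arrays, textures, tiled/untiled, and multi-GPU. The C++ AMP book discusses the implementation in detail.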
void ApplyColorSimplifierTiledHelper(const array<ArgbPackedPixel, 2>& srcFrame,
    array<ArgbPackedPixel, 2>& destFrame, UINT neighborWindow)
{
    const float_3 W(ImageUtils::W);

    assert(neighborWindow <= FrameProcessorAmp::MaxNeighborWindow);

    tiled_extent<FrameProcessorAmp::TileSize, FrameProcessorAmp::TileSize>
        computeDomain = GetTiledExtent(srcFrame.extent);
    parallel_for_each(computeDomain, [=, &srcFrame, &destFrame]
        (tiled_index<FrameProcessorAmp::TileSize, FrameProcessorAmp::TileSize> idx)
        restrict(amp)
    {
        SimplifyIndexTiled(srcFrame, destFrame, idx, neighborWindow, W);
    });
}

void SimplifyIndex(const array<ArgbPackedPixel, 2>& srcFrame,
    array<ArgbPackedPixel, 2>& destFrame, index<2> idx,
    UINT neighborWindow, const float_3& W) restrict(amp)
{
    const int shift = neighborWindow / 2;
    float sum = 0;
    float_3 partialSum;
    const float standardDeviation = 0.025f;
    const float k = -0.5f / (standardDeviation * standardDeviation);

    const int idxY = idx[0] + shift; // Corrected index for border offset.
    const int idxX = idx[1] + shift;
    const int y_start = idxY - shift;
    const int y_end = idxY + shift;
    const int x_start = idxX - shift;
    const int x_end = idxX + shift;

    RgbPixel orgClr = UnpackPixel(srcFrame(idxY, idxX));

    for (int y = y_start; y <= y_end; ++y)
        for (int x = x_start; x <= x_end; ++x)
        {
            if (x != idxX || y != idxY) // don't apply filter to the requested index, only to the neighbors
            {
                RgbPixel clr = UnpackPixel(srcFrame(y, x));
                float distance = ImageUtils::GetDistance(orgClr, clr, W);
                float value = concurrency::fast_math::pow(float(M_E), k * distance * distance);
                sum += value;
                partialSum.r += clr.r * value;
                partialSum.g += clr.g * value;
                partialSum.b += clr.b * value;
            }
        }

    RgbPixel newClr;
    newClr.r = static_cast<UINT>(clamp(partialSum.r / sum, 0.0f, 255.0f));
    newClr.g = static_cast<UINT>(clamp(partialSum.g / sum, 0.0f, 255.0f));
    newClr.b = static_cast<UINT>(clamp(partialSum.b / sum, 0.0f, 255.0f));
    destFrame(idxY, idxX) = PackPixel(newClr);
}
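
To see why separate source and destination buffers matter, here is a minimal CPU-only sketch in standard C++ (no C++ AMP). It is not part of the Cartoonizer sample: the function names SmoothDoubleBuffered and SmoothInPlace and the 3-point box filter are assumptions made purely for illustration. The in-place version takes a visiting order as a parameter to mimic the fact that a GPU gives no ordering guarantee.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the Cartoonizer sample): 3-point box smoothing
// that reads only from src and writes only to dest, like srcFrame/destFrame.
std::vector<float> SmoothDoubleBuffered(const std::vector<float>& src)
{
    std::vector<float> dest(src); // border elements are copied unchanged
    for (std::size_t i = 1; i + 1 < src.size(); ++i)
        dest[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0f;
    return dest;
}

// The same filter applied in place. Each update may read neighbors that were
// already overwritten, so the result depends on the order of the updates.
std::vector<float> SmoothInPlace(std::vector<float> data, bool forward)
{
    const std::size_t n = data.size();
    if (n < 3)
        return data; // nothing to smooth between the borders
    for (std::size_t step = 1; step + 1 < n; ++step)
    {
        // Visit interior elements left-to-right or right-to-left.
        const std::size_t i = forward ? step : n - 1 - step;
        data[i] = (data[i - 1] + data[i] + data[i + 1]) / 3.0f;
    }
    return data;
}
```

For the input {0, 0, 9, 0, 0}, SmoothDoubleBuffered returns {0, 3, 3, 3, 0} regardless of iteration order, while SmoothInPlace returns different arrays for the forward and backward orders. On a GPU the order is effectively arbitrary, which is why the in-place variant is unreliable there.
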
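The Cartoonizer code uses ArgbPackedPixel, which is simply a mechanism for packing 8-bit RGB values into an unsigned long, because C++ AMP does not support char. If your problem is small enough to fit in a texture, you may want to consider using one instead of an array: packing and unpacking are implemented in hardware on the GPU and are effectively "free", whereas here you have to pay for them with additional compute. There is also an example of that implementation on CodePlex.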
typedef unsigned long ArgbPackedPixel;

struct RgbPixel
{
    unsigned int r;
    unsigned int g;
    unsigned int b;
};

const int fixedAlpha = 0xFF;

inline ArgbPackedPixel PackPixel(const RgbPixel& rgb) restrict(amp)
{
    return (rgb.b | (rgb.g << 8) | (rgb.r << 16) | (fixedAlpha << 24));
}

inline RgbPixel UnpackPixel(const ArgbPackedPixel& packedArgb) restrict(amp)
{
    RgbPixel rgb;
    rgb.b = packedArgb & 0xFF;
    rgb.g = (packedArgb & 0xFF00) >> 8;
    rgb.r = (packedArgb & 0xFF0000) >> 16;
    return rgb;
}
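
The bit layout used by PackPixel and UnpackPixel can be checked on the CPU with a small standard C++ sketch. The names Rgb, PackArgb, and UnpackArgb below are hypothetical stand-ins rather than part of the sample, and the sketch uses std::uint32_t instead of unsigned long. Unlike the original, it also masks each channel to 8 bits, which the sample does not need because it clamps values before packing.

```cpp
#include <cstdint>

// Hypothetical CPU-side equivalent of RgbPixel.
struct Rgb
{
    std::uint32_t r;
    std::uint32_t g;
    std::uint32_t b;
};

const std::uint32_t kFixedAlpha = 0xFFu;

// Layout: bits 24-31 alpha, 16-23 red, 8-15 green, 0-7 blue.
inline std::uint32_t PackArgb(const Rgb& c)
{
    return (c.b & 0xFFu)
         | ((c.g & 0xFFu) << 8)
         | ((c.r & 0xFFu) << 16)
         | (kFixedAlpha << 24);
}

// Reverses PackArgb; the alpha byte is discarded.
inline Rgb UnpackArgb(std::uint32_t packed)
{
    Rgb c;
    c.b = packed & 0xFFu;
    c.g = (packed >> 8) & 0xFFu;
    c.r = (packed >> 16) & 0xFFu;
    return c;
}
```

Packing r = 0x12, g = 0x34, b = 0x56 gives 0xFF123456, and unpacking that value returns the same three channels.
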