c++ - Compiling a simple C++ program that uses SSE intrinsics

Question

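I am new to SSE instructions and am trying to learn them from this site: http://www.codeproject.com/Articles/4522/Introduction-to-SSE-Programming

I am using the GCC compiler on Ubuntu 10.10 with an Intel Core i7 960 CPU.

Here is the code I tried, based on the article.

For two arrays of length ARRAY_SIZE, it computes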

fResult[i] = sqrt( fSource1[i]*fSource1[i] + fSource2[i]*fSource2[i] ) + 0.5

Here is the code:

#include <iostream>
#include <iomanip>
#include <ctime>
#include <stdlib.h>
#include <xmmintrin.h> // Contain the SSE compiler intrinsics
#include <malloc.h>
void myssefunction(
          float* pArray1,                   // [in] first source array
          float* pArray2,                   // [in] second source array
          float* pResult,                   // [out] result array
          int nSize)                        // [in] size of all arrays
{
    int nLoop = nSize/ 4;

    __m128 m1, m2, m3, m4;

    __m128* pSrc1 = (__m128*) pArray1;
    __m128* pSrc2 = (__m128*) pArray2;
    __m128* pDest = (__m128*) pResult;


    __m128 m0_5 = _mm_set_ps1(0.5f);        // m0_5[0, 1, 2, 3] = 0.5

    for ( int i = 0; i < nLoop; i++ )
    {
        m1 = _mm_mul_ps(*pSrc1, *pSrc1);        // m1 = *pSrc1 * *pSrc1
        m2 = _mm_mul_ps(*pSrc2, *pSrc2);        // m2 = *pSrc2 * *pSrc2
        m3 = _mm_add_ps(m1, m2);                // m3 = m1 + m2
        m4 = _mm_sqrt_ps(m3);                   // m4 = sqrt(m3)
        *pDest = _mm_add_ps(m4, m0_5);          // *pDest = m4 + 0.5

        pSrc1++;
        pSrc2++;
        pDest++;
    }
}

int main(int argc, char *argv[])
{
  int ARRAY_SIZE = atoi(argv[1]);
  float* m_fArray1 = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);
  float* m_fArray2 = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);
  float* m_fArray3 = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);

  for (int i = 0; i < ARRAY_SIZE; ++i)
    {
      m_fArray1[i] = ((float)rand())/RAND_MAX;
      m_fArray2[i] = ((float)rand())/RAND_MAX;
    }

  myssefunction(m_fArray1 , m_fArray2 , m_fArray3, ARRAY_SIZE);

  _aligned_free(m_fArray1);
   _aligned_free(m_fArray2);
   _aligned_free(m_fArray3);

  return 0;
}

I get the following compilation errors:

[Programming/SSE]$ g++ -g -Wall -msse sseintro.cpp 
sseintro.cpp: In function ‘int main(int, char**)’:
sseintro.cpp:41: error: ‘_aligned_malloc’ was not declared in this scope
sseintro.cpp:53: error: ‘_aligned_free’ was not declared in this scope
[Programming/SSE]$

Where did I go wrong? Am I missing some header files? I seem to have included all the relevant ones.

score 16 · Accepted Answer

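_aligned_malloc and _aligned_free are Microsoft-isms. On Linux and similar systems, use posix_memalign or memalign. On Mac OS X you can just use malloc, since it always returns 16-byte-aligned memory. For portable SSE code you generally want to implement wrapper functions for aligned memory allocation, e.g.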

void * malloc_simd(const size_t size)
{
#if defined _WIN32          // Windows (_WIN32 is predefined by the compiler)
    return _aligned_malloc(size, 16);
#elif defined __linux__     // Linux
    return memalign(16, size);
#elif defined __MACH__      // Mac OS X
    return malloc(size);
#else                       // other (use valloc for page-aligned memory)
    return valloc(size);
#endif
}

The implementation of free_simd is left as an exercise for the reader.
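One possible take on that exercise is sketched below. It is not the answerer's code: it repeats malloc_simd so the block is self-contained, assumes the compiler-predefined macro _WIN32 rather than WIN32, and adds an assert as a sanity check. The key point is that only the Windows branch needs a special deallocator; memory from memalign, malloc, and valloc is released with plain free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#if defined _WIN32 || defined __linux__
#include <malloc.h>   // _aligned_malloc/_aligned_free (Windows), memalign (Linux)
#endif

// Allocate `size` bytes suitable for aligned SSE loads and stores.
void* malloc_simd(const std::size_t size)
{
    void* p;
#if defined _WIN32          // Windows
    p = _aligned_malloc(size, 16);
#elif defined __linux__     // Linux
    p = memalign(16, size);
#elif defined __MACH__      // Mac OS X: malloc is assumed to be 16-byte aligned
    p = std::malloc(size);
#else                       // other: page-aligned memory
    p = valloc(size);
#endif
    // Sanity check: every branch should return 16-byte-aligned memory (or null).
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
    return p;
}

// Sketch of the matching deallocator (an assumption, not from the answer).
void free_simd(void* p)
{
#if defined _WIN32
    _aligned_free(p);       // must be paired with _aligned_malloc
#else
    std::free(p);           // memalign, malloc and valloc memory is released with free
#endif
}
```

With these two wrappers, the question's main would call malloc_simd(ARRAY_SIZE * sizeof(float)) for each array and free_simd for each pointer before returning.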

score 1 · Accepted Answer

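Short answer: use _mm_malloc and _mm_free from xmmintrin.h instead of _aligned_malloc and _aligned_free.

Discussion

When writing SSE/AVX code, you should not use _aligned_malloc, _aligned_free, posix_memalign, memalign, or anything else of that kind. These are all compiler- or platform-specific functions (MSVC, GCC, or POSIX).

Intel introduced the functions _mm_malloc and _mm_free in the Intel compiler specifically for SIMD computation (see the reference). Other compilers that target the x86 architecture have added them as well (just as they regularly add Intel intrinsics). In that sense they are the only cross-platform solution: they should be available in every compiler that supports SSE.

These functions are declared in the xmmintrin.h header. The header for any later SSE/AVX version automatically includes the earlier ones, so including only smmintrin.h or emmintrin.h, for example, is enough.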
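As a minimal sketch of what this looks like in practice, the block below allocates a float buffer with _mm_malloc and writes to it with aligned stores. The helper names alloc_floats and fill_floats are hypothetical and exist only for illustration; _mm_malloc, _mm_free, _mm_set1_ps, and _mm_store_ps are the actual intrinsics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics plus _mm_malloc / _mm_free

// Hypothetical helper: allocate `count` floats on a 16-byte boundary.
// The result must be released with _mm_free, not free.
float* alloc_floats(std::size_t count)
{
    return static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
}

// Hypothetical helper: fill an aligned buffer four floats at a time.
// Only whole groups of 4 are written; a remainder is left untouched.
void fill_floats(float* p, std::size_t count, float value)
{
    // _mm_store_ps requires a 16-byte-aligned address.
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
    const __m128 v = _mm_set1_ps(value);      // v = {value, value, value, value}
    for (std::size_t i = 0; i + 4 <= count; i += 4)
        _mm_store_ps(p + i, v);               // aligned store of 4 floats
}
```

In the question's main, each _aligned_malloc(ARRAY_SIZE * sizeof(float), 16) call would become _mm_malloc(ARRAY_SIZE * sizeof(float), 16), and each _aligned_free(ptr) would become _mm_free(ptr).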

score 0 · Accepted Answer

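This does not directly answer your question, but I want to point out that your SSE code is written incorrectly, and I would be surprised if it worked. You need to use load/store operations on non-SSE types, including aligned ones such as your aligned float array (you need to do this even if you have a dynamic array of an SSE type). Keep in mind that when you work with SSE, the SSE data types are assumed to represent data in an SSE register, while everything else is usually in system memory or in a non-SSE register, so you need to load from memory into a register and store from a register back to memory. Your function should look like this: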

void myssefunction
(
    float* pArray1,                   // [in] first source array
    float* pArray2,                   // [in] second source array
    float* pResult,                   // [out] result array
    int nSize                         // [in] size of all arrays
)                                   
{
    const __m128 m0_5 = _mm_set_ps1(0.5f);        // m0_5[0, 1, 2, 3] = 0.5
    // Process whole groups of 4 floats; a remainder (nSize % 4) is not handled here.
    for (int index = 0; index + 4 <= nSize; index += 4)
    {
        __m128 pSrc1 = _mm_load_ps(pArray1 + index); // load 4 elements from 16-byte-aligned memory into an SSE register
        __m128 pSrc2 = _mm_load_ps(pArray2 + index); // load 4 elements from 16-byte-aligned memory into an SSE register

        __m128 m1   = _mm_mul_ps(pSrc1, pSrc1);        // m1 = pSrc1 * pSrc1
        __m128 m2   = _mm_mul_ps(pSrc2, pSrc2);        // m2 = pSrc2 * pSrc2
        __m128 m3   = _mm_add_ps(m1, m2);                // m3 = m1 + m2
        __m128 m4   = _mm_sqrt_ps(m3);                   // m4 = sqrt(m3)
        __m128 pDest  = _mm_add_ps(m4, m0_5);          // pDest = m4 + 0.5

        _mm_store_ps(pResult + index, pDest); // store 4 elements from the SSE register to 16-byte-aligned memory
    }
}

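It is also worth noting that there is a limit on the number of registers you can use at any given time (for SSE2, 16 XMM registers in 64-bit mode and 8 in 32-bit mode). You can write code that tries to use more than that, but doing so causes register spilling.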
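The function above relies on 16-byte-aligned pointers and on nSize being a multiple of 4. If either cannot be guaranteed, one option is sketched below: unaligned loads and stores (_mm_loadu_ps, _mm_storeu_ps) for the vector part and a scalar loop for the leftover elements. The name myssefunction_any is hypothetical, and unaligned accesses may be slower than aligned ones on some older CPUs, so treat this as an illustration rather than a drop-in replacement.

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h>

// Hypothetical variant of myssefunction that accepts any nSize >= 0
// and does not require the arrays to be 16-byte aligned.
void myssefunction_any(const float* pArray1, const float* pArray2,
                       float* pResult, int nSize)
{
    assert(nSize >= 0);
    const __m128 m0_5 = _mm_set_ps1(0.5f);              // {0.5, 0.5, 0.5, 0.5}

    int index = 0;
    // Vector part: whole groups of 4 floats.
    for (; index + 4 <= nSize; index += 4)
    {
        const __m128 a = _mm_loadu_ps(pArray1 + index); // unaligned load
        const __m128 b = _mm_loadu_ps(pArray2 + index); // unaligned load
        const __m128 sum = _mm_add_ps(_mm_mul_ps(a, a), _mm_mul_ps(b, b));
        _mm_storeu_ps(pResult + index,                  // unaligned store
                      _mm_add_ps(_mm_sqrt_ps(sum), m0_5));
    }
    // Scalar tail: the remaining nSize % 4 elements.
    for (; index < nSize; ++index)
    {
        const float a = pArray1[index];
        const float b = pArray2[index];
        pResult[index] = std::sqrt(a * a + b * b) + 0.5f;
    }
}
```

For example, with five elements where every pair is (3, 4), each result is sqrt(9 + 16) + 0.5 = 5.5; the first four come from the vector loop and the fifth from the scalar tail.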
