c++ - Inline assembly + pointer management

Question

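I am very new to using inline assembly in C++ code. What I want to do is basically a kind of memcpy for buffers whose size is a multiple of 32 bytes.

In C++, the code would look like this: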

void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t& sz)
{

       assert((sz%32 == 0));

    for(const std::uint8_t* it = in; it != (in+sz);it+=32,out+=32)
    {
      const __m256i tmp = _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(it));
      _mm256_stream_si256(reinterpret_cast<__m256i*>(out),tmp);

    }            
}

I have already written a little inline assembly, but each time I knew the sizes of the input and output arrays in advance.

So I tried this:

void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t& sz)
{

     assert((sz%32 == 0));

    __asm__ volatile(

                "mov %1, %%eax \n"
                "mov $0, %%ebx \n"

                "L1: \n"

                "vmovntdqa (%[src],%%ebx), %%ymm0 \n"
                "vmovntdq  %%ymm0, (%[dst],%%ebx) \n"

                "add %%ebx, $32 \n"

                "cmp %%eax, %%ebx \n"
                "jz L1 \n"

                :[dst]"=r"(out)
                :[src]"r"(in),"m"(sz)
                :"memory"
                );

}

G++ tells me:

Error: unsupported instruction `mov'
Error: `(%rdi,%ebx)' is not a valid base/index expression
Error: `(%rdi,%ebx)' is not a valid base/index expression
Error: operand type mismatch for `add'

So I tried this:

void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t& sz)
{

     assert((sz%32 == 0));
__asm__ volatile(

            "mov %1, %%eax \n"
            "mov $0, %%ebx \n"

            "L1: \n"

            "vmovntdqa %%ebx(%[src]), %%ymm0 \n"
            "vmovntdq  %%ymm0, (%[dst],%%ebx) \n"

            "add %%ebx, $32 \n"

            "cmp %%eax, %%ebx \n"
            "jz L1 \n"

            :[dst]"=r"(out)
            :[src]"r"(in),"m"(sz)
            :"memory"
                );

}

G++ gives me:

Error: unsupported instruction `mov'
Error: junk `(%rdi)' after register
Error: `(%rdi,%ebx)' is not a valid base/index expression
Error: operand type mismatch for `add'

In each case I looked for a fix without success. I also tried this variant:

void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t& sz)
{

    __asm__ volatile (
          ".intel_syntax noprefix;"

          "mov eax, [SZ];"
          "mov ebx, 0;"

          "L1 : "

          "vmovntdqa ymm0, [src+ebx];"
          "vmovntdq [dst+ebx], ymm0;"

          "add ebx, 32 \n"

          "cmp ebx, eax \n"
          "jz L1 \n"
                ".att_syntax;"
          : [dst]"=r"(out)
          : [SZ]"m"(sz),[src]"r"(in)
          : "memory");



}

G++:

undefined reference to `SZ'
undefined reference to `src'
undefined reference to `dst'

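That message seems very common, but I don't know how to fix it in this case.

I am also aware that my attempts do not strictly match the C++ code I wrote.

I would like to understand what is wrong with my attempts, and how to get as close as possible to my C++ function.

Thanks in advance.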

score 2 · Accepted Answer

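Your first example is the closest to correct, and it has the following errors:

It uses 32-bit registers instead of 64-bit ones.
It modifies three registers that are not declared as outputs or clobbers.
EAX is loaded with the source address, not the size.
dst is declared as an output when it should be an input.
The operands of the add instruction are the wrong way round; in AT&T syntax the destination register comes last.
It uses a non-local label, which will fail if the asm statement is duplicated (for example by inlining).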
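
Two of the points above, AT&T operand order and local labels, can be seen in isolation in a much smaller loop. This is only a sketch: repeated_add is a hypothetical helper that is not part of the question, and it assumes GCC or Clang on x86-64 (other targets take the plain C++ branch).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: adds `step` to a running total `n` times.
inline std::uint64_t repeated_add(std::uint64_t n, std::uint64_t step)
{
    std::uint64_t total = 0;
#if defined(__x86_64__) && defined(__GNUC__)
    if (n == 0) return 0;            // the asm loop runs its body before testing
    __asm__ (
        "1: \n"                      // numeric local label, safe if the asm is duplicated
        "add %[step], %[total] \n"   // AT&T order: source first, destination last
        "dec %[n] \n"
        "jnz 1b \n"                  // 1b = nearest label "1" searching backwards
        : [total] "+r"(total), [n] "+r"(n)   // both are read and written
        : [step] "r"(step)
        : "cc");                     // add/dec modify the flags
#else
    total = n * step;                // portable fallback, same result
#endif
    return total;
}
```

Calling repeated_add(4, 32) goes round the loop four times. Because the label is numeric, the function can be inlined at several call sites without the assembler complaining about a duplicate symbol.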

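And the following performance issues:

The sz parameter is passed by reference. (This may also hurt optimization of the calling functions.)
It is then passed to the asm as a memory operand, which requires it to be written to memory.
It is then copied into another register.
Fixed registers are used instead of letting the compiler choose.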
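
To make the contrast concrete, here is a small sketch of the two styles side by side. Both function names are hypothetical and the arithmetic (adding 32 to a size) is just a stand-in for real work; it assumes GCC or Clang on x86-64 and falls back to plain C++ elsewhere.

```cpp
#include <cassert>
#include <cstddef>

// Style from the question: size by reference, "m" operand, hard-coded register.
inline std::size_t add32_by_ref(const std::size_t& sz)
{
#if defined(__x86_64__) && defined(__GNUC__)
    std::size_t result;
    __asm__ (
        "mov %[sz], %%rax \n"    // load from memory into a fixed register
        "add $32, %%rax \n"
        "mov %%rax, %[res] \n"   // extra copy into whatever register holds res
        : [res] "=r"(result)
        : [sz] "m"(sz)           // forces sz to live in memory
        : "rax", "cc");          // the fixed register must be declared as clobbered
    return result;
#else
    return sz + 32;
#endif
}

// Style suggested above: size by value, "r" operand, compiler picks the register.
inline std::size_t add32_by_value(std::size_t sz)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ ("add $32, %[sz] \n" : [sz] "+r"(sz) : : "cc");
    return sz;
#else
    return sz + 32;
#endif
}
```

Both return the same value, but the second version needs no memory round trip, no extra mov, and no register clobber, which leaves the register allocator free to do its job.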

Here is a fixed version, which is no faster than the equivalent C++ with intrinsics:

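#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Requires AVX2 (e.g. compile with -mavx2), 32-byte-aligned pointers,
// and sz a non-zero multiple of 32 (the loop body runs before the test).
void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t sz)
{
     std::size_t count = 0;
     __m256i temp;

     assert((sz%32 == 0));

    __asm__ volatile(

                "1: \n"

                "vmovntdqa (%[src],%[count]), %[temp] \n"
                "vmovntdq  %[temp], (%[dst],%[count]) \n"

                "add $32, %[count] \n"

                "cmp %[sz], %[count] \n"
                "jne 1b \n"   // keep looping until count reaches sz

                :[count]"+r"(count), [temp]"=x"(temp)
                :[dst]"r"(out), [src]"r"(in), [sz]"r"(sz)
                :"memory", "cc"
                );

}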
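
For comparison, the intrinsics equivalent could look roughly like the sketch below. The names my_memcpy_intrin and stream_copy_avx2 are hypothetical, and the runtime AVX2 check, the std::memcpy fallback and the trailing _mm_sfence are my own additions rather than part of the question. It assumes GCC or Clang, and like the asm version it needs 32-byte-aligned pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

// Compiled for AVX2 regardless of the command-line flags.
__attribute__((target("avx2")))
static void stream_copy_avx2(const std::uint8_t* in, std::uint8_t* out, std::size_t sz)
{
    for (std::size_t i = 0; i != sz; i += 32) {
        // Non-temporal load (vmovntdqa) followed by non-temporal store (vmovntdq).
        const __m256i tmp =
            _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(in + i));
        _mm256_stream_si256(reinterpret_cast<__m256i*>(out + i), tmp);
    }
    _mm_sfence();  // order the streaming stores before later stores
}
#endif

// Hypothetical wrapper: in and out must be 32-byte aligned, sz a multiple of 32.
void my_memcpy_intrin(const std::uint8_t* in, std::uint8_t* out, std::size_t sz)
{
    assert(sz % 32 == 0);
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2")) {
        stream_copy_avx2(in, out, sz);
        return;
    }
#endif
    std::memcpy(out, in, sz);  // fallback when AVX2 is unavailable
}
```

Unlike the asm loop, this version handles sz == 0 naturally because the condition is tested before the first iteration, and the compiler remains free to unroll or schedule the loop.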

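The source and destination parameters are in the opposite order from memcpy, which could cause confusion.

The Intel syntax version you added also fails to use the correct syntax for referring to operands (for example %[dst]).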
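
The undefined references come from writing src, dst and SZ as bare words: the assembler sees them as symbol names, not as the operands of the asm statement. One way to keep Intel-style operand order while still using %[name] is GCC's dialect alternatives, written as {att|intel} in the template. The sketch below uses a hypothetical add_offset function and assumes GCC or Clang on x86-64; by default the first (AT&T) alternative is emitted, and the second is chosen when compiling with -masm=intel.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns base + offset using a single add instruction.
inline std::uint64_t add_offset(std::uint64_t base, std::uint64_t offset)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ (
        // AT&T alternative: add src, dst   |   Intel alternative: add dst, src
        "add {%[off], %[base] | %[base], %[off]} \n"
        : [base] "+r"(base)      // read and written
        : [off] "r"(offset)
        : "cc");
    return base;
#else
    return base + offset;        // portable fallback
#endif
}
```

Because the operands are always referenced as %[base] and %[off], the compiler substitutes the registers it picked, and no symbol named base or off ever reaches the linker.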
