“avx2”的相关标签问题_Stack Overflow中文网

0 投票

2 回答

2094 浏览

x86 - vextracti128 和 vextractf128 有什么区别？

vextracti128并vextractf128具有相同的功能、参数和返回值。另外一个是AVX指令集，另一个是AVX2。有什么区别？

2013-09-25T05:22:01.593

0 投票

2 回答

7002 浏览

x86 - 将 SSE/AVX 寄存器左移和右移 32 位，同时移入零

我想在移入零的同时向左或向右移动 SSE/AVX 寄存器的 32 位倍数。

让我更准确地了解我感兴趣的班次。对于 SSE，我想做以下四个 32 位浮点数的班次：

对于 AVX，我想换班做以下班次：

对于 SSE，我想出了以下代码

有没有更好的方法来使用 SSE 做到这一点？

对于 AVX，我提出了以下需要 AVX2 的代码（并且未经测试）。编辑（正如 Paul R 所解释的，此代码不起作用）。

如何使用 AVX 而不是 AVX2（例如使用_mm256_permuteor _mm256_shuffle`）做到最好？用 AVX2 有没有更好的方法来做到这一点？

编辑：

Paul R 告诉我，我的 AVX2 代码不起作用，而且 AVX 代码可能不值得。而对于 AVX2，我应该_mm256_permutevar8x32_ps与_mm256_and_ps. 我没有带有 AVX2 (Haswell) 的系统，所以这很难测试。

编辑：根据 Felix Wyss 的回答，我想出了一些 AVX 解决方案，其中 shift1_AVX 和 shift2_AVX 只需要 3 个内部函数，而 shift3_AVX 只需要一个内部函数。_mm256_permutef128Ps这是由于具有归零功能的事实。

shift1_AVX

shift2_AVX

shift3_AVX

x86 sse simd avx avx2

2013-10-22T11:27:08.247

0 投票

2 回答

1904 浏览

performance - Haswell memory access

I was experimenting with AVX -AVX2 instruction sets to see the performance of streaming on consecutive arrays. So I have below example, where I do basic memory read and store.

And after compiling with g++-4.9 -ggdb -march=core-avx2 -std=c++11 struct_of_arrays.cpp -O3 -o struct_of_arrays

I see quite good instruction per cycle performance and timings, for benchmark size 4000. However once I increase the benchmark size to 5000, I see instruction per cycle drops significantly and also latency jumps. Now my question is, although I can see that performance degradation seems to be related to L1 cache, I can not explain why this happens so suddenly.

To give more insight, if I run perf with Benchmark size 4000, and 5000

So my question is, why this impact is happening, considering haswell should be capable of delivering 2* 32 bytes to read, and 32 bytes store each cycle?

EDIT 1

I realized with this code gcc smartly eliminates accesses to the myData.a since it is set to 0. To avoid this I did another benchmark which is slightly different, where a is explicitly set.

Second example will have one array being read and other array being written. And this one produces following perf output for different sizes:

Again same pattern is seen as pointed out in the answer, with increasing data set size data does not fit in L1 anymore and L2 becomes bottleneck. What is also interesting is that prefetching does not seem to be helping and L1 misses increases considerably. Although, I would expect to see at least 50 percent hit rate considering each cache line brought into L1 for read will be a hit for the second access (64 byte cache line 32 byte is read with each iteration). However, once dataset is spilled over to L2 it seems L1 hit rate drops to 2%. Considering arrays are not really overlapping with L1 cache size this should not be because of cache conflicts. So this part still does not make sense to me.

performance x86 cpu-architecture avx2 intel-pmu

2013-10-27T18:08:22.337

0 投票

2 回答

753 浏览

assembly - 如何提取位于 AL 中定义的索引位置的字节

问题陈述：需要从ymm0寄存器中提取位于寄存器中某个位置的字节AL。

我的方法：（相当丑陋）：

有没有更优雅的方式来做到这一点？任何建议表示赞赏。挑战的症结在于VPEXTRB它只采用立即索引值，而不是CL（或AL）寄存器作为索引值

谢谢...

assembly x86-64 avx avx2

2013-11-17T14:55:39.410

0 投票

3 回答

6774 浏览

c++ - /arch:AVX 是否启用 AVX2？

是否/arch:AVX在 Visual Studio 2012 Update 4 上启用 AVX2（带有 256 位整数 SIMD 指令和一些新的 FP shuffle）？

思路：

是的，它启用了 AVX，因为 VS 没有提到 AVX2。但我认为VS可以做AVX2，因为我的内在工作。
不，这不是因为 AVX 和 AVX2 是独立的 CPU 功能
（Sandybridge 与 Haswell，或 Excavator/Zen 与 Bulldozer），
就像 SSE 和 SSE2 是独立的

AVX2

c++visual-c++visual-studio-2012 vectorization avx2

2013-11-23T23:11:40.343

0 投票

2 回答

1961 浏览

g++ - _mm256_loadu2_m128i 内在函数在 g++ 下不可用？

我正在尝试使用 AVX2 intrinsic _mm256_loadu2_m128i，但似乎 g++ 4.8.2 没有它。

有没有办法得到它？

g++intrinsics avx2

2013-12-15T06:26:03.050

0 投票

2 回答

4131 浏览

c++ - 在 Mavericks 上编译 AVX2 程序

我尝试在我的 Mac OS 10.9 上使用 gcc 版本 4.9.0 20131201 编译一个虚拟 AVX2 程序

我用这个命令编译

gcc -mavx -O0 测试.C

我得到这个错误

_mm256_max_epu8 使用always_inline属性。这是问题吗？

如果我用 O3 编译，我也会遇到问题。

我究竟做错了什么？

c++c gcc avx avx2

2013-12-24T10:50:38.583

0 投票

1 回答

3176 浏览

c - AVX2 中的 8 位移位操作，带零位移位

有没有办法重建_mm_slli_si128AVX2 中的指令以将__mm256i寄存器移动 x 字节？

_mm256_slli_si256似乎只是在 a[127:0] 和 a[255:128] 上执行两个_mm_slli_si128。

左移应该__m256i像这样工作：

我在线程中看到可以使用 32 位创建移位_mm256_permutevar8x32_ps。但我需要一个更通用的解决方案来移动 x 个字节。有人已经解决了这个问题吗？

c sse simd avx avx2

2013-12-25T17:00:04.763

0 投票

1 回答

4477 浏览

sse - 使用 Haswell 架构进行并行编程

我想学习使用 Intel 的 Haswell CPU 微架构进行并行编程。关于在 asm/C/C++/（任何其他语言）中使用 SIMD：SSE4.2、AVX2？。你能推荐书籍、教程、互联网资源、课程吗？

谢谢！

sse cpu-architecture avx avx2

2014-01-05T12:47:40.280

0 投票

1 回答

1887 浏览

c - 优化从 AVX2 寄存器中提取 64 位值

我尝试从 __m256i 寄存器中提取 64 位。我当前的提取功能示例：

以下代码将 4x8 位移动到寄存器位置 0-3，而不是提取 32 位。

代码工作正常，但速度很慢。看起来将 res1 和 res2 复制到 result_array 需要的时间最多。有没有办法优化它？

c sse avx avx2

2014-01-13T00:55:18.950

问题标签 [avx2]

Reference