2

给定以下输入字节:

var vBytes = new Vector<byte>(new byte[] {72, 101, 55, 08, 108, 111, 55, 87, 111, 114, 108, 55, 100, 55, 55, 20});

和给定的面具:

var mask = new Vector<byte>(55);

如何找到55输入数组中的字节数?

我试过用以下方法vBytes异或mask

var xored = Vector.Xor(mask, vBytes);

这使:

<127、82、0、91、91、88、0、96、88、69、91、0、83、0、0、35>

但不知道我怎么能从中得到计数。

为简单起见,我们假设输入字节长度始终等于Vector<byte>.Count.

4

4 回答 4

4

(以下想法的 AVX2 C 内在函数实现,以防一个具体示例有帮助:如何使用 SIMD 计算字符出现次数

在 asm 中,您希望pcmpeqb生成一个 0 或 0xFF 的向量。被视为有符号整数,即 0/-1。

然后使用比较结果作为整数值psubb0 / 1 添加到该元素的计数器。(减 -1 = 加 +1)

这可能会在 256 次迭代后溢出,因此在此之前的某个时间,使用psadbw反对_mm_setzero_si128()将那些无符号字节(没有溢出)水平求和为 64 位整数(每组 8 个字节一个 64 位整数)。然后paddq累加 64 位总数。

在溢出之前累积可以通过嵌套循环来完成,或者只是在常规展开循环的末尾。 psadbw速度很快(因为它是视频编码运动搜索的关键构建块),因此每 4 次比较或什至每 1 次比较累加并跳过psubb.

有关 x86 的更多详细信息,请参阅Agner Fog 的优化指南。根据他的指令表,psadbw xmm/vpsadbw ymm在 Skylake 上以每个时钟周期运行 1 个向量,具有 3 个周期延迟。(前端带宽只有 1 uop。)上面提到的所有指令也是单 uop,并且运行在多个端口上(所以在吞吐量上不一定相互冲突)。他们的 128 位版本只需要 SSE2。


如果您真的一次只有一个向量要计数,并且没有在内存中循环,那么可能pcmpeqb// psadbwpshufd将高半复制到低)paddd//movd eax, xmm0给您 255 * 整数寄存器中的匹配数。一个额外的向量指令(如从零减去,或与 1 的与,或pabsb(绝对值)将删除 x255 比例因子。


IDK 如何在 C# SIMD 中编写它,但您绝对不想要点积!解包并转换为 FP 会比上面的慢 4 倍,这只是因为固定宽度的向量比浮点数多 4 倍,而且dpps( _mm_dp_ps)并不快。Skylake 上每 1.5 个周期吞吐量 4 微指令和 1 个微指令。如果您确实必须对无符号字节以外的内容进行水平求和,请参阅进行水平 SSE 向量求和(或其他减少)的最快方法(我的答案还包括整数)。

或者,如果Vector.Dotpmaddubsw/pmaddwd用于整数向量,那么这可能没有那么糟糕,但是与比较结果的每个向量进行多步水平求和相比psadbw,或者特别是对于你偶尔只进行水平求和的字节累加器来说是糟糕的。

或者,如果 C# 优化了任何与1. 无论如何,这个答案的第一部分是您希望 CPU 运行的代码。让它发生,但是你喜欢使用任何源代码让它发生。

于 2018-03-30T04:57:16.883 回答
4

我知道我迟到了,但到目前为止,这里的答案都没有真正提供完整的解决方案。这是我的最佳尝试,源自此 GistDotNet 源代码。所有功劳都归功于 DotNet 团队和社区成员(尤其是@Peter Cordes)。

用法:

var bytes = Encoding.ASCII.GetBytes("The quick brown fox jumps over the lazy dog.");
var byteCount = bytes.OccurrencesOf(32);

var chars = "The quick brown fox jumps over the lazy dog.";
var charCount = chars.OccurrencesOf(' ');

代码:

public static class VectorExtensions
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static nuint GetByteVector128SpanLength(nuint offset, int length) =>
        ((nuint)(uint)((length - (int)offset) & ~(Vector128<byte>.Count - 1)));
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static nuint GetByteVector256SpanLength(nuint offset, int length) =>
        ((nuint)(uint)((length - (int)offset) & ~(Vector256<byte>.Count - 1)));
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static nint GetCharVector128SpanLength(nint offset, nint length) =>
        ((length - offset) & ~(Vector128<ushort>.Count - 1));
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static nint GetCharVector256SpanLength(nint offset, nint length) =>
        ((length - offset) & ~(Vector256<ushort>.Count - 1));
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector128<byte> LoadVector128(ref byte start, nuint offset) =>
        Unsafe.ReadUnaligned<Vector128<byte>>(ref Unsafe.AddByteOffset(ref start, offset));
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector256<byte> LoadVector256(ref byte start, nuint offset) =>
        Unsafe.ReadUnaligned<Vector256<byte>>(ref Unsafe.AddByteOffset(ref start, offset));
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector128<ushort> LoadVector128(ref char start, nint offset) =>
        Unsafe.ReadUnaligned<Vector128<ushort>>(ref Unsafe.As<char, byte>(ref Unsafe.Add(ref start, offset)));
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector256<ushort> LoadVector256(ref char start, nint offset) =>
        Unsafe.ReadUnaligned<Vector256<ushort>>(ref Unsafe.As<char, byte>(ref Unsafe.Add(ref start, offset)));
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    private static unsafe int OccurrencesOf(ref byte searchSpace, byte value, int length) {
        var lengthToExamine = ((nuint)length);
        var offset = ((nuint)0);
        var result = 0L;

        if (Sse2.IsSupported || Avx2.IsSupported) {
            if (31 < length) {
                lengthToExamine = UnalignedCountVector128(ref searchSpace);
            }
        }

    SequentialScan:
        while (7 < lengthToExamine) {
            ref byte current = ref Unsafe.AddByteOffset(ref searchSpace, offset);

            if (value == current) {
                ++result;
            }
            if (value == Unsafe.AddByteOffset(ref current, 1)) {
                ++result;
            }
            if (value == Unsafe.AddByteOffset(ref current, 2)) {
                ++result;
            }
            if (value == Unsafe.AddByteOffset(ref current, 3)) {
                ++result;
            }
            if (value == Unsafe.AddByteOffset(ref current, 4)) {
                ++result;
            }
            if (value == Unsafe.AddByteOffset(ref current, 5)) {
                ++result;
            }
            if (value == Unsafe.AddByteOffset(ref current, 6)) {
                ++result;
            }
            if (value == Unsafe.AddByteOffset(ref current, 7)) {
                ++result;
            }

            lengthToExamine -= 8;
            offset += 8;
        }

        while (3 < lengthToExamine) {
            ref byte current = ref Unsafe.AddByteOffset(ref searchSpace, offset);

            if (value == current) {
                ++result;
            }
            if (value == Unsafe.AddByteOffset(ref current, 1)) {
                ++result;
            }
            if (value == Unsafe.AddByteOffset(ref current, 2)) {
                ++result;
            }
            if (value == Unsafe.AddByteOffset(ref current, 3)) {
                ++result;
            }

            lengthToExamine -= 4;
            offset += 4;
        }

        while (0 < lengthToExamine) {
            if (value == Unsafe.AddByteOffset(ref searchSpace, offset)) {
                ++result;
            }

            --lengthToExamine;
            ++offset;
        }

        if (offset < ((nuint)(uint)length)) {
            if (Avx2.IsSupported) {
                if (0 != (((nuint)(uint)Unsafe.AsPointer(ref searchSpace) + offset) & (nuint)(Vector256<byte>.Count - 1))) {
                    var sum = Sse2.SumAbsoluteDifferences(Sse2.Subtract(Vector128<byte>.Zero, Sse2.CompareEqual(Vector128.Create(value), LoadVector128(ref searchSpace, offset))).AsByte(), Vector128<byte>.Zero).AsInt64();

                    offset += 16;
                    result += (sum.GetElement(0) + sum.GetElement(1));
                }

                lengthToExamine = GetByteVector256SpanLength(offset, length);

                var searchMask = Vector256.Create(value);

                if (127 < lengthToExamine) {
                    var sum = Vector256<long>.Zero;

                    do {
                        var accumulator0 = Vector256<byte>.Zero;
                        var accumulator1 = Vector256<byte>.Zero;
                        var accumulator2 = Vector256<byte>.Zero;
                        var accumulator3 = Vector256<byte>.Zero;
                        var loopIndex = ((nuint)0);
                        var loopLimit = Math.Min(255, (lengthToExamine / 128));

                        do {
                            accumulator0 = Avx2.Subtract(accumulator0, Avx2.CompareEqual(searchMask, LoadVector256(ref searchSpace, offset)));
                            accumulator1 = Avx2.Subtract(accumulator1, Avx2.CompareEqual(searchMask, LoadVector256(ref searchSpace, (offset + 32))));
                            accumulator2 = Avx2.Subtract(accumulator2, Avx2.CompareEqual(searchMask, LoadVector256(ref searchSpace, (offset + 64))));
                            accumulator3 = Avx2.Subtract(accumulator3, Avx2.CompareEqual(searchMask, LoadVector256(ref searchSpace, (offset + 96))));
                            loopIndex++;
                            offset += 128;
                        } while (loopIndex < loopLimit);

                        lengthToExamine -= (128 * loopLimit);
                        sum = Avx2.Add(sum, Avx2.SumAbsoluteDifferences(accumulator0.AsByte(), Vector256<byte>.Zero).AsInt64());
                        sum = Avx2.Add(sum, Avx2.SumAbsoluteDifferences(accumulator1.AsByte(), Vector256<byte>.Zero).AsInt64());
                        sum = Avx2.Add(sum, Avx2.SumAbsoluteDifferences(accumulator2.AsByte(), Vector256<byte>.Zero).AsInt64());
                        sum = Avx2.Add(sum, Avx2.SumAbsoluteDifferences(accumulator3.AsByte(), Vector256<byte>.Zero).AsInt64());
                    } while (127 < lengthToExamine);

                    var sumX = Avx2.ExtractVector128(sum, 0);
                    var sumY = Avx2.ExtractVector128(sum, 1);
                    var sumZ = Sse2.Add(sumX, sumY);

                    result += (sumZ.GetElement(0) + sumZ.GetElement(1));
                }

                if (31 < lengthToExamine) {
                    var sum = Vector256<long>.Zero;

                    do {
                        sum = Avx2.Add(sum, Avx2.SumAbsoluteDifferences(Avx2.Subtract(Vector256<byte>.Zero, Avx2.CompareEqual(searchMask, LoadVector256(ref searchSpace, offset))).AsByte(), Vector256<byte>.Zero).AsInt64());
                        lengthToExamine -= 32;
                        offset += 32;
                    } while (31 < lengthToExamine);

                    var sumX = Avx2.ExtractVector128(sum, 0);
                    var sumY = Avx2.ExtractVector128(sum, 1);
                    var sumZ = Sse2.Add(sumX, sumY);

                    result += (sumZ.GetElement(0) + sumZ.GetElement(1));
                }

                if (offset < ((nuint)(uint)length)) {
                    lengthToExamine = (((nuint)(uint)length) - offset);

                    goto SequentialScan;
                }
            }
            else if (Sse2.IsSupported) {
                lengthToExamine = GetByteVector128SpanLength(offset, length);

                var searchMask = Vector128.Create(value);

                if (63 < lengthToExamine) {
                    var sum = Vector128<long>.Zero;

                    do {
                        var accumulator0 = Vector128<byte>.Zero;
                        var accumulator1 = Vector128<byte>.Zero;
                        var accumulator2 = Vector128<byte>.Zero;
                        var accumulator3 = Vector128<byte>.Zero;
                        var loopIndex = ((nuint)0);
                        var loopLimit = Math.Min(255, (lengthToExamine / 64));

                        do {
                            accumulator0 = Sse2.Subtract(accumulator0, Sse2.CompareEqual(searchMask, LoadVector128(ref searchSpace, offset)));
                            accumulator1 = Sse2.Subtract(accumulator1, Sse2.CompareEqual(searchMask, LoadVector128(ref searchSpace, (offset + 16))));
                            accumulator2 = Sse2.Subtract(accumulator2, Sse2.CompareEqual(searchMask, LoadVector128(ref searchSpace, (offset + 32))));
                            accumulator3 = Sse2.Subtract(accumulator3, Sse2.CompareEqual(searchMask, LoadVector128(ref searchSpace, (offset + 48))));
                            loopIndex++;
                            offset += 64;
                        } while (loopIndex < loopLimit);

                        lengthToExamine -= (64 * loopLimit);
                        sum = Sse2.Add(sum, Sse2.SumAbsoluteDifferences(accumulator0.AsByte(), Vector128<byte>.Zero).AsInt64());
                        sum = Sse2.Add(sum, Sse2.SumAbsoluteDifferences(accumulator1.AsByte(), Vector128<byte>.Zero).AsInt64());
                        sum = Sse2.Add(sum, Sse2.SumAbsoluteDifferences(accumulator2.AsByte(), Vector128<byte>.Zero).AsInt64());
                        sum = Sse2.Add(sum, Sse2.SumAbsoluteDifferences(accumulator3.AsByte(), Vector128<byte>.Zero).AsInt64());
                    } while (63 < lengthToExamine);

                    result += (sum.GetElement(0) + sum.GetElement(1));
                }

                if (15 < lengthToExamine) {
                    var sum = Vector128<long>.Zero;

                    do {
                        sum = Sse2.Add(sum, Sse2.SumAbsoluteDifferences(Sse2.Subtract(Vector128<byte>.Zero, Sse2.CompareEqual(searchMask, LoadVector128(ref searchSpace, offset))).AsByte(), Vector128<byte>.Zero).AsInt64());
                        lengthToExamine -= 16;
                        offset += 16;
                    } while (15 < lengthToExamine);

                    result += (sum.GetElement(0) + sum.GetElement(1));
                }

                if (offset < ((nuint)(uint)length)) {
                    lengthToExamine = (((nuint)(uint)length) - offset);

                    goto SequentialScan;
                }
            }
        }

        return ((int)result);
    }
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    private static unsafe int OccurrencesOf(ref char searchSpace, char value, int length) {
        var lengthToExamine = ((nint)length);
        var offset = ((nint)0);
        var result = 0L;

        if (0 != ((int)Unsafe.AsPointer(ref searchSpace) & 1)) { }
        else if (Sse2.IsSupported || Avx2.IsSupported) {
            if (15 < length) {
                lengthToExamine = UnalignedCountVector128(ref searchSpace);
            }
        }

    SequentialScan:
        while (3 < lengthToExamine) {
            ref char current = ref Unsafe.Add(ref searchSpace, offset);

            if (value == current) {
                ++result;
            }
            if (value == Unsafe.Add(ref current, 1)) {
                ++result;
            }
            if (value == Unsafe.Add(ref current, 2)) {
                ++result;
            }
            if (value == Unsafe.Add(ref current, 3)) {
                ++result;
            }

            lengthToExamine -= 4;
            offset += 4;
        }

        while (0 < lengthToExamine) {
            if (value == Unsafe.Add(ref searchSpace, offset)) {
                ++result;
            }

            --lengthToExamine;
            ++offset;
        }

        if (offset < length) {
            if (Avx2.IsSupported) {
                if (0 != (((nint)Unsafe.AsPointer(ref Unsafe.Add(ref searchSpace, offset))) & (Vector256<byte>.Count - 1))) {
                    var sum = Sse2.SumAbsoluteDifferences(Sse2.Subtract(Vector128<ushort>.Zero, Sse2.CompareEqual(Vector128.Create(value), LoadVector128(ref searchSpace, offset))).AsByte(), Vector128<byte>.Zero).AsInt64();

                    offset += 8;
                    result += (sum.GetElement(0) + sum.GetElement(1));
                }

                lengthToExamine = GetCharVector256SpanLength(offset, length);

                var searchMask = Vector256.Create(value);

                if (63 < lengthToExamine) {
                    var sum = Vector256<long>.Zero;

                    do {
                        var accumulator0 = Vector256<ushort>.Zero;
                        var accumulator1 = Vector256<ushort>.Zero;
                        var accumulator2 = Vector256<ushort>.Zero;
                        var accumulator3 = Vector256<ushort>.Zero;
                        var loopIndex = 0;
                        var loopLimit = Math.Min(255, (lengthToExamine / 64));

                        do {
                            accumulator0 = Avx2.Subtract(accumulator0, Avx2.CompareEqual(searchMask, LoadVector256(ref searchSpace, offset)));
                            accumulator1 = Avx2.Subtract(accumulator1, Avx2.CompareEqual(searchMask, LoadVector256(ref searchSpace, (offset + 16))));
                            accumulator2 = Avx2.Subtract(accumulator2, Avx2.CompareEqual(searchMask, LoadVector256(ref searchSpace, (offset + 32))));
                            accumulator3 = Avx2.Subtract(accumulator3, Avx2.CompareEqual(searchMask, LoadVector256(ref searchSpace, (offset + 48))));
                            loopIndex++;
                            offset += 64;
                        } while (loopIndex < loopLimit);

                        lengthToExamine -= (64 * loopLimit);
                        sum = Avx2.Add(sum, Avx2.SumAbsoluteDifferences(accumulator0.AsByte(), Vector256<byte>.Zero).AsInt64());
                        sum = Avx2.Add(sum, Avx2.SumAbsoluteDifferences(accumulator1.AsByte(), Vector256<byte>.Zero).AsInt64());
                        sum = Avx2.Add(sum, Avx2.SumAbsoluteDifferences(accumulator2.AsByte(), Vector256<byte>.Zero).AsInt64());
                        sum = Avx2.Add(sum, Avx2.SumAbsoluteDifferences(accumulator3.AsByte(), Vector256<byte>.Zero).AsInt64());
                    } while (63 < lengthToExamine);

                    var sumX = Avx2.ExtractVector128(sum, 0);
                    var sumY = Avx2.ExtractVector128(sum, 1);
                    var sumZ = Sse2.Add(sumX, sumY);

                    result += (sumZ.GetElement(0) + sumZ.GetElement(1));
                }

                if (15 < lengthToExamine) {
                    var sum = Vector256<long>.Zero;

                    do {
                        sum = Avx2.Add(sum, Avx2.SumAbsoluteDifferences(Avx2.Subtract(Vector256<ushort>.Zero, Avx2.CompareEqual(searchMask, LoadVector256(ref searchSpace, offset))).AsByte(), Vector256<byte>.Zero).AsInt64());
                        lengthToExamine -= 16;
                        offset += 16;
                    } while (15 < lengthToExamine);

                    var sumX = Avx2.ExtractVector128(sum, 0);
                    var sumY = Avx2.ExtractVector128(sum, 1);
                    var sumZ = Sse2.Add(sumX, sumY);

                    result += (sumZ.GetElement(0) + sumZ.GetElement(1));
                }

                if (offset < length) {
                    lengthToExamine = (length - offset);

                    goto SequentialScan;
                }
            }
            else if (Sse2.IsSupported) {
                lengthToExamine = GetCharVector128SpanLength(offset, length);

                var searchMask = Vector128.Create(value);

                if (31 < lengthToExamine) {
                    var sum = Vector128<long>.Zero;

                    do {
                        var accumulator0 = Vector128<ushort>.Zero;
                        var accumulator1 = Vector128<ushort>.Zero;
                        var accumulator2 = Vector128<ushort>.Zero;
                        var accumulator3 = Vector128<ushort>.Zero;
                        var loopIndex = 0;
                        var loopLimit = Math.Min(255, (lengthToExamine / 32));

                        do {
                            accumulator0 = Sse2.Subtract(accumulator0, Sse2.CompareEqual(searchMask, LoadVector128(ref searchSpace, offset)));
                            accumulator1 = Sse2.Subtract(accumulator1, Sse2.CompareEqual(searchMask, LoadVector128(ref searchSpace, (offset + 8))));
                            accumulator2 = Sse2.Subtract(accumulator2, Sse2.CompareEqual(searchMask, LoadVector128(ref searchSpace, (offset + 16))));
                            accumulator3 = Sse2.Subtract(accumulator3, Sse2.CompareEqual(searchMask, LoadVector128(ref searchSpace, (offset + 24))));
                            loopIndex++;
                            offset += 32;
                        } while (loopIndex < loopLimit);

                        lengthToExamine -= (32 * loopLimit);
                        sum = Sse2.Add(sum, Sse2.SumAbsoluteDifferences(accumulator0.AsByte(), Vector128<byte>.Zero).AsInt64());
                        sum = Sse2.Add(sum, Sse2.SumAbsoluteDifferences(accumulator1.AsByte(), Vector128<byte>.Zero).AsInt64());
                        sum = Sse2.Add(sum, Sse2.SumAbsoluteDifferences(accumulator2.AsByte(), Vector128<byte>.Zero).AsInt64());
                        sum = Sse2.Add(sum, Sse2.SumAbsoluteDifferences(accumulator3.AsByte(), Vector128<byte>.Zero).AsInt64());
                    } while (31 < lengthToExamine);

                    result += (sum.GetElement(0) + sum.GetElement(1));
                }

                if (7 < lengthToExamine) {
                    var sum = Vector128<long>.Zero;

                    do {
                        sum = Sse2.Add(sum, Sse2.SumAbsoluteDifferences(Sse2.Subtract(Vector128<ushort>.Zero, Sse2.CompareEqual(searchMask, LoadVector128(ref searchSpace, offset))).AsByte(), Vector128<byte>.Zero).AsInt64());
                        lengthToExamine -= 8;
                        offset += 8;
                    } while (7 < lengthToExamine);

                    result += (sum.GetElement(0) + sum.GetElement(1));
                }

                if (offset < length) {
                    lengthToExamine = (length - offset);

                    goto SequentialScan;
                }
            }
        }

        return ((int)result);
    }
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static unsafe nuint UnalignedCountVector128(ref byte searchSpace) {
        nint unaligned = ((nint)Unsafe.AsPointer(ref searchSpace) & (Vector128<byte>.Count - 1));

        return ((nuint)(uint)((Vector128<byte>.Count - unaligned) & (Vector128<byte>.Count - 1)));
    }
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static unsafe nint UnalignedCountVector128(ref char searchSpace) {
        const int ElementsPerByte = (sizeof(ushort) / sizeof(byte));

        return ((nint)(uint)(-(int)Unsafe.AsPointer(ref searchSpace) / ElementsPerByte) & (Vector128<ushort>.Count - 1));
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int OccurrencesOf(this ReadOnlySpan<byte> span, byte value) =>
        OccurrencesOf(
            length: span.Length,
            searchSpace: ref MemoryMarshal.GetReference(span),
            value: value
        );
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int OccurrencesOf(this Span<byte> span, byte value) =>
        ((ReadOnlySpan<byte>)span).OccurrencesOf(value);
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int OccurrencesOf(this ReadOnlySpan<char> span, char value) =>
        OccurrencesOf(
            length: span.Length,
            searchSpace: ref MemoryMarshal.GetReference(span),
            value: value
        );
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int OccurrencesOf(this Span<char> span, char value) =>
        ((ReadOnlySpan<char>)span).OccurrencesOf(value);
}
于 2021-09-27T18:27:03.057 回答
1

这是 C 中的快速 SSE2 实现:

size_t memcount_sse2(const void *s, int c, size_t n) {
   __m128i cv = _mm_set1_epi8(c), sum = _mm_setzero_si128(), acr0,acr1,acr2,acr3;
    const char *p,*pe;                                                                         
    for(p = s; p != (char *)s+(n- (n % (252*16)));) { 
      for(acr0 = acr1 = acr2 = acr3 = _mm_setzero_si128(),pe = p+252*16; p != pe; p += 64) { 
        acr0 = _mm_add_epi8(acr0, _mm_cmpeq_epi8(cv, _mm_loadu_si128((const __m128i *)p))); 
        acr1 = _mm_add_epi8(acr1, _mm_cmpeq_epi8(cv, _mm_loadu_si128((const __m128i *)(p+16)))); 
        acr2 = _mm_add_epi8(acr2, _mm_cmpeq_epi8(cv, _mm_loadu_si128((const __m128i *)(p+32)))); 
        acr3 = _mm_add_epi8(acr3, _mm_cmpeq_epi8(cv, _mm_loadu_si128((const __m128i *)(p+48))));
        __builtin_prefetch(p+1024);
      }
      sum = _mm_add_epi64(sum, _mm_sad_epu8(_mm_sub_epi8(_mm_setzero_si128(), acr0), _mm_setzero_si128()));
      sum = _mm_add_epi64(sum, _mm_sad_epu8(_mm_sub_epi8(_mm_setzero_si128(), acr1), _mm_setzero_si128()));
      sum = _mm_add_epi64(sum, _mm_sad_epu8(_mm_sub_epi8(_mm_setzero_si128(), acr2), _mm_setzero_si128()));
      sum = _mm_add_epi64(sum, _mm_sad_epu8(_mm_sub_epi8(_mm_setzero_si128(), acr3), _mm_setzero_si128()));
    }

    // may require SSE4, rewrite this part for actual SSE2.
    size_t count = _mm_extract_epi64(sum, 0) + _mm_extract_epi64(sum, 1);

    // scalar cleanup.  Could be optimized.
    while(p != (char *)s + n) count += *p++ == c;
    return count;
}

并参见:https : //gist.github.com/powturbo 和 avx2 实现。

于 2018-04-09T21:15:31.333 回答
1

感谢Marc Gravell的提示,以下工作:

var areEqual = Vector.Equals(vBytes, mask);
var negation = Vector.Negate(areEqual);
var count = Vector.Dot(negation, Vector<byte>.One);

Marc 有一篇文,其中包含有关该主题的更多信息。

于 2018-03-29T10:10:29.053 回答