c++ - Fastest way to split a 32-bit number into bytes in C++

Question

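I'm writing a piece of code that performs some data compression on CLSID structures. I store them as a compressed stream of 128-bit integers. However, the code in question has to be able to place invalid CLSIDs into the stream. To do this, I have left them as one big string. On disk, it looks something like this: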

+--------------------------+-----------------+------------------------+
|                          |                 |                        |
| Length of Invalid String | Invalid String  | Compressed Data Stream |
|                          |                 |                        |
+--------------------------+-----------------+------------------------+

To encode the length of the string, I need to output the 32-bit integer that holds the string's length, one byte at a time. Here's my current code:

std::vector<BYTE> compressedBytes;
DWORD invalidLength = (DWORD) invalidClsids.length();
compressedBytes.push_back((BYTE) ( invalidLength        & 0x000000FF));
compressedBytes.push_back((BYTE) ((invalidLength >>= 8) & 0x000000FF));
compressedBytes.push_back((BYTE) ((invalidLength >>= 8) & 0x000000FF));
compressedBytes.push_back((BYTE)  (invalidLength >>= 8));

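This code won't be called often, but a similar structure in the decoding stage will be called many thousands of times. I'm curious whether this is the most efficient method, or whether someone can come up with a better one.

Thanks all!

Billy3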

EDIT: After looking over some of the answers, I created this mini test program to see which was the fastest:

// temp.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <windows.h>
#include <cstdio>   // std::printf
#include <cstdlib>  // system
#include <ctime>
#include <iostream>
#include <string>   // std::string
#include <vector>

void testAssignedShifts();
void testRawShifts();
void testUnion();

int _tmain(int argc, _TCHAR* argv[])
{
    std::clock_t startTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testAssignedShifts();
    }
    std::clock_t assignedShiftsFinishedTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testRawShifts();
    }
    std::clock_t rawShiftsFinishedTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testUnion();
    }
    std::clock_t unionFinishedTime = std::clock();
    std::printf(
        "Execution time for assigned shifts: %08u clocks\n"
        "Execution time for raw shifts:      %08u clocks\n"
        "Execution time for union:           %08u clocks\n\n",
        assignedShiftsFinishedTime - startTime,
        rawShiftsFinishedTime - assignedShiftsFinishedTime,
        unionFinishedTime - rawShiftsFinishedTime);
    startTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testAssignedShifts();
    }
    assignedShiftsFinishedTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testRawShifts();
    }
    rawShiftsFinishedTime = std::clock();
    for (register unsigned __int32 forLoopTest = 0; forLoopTest < 0x008FFFFF; forLoopTest++)
    {
        testUnion();
    }
    unionFinishedTime = std::clock();
    std::printf(
        "Execution time for assigned shifts: %08u clocks\n"
        "Execution time for raw shifts:      %08u clocks\n"
        "Execution time for union:           %08u clocks\n\n"
        "Finished. Terminate!\n\n",
        assignedShiftsFinishedTime - startTime,
        rawShiftsFinishedTime - assignedShiftsFinishedTime,
        unionFinishedTime - rawShiftsFinishedTime);

    system("pause");
    return 0;
}

void testAssignedShifts()
{
    std::string invalidClsids("This is a test string");
    std::vector<BYTE> compressedBytes;
    DWORD invalidLength = (DWORD) invalidClsids.length();
    compressedBytes.push_back((BYTE)  invalidLength);
    compressedBytes.push_back((BYTE) (invalidLength >>= 8));
    compressedBytes.push_back((BYTE) (invalidLength >>= 8));
    compressedBytes.push_back((BYTE) (invalidLength >>= 8));
}
void testRawShifts()
{
    std::string invalidClsids("This is a test string");
    std::vector<BYTE> compressedBytes;
    DWORD invalidLength = (DWORD) invalidClsids.length();
    compressedBytes.push_back((BYTE) invalidLength);
    compressedBytes.push_back((BYTE) (invalidLength >>  8));
    compressedBytes.push_back((BYTE) (invalidLength >>  16));
    compressedBytes.push_back((BYTE) (invalidLength >>  24));
}

typedef union _choice
{
    DWORD dwordVal;
    BYTE bytes[4];
} choice;

void testUnion()
{
    std::string invalidClsids("This is a test string");
    std::vector<BYTE> compressedBytes;
    choice invalidLength;
    invalidLength.dwordVal = (DWORD) invalidClsids.length();
    compressedBytes.push_back(invalidLength.bytes[0]);
    compressedBytes.push_back(invalidLength.bytes[1]);
    compressedBytes.push_back(invalidLength.bytes[2]);
    compressedBytes.push_back(invalidLength.bytes[3]);
}

Running it a few times gives:

Execution time for assigned shifts: 00012484 clocks
Execution time for raw shifts:      00012578 clocks
Execution time for union:           00013172 clocks

Execution time for assigned shifts: 00012594 clocks
Execution time for raw shifts:      00013140 clocks
Execution time for union:           00012782 clocks

Execution time for assigned shifts: 00012500 clocks
Execution time for raw shifts:      00012515 clocks
Execution time for union:           00012531 clocks

Execution time for assigned shifts: 00012391 clocks
Execution time for raw shifts:      00012469 clocks
Execution time for union:           00012500 clocks

Execution time for assigned shifts: 00012500 clocks
Execution time for raw shifts:      00012562 clocks
Execution time for union:           00012422 clocks

Execution time for assigned shifts: 00012484 clocks
Execution time for raw shifts:      00012407 clocks
Execution time for union:           00012468 clocks

It looks like roughly a tie between assigned shifts and the union. Since I'm going to need the value later on, union it is! Thanks!

Billy3

score 8 · Accepted Answer

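This is probably about as optimized as you'll get. Bit-twiddling operations are some of the fastest available on the processor.

Using >> 16 and >> 24 instead of >>= 8, >>= 8 may be faster, because you cut down on assignments.

Also, I don't think you need the &: since you're casting to a BYTE (which should be an 8-bit char), the value gets truncated appropriately anyway. (Does it? Correct me if I'm wrong.)

All in all, though, these are really minor changes. Profile it to see whether it actually makes a difference :P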
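As a rough illustration of the raw-shift variant and the matching decoder the question asks about, here is a minimal sketch. The names `appendUint32LE` and `readUint32LE` are hypothetical, and `std::uint8_t` / `std::uint32_t` stand in for `BYTE` / `DWORD` so the snippet compiles without `<windows.h>`.

```cpp
#include <cstdint>
#include <vector>

// Append a 32-bit value to the stream, least significant byte first.
// The cast to an 8-bit type keeps only the low byte, so no mask is needed.
void appendUint32LE(std::vector<std::uint8_t>& out, std::uint32_t value)
{
    out.push_back(static_cast<std::uint8_t>(value));
    out.push_back(static_cast<std::uint8_t>(value >> 8));
    out.push_back(static_cast<std::uint8_t>(value >> 16));
    out.push_back(static_cast<std::uint8_t>(value >> 24));
}

// Rebuild the value from four bytes written by appendUint32LE.
// Each byte is widened before shifting so no bits are lost.
std::uint32_t readUint32LE(const std::uint8_t* p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

Because the byte order is fixed by the shifts rather than by the CPU, a stream written this way reads back to the same value on any host.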

score 6 · Accepted Answer

Just use a union:

assert(sizeof (DWORD) == sizeof (BYTE[4]));   // Sanity check

union either {
    DWORD dw;
    struct {
         BYTE b[4];
    } bytes;
};

either invalidLength;
invalidLength.dw = (DWORD) invalidClsids.length();
compressedBytes.push_back(invalidLength.bytes.b[0]);
compressedBytes.push_back(invalidLength.bytes.b[1]);
compressedBytes.push_back(invalidLength.bytes.b[2]);
compressedBytes.push_back(invalidLength.bytes.b[3]);

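NOTE: Unlike the bit-shifting approach in the original question, this code produces endian-dependent output. This matters only if output from a program run on one computer will be read on a computer with a different byte order. But since there seems to be no measurable speed increase from this method, you might as well use the more portable bit-shifting approach, just in case.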
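To make the endianness point concrete, here is a small sketch that compares the bytes as the host stores them with the bytes produced by shifting. The function names are hypothetical. It uses `std::memcpy` rather than a union, since reading a union member other than the one last written is formally undefined behaviour in ISO C++, even though many compilers accept it.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Bytes in whatever order the host CPU stores them
// (the same bytes a union-based split would expose).
std::array<std::uint8_t, 4> nativeBytes(std::uint32_t value)
{
    std::array<std::uint8_t, 4> bytes{};
    std::memcpy(bytes.data(), &value, sizeof value);
    return bytes;
}

// Bytes in a fixed order, least significant first, on every host.
std::array<std::uint8_t, 4> littleEndianBytes(std::uint32_t value)
{
    std::array<std::uint8_t, 4> bytes = {{
        static_cast<std::uint8_t>(value),
        static_cast<std::uint8_t>(value >> 8),
        static_cast<std::uint8_t>(value >> 16),
        static_cast<std::uint8_t>(value >> 24)
    }};
    return bytes;
}

// True when the host stores the least significant byte first.
bool hostIsLittleEndian()
{
    return nativeBytes(1u)[0] == 1;
}
```

On a little-endian host the two functions return identical arrays. On a big-endian host they differ, which is exactly the case where a file written with the union approach would be misread elsewhere.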

score 2 · Accepted Answer

You should measure rather than guess at any potential improvement, but my first thought is that it may be faster to use a union as follows:

typedef union {
    DWORD d;
    struct {
        BYTE b0;
        BYTE b1;
        BYTE b2;
        BYTE b3;
    } b;
} DWB;

std::vector<BYTE> compBytes;
DWB invLen;
invLen.d = (DWORD) invalidClsids.length();
compBytes.push_back(invLen.b.b3);
compBytes.push_back(invLen.b.b2);
compBytes.push_back(invLen.b.b1);
compBytes.push_back(invLen.b.b0);

That may be the right order for the push_back calls, but check to be sure, because it depends on the endianness of the CPU.

score 1 · Accepted Answer

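A really quick way is to treat a DWORD* (a single-element array) as a BYTE* (a 4-element array). The code is also much more readable.

Warning: I haven't compiled this.

Warning: this makes your code dependent on byte ordering.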

std::vector<BYTE> compressedBytes;
DWORD invalidLength = (DWORD) invalidClsids.length();
BYTE* lengthParts = reinterpret_cast<BYTE*>(&invalidLength); // view the DWORD's storage as bytes
static const int kLengthPartsLength = sizeof(DWORD) / sizeof(BYTE);
for(int i = 0; i < kLengthPartsLength; ++i)
    compressedBytes.push_back(lengthParts[i]);

score 1 · Accepted Answer

compressedBytes.push_back(invalidLength.bytes.b[0]);
compressedBytes.push_back(invalidLength.bytes.b[1]);
compressedBytes.push_back(invalidLength.bytes.b[2]);
compressedBytes.push_back(invalidLength.bytes.b[3]);

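There is an even smarter and faster way! Let's look at what this code is doing and how we can improve it.

This code serializes the integer one byte at a time. For each byte it calls push_back, which checks for free space in the vector's internal buffer. If there is no room for another byte, a memory reallocation happens (hint: slow!). Granted, reallocation will not happen often (the buffer typically grows by doubling). Then the new byte is copied and the internal size is increased by one.

vector<> is required by the standard to keep its internal buffer contiguous. vector<> also provides operator[] (), so &v[0] yields a pointer to that buffer.

So, here is the best code you can come up with: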

std::string invalidClsids("This is a test string");
std::vector<BYTE> compressedBytes;
DWORD invalidLength = (DWORD) invalidClsids.length();
compressedBytes.resize(sizeof(DWORD)); // You probably want to make this much larger, to avoid resizing later.
// compressedBytes is as large as the length we want to serialize.
BYTE* p = &compressedBytes[0]; // This is valid code and designed by the standard for such cases. p points to a buffer that is at least as large as a DWORD.
*((DWORD*)p) = invalidLength;  // Copy all bytes in one go!

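The cast above could be folded into the &compressedBytes[0] statement and done in one go, but that would not be any faster. This way is more readable.

NOTE! Serializing this way (or even with the union method) is endian-dependent. That is, on Intel/AMD processors the least significant byte comes first, while on big-endian machines (PowerPC, Motorola...) the most significant byte comes first. If you want to stay neutral, you must use an arithmetic method (shifts).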
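For comparison, the reallocation cost described above can also be reduced without a pointer cast. The sketch below is one possible approach, not the method from this answer: it reserves space once and appends all four bytes with a single insert, using shifts so the output stays endian-neutral. The function name `appendLengthPrefix` and the `expectedPayload` parameter are hypothetical. It also sidesteps a caveat of writing through `(DWORD*)p`, which may run into alignment or strict-aliasing problems when the write lands at an arbitrary offset in the byte buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 32-bit length prefix, least significant byte first, with one
// capacity reservation and one insert instead of four push_back calls.
// expectedPayload is a caller-supplied guess at how many more bytes follow.
void appendLengthPrefix(std::vector<std::uint8_t>& out,
                        std::uint32_t length,
                        std::size_t expectedPayload)
{
    // Reserve room for the prefix and the payload up front so later
    // appends do not trigger a reallocation.
    out.reserve(out.size() + sizeof length + expectedPayload);

    const std::uint8_t bytes[4] = {
        static_cast<std::uint8_t>(length),
        static_cast<std::uint8_t>(length >> 8),
        static_cast<std::uint8_t>(length >> 16),
        static_cast<std::uint8_t>(length >> 24)
    };
    out.insert(out.end(), bytes, bytes + 4);
}
```

Whether this beats four push_back calls in practice would need to be measured. The main gain is expected to come from the single reserve when the payload is appended afterwards.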

score 0 · Accepted Answer

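Do you have to do it one byte at a time? Is there a way you could just memcpy() the whole 32 bits into the stream in one fell swoop? If you have the address of the buffer you're writing to the stream, can you just copy into that?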
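A minimal sketch of that idea, assuming the stream is a `std::vector` of bytes as in the question. The names `appendNative` and `readNative` are hypothetical. Note that this copies the value in the host's byte order, so the result is endian-dependent.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Grow the stream by four bytes and copy the value's in-memory
// representation into the new space with a single memcpy.
void appendNative(std::vector<std::uint8_t>& out, std::uint32_t value)
{
    const std::size_t offset = out.size();
    out.resize(offset + sizeof value);
    std::memcpy(out.data() + offset, &value, sizeof value);
}

// Read a value back from the given offset. memcpy works for any
// offset, so the source does not have to be 4-byte aligned.
// The caller must ensure offset + 4 does not exceed in.size().
std::uint32_t readNative(const std::vector<std::uint8_t>& in, std::size_t offset)
{
    std::uint32_t value = 0;
    std::memcpy(&value, in.data() + offset, sizeof value);
    return value;
}
```

Compilers commonly turn a fixed-size memcpy like this into a single load or store, so it tends to be as cheap as the pointer cast while staying within the language rules.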

score 0 · Accepted Answer

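Maybe you could take a pointer to the 32-bit variable, convert it to a char pointer and read a char, then add 1 to the pointer and read the next char. Just theory :) I don't know whether it works.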
