c++ - 大写单词的最快方法

Question

使用 C++ 大写单词（std::string）的最快方法是什么？

在使用带有 -O3 标志的 g++ 4.6.3 的 Debian Linux 上，此函数使用boost::to_lower将在 AMD Phenom(tm) II X6 1090T 处理器 (3200 MHz) 上的单个执行线程中在大约 24 秒内大写 81,450,625 个字。

void Capitalize( std::string& word )
{
    boost::to_lower( word );
    word[0] = toupper( word[0] );
}

这个函数使用std::transform大约 10 秒做同样的事情。我在测试之间清除了虚拟机，所以我不认为这种差异是侥幸：

sync && echo 3 > /proc/sys/vm/drop_caches

void Capitalize( std::string& word )
{
    std::transform(word.begin(), word.end(), word.begin(), ::tolower);
    word[0] = toupper( word[0] );
}

有更快的方法吗？我不想为了速度而失去可移植性，但是如果有更快的方法可以在 std C++ 或带有 boost 的 std C++ 中工作，我想尝试一下。

谢谢。

score 6 · Accepted Answer

在处理不能保证输入为大写的 DNA 序列以及boost::to_upper代码中的瓶颈在哪里时，有这个确切的问题。更改为：

template<typename T_it>
void SequenceToUpperCase( T_it begin, T_it end )
{
    // Convert to upper: clear the '32' bit, 0x20 in hex. And with the
    // inverted bit string (~).
    for ( auto it = begin; it != end; ++it )
        *it &= ~0x20;
}

带来了巨大的速度提升。我确信可以通过例如一次翻转 8 个字节来进一步优化，但是使用上面的代码，大写对我们来说几乎是瞬时的。对于小写：做：

        *it |= 0x20;

score 5 · Accepted Answer

使其更快的几种方法：
1. 不要使用to_lower，它很慢。也不要使用ifs，创建一个包含 256 个条目的表，这些条目从字符映射到其小写版本，另一个表用于大写。
2.不要使用transform，取一个指向第一个字符的指针并循环直到空终止符。
3. 如果内存不是问题，请使用映射 2 个字符序列的表。在这种情况下，您将需要另一个处理终止的表。
4.如果你能在汇编中做到这一点，它会快得多。

score 2 · Accepted Answer

如果大写是一个真正的瓶颈，那么使用手写循环和内联 toupper/tolower 函数编写自己的大写实现。如有必要，请使用 ASM。

score 1 · Accepted Answer

我有一个实现，我发现它比 std::transform 更快，在 g++ -03 Fedora 18 中编译。

以秒为单位的表演时间：
转换耗时：11 秒
我的实施耗时：2 s
测试数据大小 = 26*15*9999999 个字符

inline void tolowerPtr(char *p) ;

inline void tolowerStr(std::string& s)
{char* c=const_cast<char*>(s.c_str());
size_t l = s.size();
  for(char* c2=c;c2<c+l;c2++)tolowerPtr(c2); 
};

inline void tolowerPtr(char *p) 
{
switch(*p)
{
  case 'A':*p='a'; return;
  case 'B':*p='b'; return;
  case 'C':*p='c'; return;
  case 'D':*p='d'; return;
  case 'E':*p='e'; return;
  case 'F':*p='f'; return;
  case 'G':*p='g'; return;
  case 'H':*p='h'; return;
  case 'I':*p='i'; return;
  case 'J':*p='j'; return;
  case 'K':*p='k'; return;
  case 'L':*p='l'; return;
  case 'M':*p='m'; return;
  case 'N':*p='n'; return;
  case 'O':*p='o'; return;
  case 'P':*p='p'; return;
  case 'Q':*p='q'; return;
  case 'R':*p='r'; return;
  case 'S':*p='s'; return;
  case 'T':*p='t'; return;
  case 'U':*p='u'; return;
  case 'V':*p='v'; return;
  case 'W':*p='w'; return;
  case 'X':*p='x'; return;
  case 'Y':*p='y'; return;
  case 'Z':*p='z'; return;
};
return ;
}

void testtransform( std::string& word )
{
std::string word2=word; 
time_t t;
time_t t2;
time(&t);
std::cout << "testtransform: start " << "\n";
int i=0;
for(;i<9999999;i++) 
{    word2=word;
    std::transform(word2.begin(), word2.end(), word2.begin(), ::tolower);
}
time(&t2);
std::cout << word2 << "\n";
std::cout << "testtransform: end " << i << ":"<< t2-t << "\n";
}

void testmytolower( std::string& word )
{
std::string word2=word; 
time_t t;
time_t t2;
time(&t);
std::cout << "testmytolower: start " << "\n";
int i=0;
for(;i<9999999;i++)
{   word2=word;
    cstralgo::tolowerStr(word2);
}
time(&t2);
std::cout << word2 << "\n";
std::cout << "testmytolower: end " << i << ":"<< t2-t << "\n";
}

int main(int argc, char* argv[])
{
   std::string word ="ABCDEFGHIJKLMNOPQRSTUVWXYZ";
   word =word+word+word+word+word+word+word+word+word+word+word+word+word+word+word;
   testtransform( word);
   testmytolower( word);
   return 0;
}

我很高兴知道性能是否可以进一步提高。

score 0 · Accepted Answer

I would use a for loop to go through the entire string, character by character, parsing them to a function to convert to uppercase. To hopefully speed things up, I'll write the capitalisation function in assembly. The C++ component of the application would be like so:

#include <iostream>
#include <string>

extern "C" char asm_capt(char x);

using namespace std;

int main(void)
{
    string str;
    cin >> str;
    string tmp = str;

    for(int i=0; i<str.length(); i++){

        tmp.at(i) = asm_capt(str.at(i));
    }

}

Now comes the assembly part; I'll assume you are using Windows and the MASM compiler. The code should be saved within a .asm source file, and included in the project with MASM build settings:

.model flat
.code
_asm_capt proc

    mov rax, rcx
    cmp rax, 61h
    jl already_capt

    sub rax, 20h
    ret

    already_capt:

    ret

_asm_capt endp
end

Essentially, it checks if the character (in hexadecimal) is less than 0x61, which would imply, if you only use letters, that it is already capitalised. otherwise, the value is decreased by 0x20, which moves it to the lower case equivalent. (See ASCII table.)

Note: By default, the return parameter is stored in the RAX register; the first parameter passed by C++ to assembly is stored in RCX.

score -1 · Accepted Answer

Required pseudocode would be:



for(int i=0 to given_str.length)

{ upper[i]=(char*)given_str[i]-32;  // in ascii tbl, any lower case char - upper case char=32

}

return upper;

c++ - 大写单词的最快方法

6 回答 6

Related

Reference