c++ - 如何在 C++ 中快速比较可变长度位字符串？

Question

我正在根据一组特征的二进制存在或不存在对对象进行比较。这些特征可以用一个位串来表示，例如：

该位串具有第一、第四和第五个特征。

我正在尝试将一对位串的相似性计算为两者共有的特征的数量。对于给定的一组位字符串，我知道它们都将具有相同的长度，但在编译时我不知道该长度是多少。

例如，这两个字符串有两个共同的特征，所以我希望相似度函数返回 2：

s(10011,10010) = 2

如何有效地表示和比较 C++ 中的位串？

score 10 · Accepted Answer

您可以使用std::bitsetSTL 类。

它们可以由位串、ANDed 构建，并计算 1 的个数：

#include <string>
#include <bitset>

int main()
{
  std::bitset<5> option1(std::string("10011")), option2(std::string("10010"));
  std::bitset<5> and_bit = option1 & option2; //bitset will have 1s only on common options
  size_t s = and_bit.count ();                //return the number of 1 in the bitfield
  return 0;
}

编辑

如果编译时位数未知，您可以使用boost::dynamic_bitset<>：

boost::dynamic_bitset<> option(bit_string);

示例的其他部分没有改变，因为boost::dynamic_bitset<>与std::bitset.

score 3 · Accepted Answer

更快的算法：

int similarity(unsigned int a, unsigned int b)
{
   unsigned int r = a & b;
   r = ( r & 0x55555555 ) + ((r >> 1) & 0x55555555 );
   r = ( r & 0x33333333 ) + ((r >> 2) & 0x33333333 );
   r = ( r & 0x0f0f0f0f ) + ((r >> 4) & 0x0f0f0f0f );
   r = ( r & 0x00ff00ff ) + ((r >> 8) & 0x00ff00ff );
   r = ( r & 0x0000ffff ) + ((r >>16) & 0x0000ffff );
   return r;
}

int main() {
        unsigned int a = 19 ;//10011
        unsigned int b = 18 ;//10010
        cout << similarity(a,b) << endl; 
        return 0;
}

输出：

ideone 演示：http ://www.ideone.com/bE4qb

score 2 · Accepted Answer

由于您在编译时不知道位长，因此可以boost::dynamic_bitset使用std::bitset.

然后，您可以使用operator&(或&=) 查找公共位，并使用boost::dynamic_bitset::count().

性能取决于。对于最大速度，取决于您的编译器，您可能必须自己实现循环，例如使用@Nawaz 的方法或来自Bit Twiddling Hacks的方法，或者通过使用 sse/popcount/etc 的汇编器/编译器内在函数编写循环。

请注意，至少 llvm、gcc 和 icc 检测到许多此类模式并为您优化事物，因此在进行手动工作之前分析/检查生成的代码。

score 1 · Accepted Answer

使用 a std::bitset，如果您的特征集小于 long 中的位数（我认为它是 long ），您可以获得位的 unsigned long 表示，然后和两个值，并从这里使用位旋转技巧数。

如果您想继续使用字符串来表示您的位模式，您可以使用zip_iteratorfrom boost 执行以下操作。

#include <iostream>
#include <string>
#include <algorithm>

#include <boost/tuple/tuple.hpp>
#include <boost/iterator/zip_iterator.hpp>

struct check_is_set :
  public std::unary_function<const boost::tuple<char const&, char const&>&, bool>
{
  bool operator()(const boost::tuple<char const&, char const&>& t) const
  {
    const char& cv1 = boost::get<0>(t);
    const char& cv2 = boost::get<1>(t);
    return cv1 == char('1') && cv1 == cv2;
  }
};

size_t count_same(std::string const& opt1, std::string const& opt2)
{
  std::string::const_iterator beg1 = opt1.begin();
  std::string::const_iterator beg2 = opt2.begin();

  // need the same number of items for end (this really is daft, you get a runtime
  // error if the sizes are different otherwise!! I think it's a bug in the
  // zip_iterator implementation...)
  size_t end_s = std::min(opt1.size(), opt2.size());
  std::string::const_iterator end1 = opt1.begin() + end_s;
  std::string::const_iterator end2 = opt2.begin() + end_s;

  return std::count_if(
  boost::make_zip_iterator(
    boost::make_tuple(beg1, beg2)
    ),
  boost::make_zip_iterator(
    boost::make_tuple(end1, end2)
    ),
    check_is_set()
  );
}

int main(void)
{
  std::string opt1("1010111");
  std::string opt2("001101");

  std::cout << "same: " << count_same(opt1, opt2) << std::endl;

  return 0;
}

c++ - 如何在 C++ 中快速比较可变长度位字符串？

4 回答 4

Related

Reference