c++ - 从整数向量列表中删除重复项的快速方法

Question

假设我们有一个函数，它返回一百万个长度为 30 的整数向量，每个向量具有较小的条目（例如，介于 -100 和 100 之间）。进一步假设输出只有大约 30000 个唯一向量，其余的都是重复的。检索唯一输出向量列表的良好数据结构和算法是什么？优选地，当 3% 的唯一向量的比例大致恒定时，该解决方案应该可以很好地扩展。

这个问题主要是关于数据结构的，但我打算使用 STL 在 C++ 中实现它，所以也欢迎任何关于实现的提示。

朴素的算法是存储已知向量的列表（可能按字典顺序排序）。当一个新向量到达时，我们可以使用循环检查它是否已经在列表中（或在排序列表中搜索）。
散列：假设向量存储在 C 数组中。什么是整数向量的好散列函数？我看到的一个缺点是每个向量的每个组件都至少被触摸一次。这似乎已经太多了。
任何树数据结构都好吗？例如，我们可以将所有可见向量的第一个分量中的值存储为根，然后将第二个分量中的值存储为它们的子项，...

我没有计算机科学背景。我也很乐意提供一些文献资料，我可以在这些文献中学习如何解决这些问题。

score 3 · Accepted Answer

您提出的建议有时被称为后备表。用于各种查找目的的辅助表。在您的情况下，您有许多不同的可能方式来组织此表。最明显的是不组织它，使用线性搜索来查看下一个元素是否已知。由于该表最终将包含大约 30000 个元素，因此这可能不是一个好主意。从标准库（至少在 C++11 中），有两种可能性：std::set和 std::unordered_set. std::set使用某种形式的平衡树，因此每次查找最多进行 lg n比较（对于 30000 个元素，大约 15 个）；std::unordered_set是一个哈希表，并且具有良好的哈希函数，将需要尽可能少的常量比较：您应该能够将其平均降低到 2 以下（但可能会以更多内存为代价 - 负载因子越低，碰撞的概率越小）。正如你提到的，你做到了有计算哈希函数的额外成本，正如您所指出的，这确实涉及访问向量中的每个元素；在二叉树中，每次比较所需的只是比较足够的元素以确定顺序——在许多情况下，可能只有一个或两个。（但是，如果您说有很多重复项......在您访问所有 30 个条目之前，您无法检测到重复项，因为任何一个条目都可能会有所不同。）了解哪种解决方案实际上更快的唯一方法是测量两者都使用典型数据；对于您描述的数据集（许多重复项），我怀疑哈希表会获胜，但这远非确定。

最后，您可以使用某种非二叉树。如果您真的可以将值限制在特定范围内（例如 -100..100），则可以使用带有指向子节点的指针的普通向量或数组，直接使用元素值进行索引，并根据需要进行转置。然后，您只需遍历树，直到找到空指针，或者到达终点。树的最大深度为 30，实际上，每个元素的深度为 30，但通常情况下，您会发现元素在深度之前是唯一的。我怀疑（但同样，你需要测量）在你的情况下，有很多重复，这实际上会比前两个建议慢得多。（而且你需要做更多的工作，因为我不知道任何现有的实现。）

至于散列，几乎任何形式的线性全等散列就足够了：例如 FNV。大多数此类哈希的文档都涉及字符串（的数组 char），但它们往往与任何整数类型一样好。我通常使用类似的东西：

template <typename ForwardIterator>
size_t
hash( ForwardIterator begin, ForwardIterator end )
{
    size_t results = 2166136261U 
    for ( ForwardIterator current = begin; current != end; ++ current ) {
        results = 127 * results + static_cast<size_t>( *current );
    }
    return results;
}

我127作为乘数的选择主要是基于旧系统中的速度：乘以 127 比大多数其他能提供良好结果的值要快得多。（我不知道这是否仍然正确。但是在许多机器上乘法仍然是一个相对较慢的操作，编译器会转换127 * x成类似x << 7 - xif that is faster.) 使用上述算法的分布与FNV，至少对于我测试过的数据集。

score 1 · Accepted Answer

1

基数映射是理想的，但您需要实现它，因为 std 库中没有实现。

于 2013-03-31T10:00:27.337 回答

score 1 · Accepted Answer

计算第一个向量中的值的 CRC 表示。您现在有一个代表您的 30 个值的数字。相对于其余向量，该数字可能是唯一的，但不能保证。

以 CRC 值作为键，以及指向实际向量的指针并将其插入到多映射 {CRC, VectorPointer} 中。

现在为每个剩余的向量计算 CRC，并在 multimap 中查找。

如果找不到，请插入 {CRC, VectorPointer}。如果确实找到了，请遍历匹配项并比较数据元素以确定它是否相同。如果是丢弃新向量。如果不是，则插入 {CRC, VectorPointer}。

冲洗并重复，直到处理完所有 30,000 个向量。

您在多图中拥有可迭代的唯一集。

score 1 · Accepted Answer

假设你有 N 个长度为 K 的向量，其中只有 M 个是唯一的。

哈希+哈希图

您可以在 O(K) 时间内计算每个向量的哈希值，检查哈希图中是否已经有这样的向量，并在 O(1) 时间内插入新向量。对于散列函数，您可以简单地使用不带模数的多项式散列，只需将散列存储为 64 位类型并忽略溢出。实现非常简单，它将在需要 O(M*K) 内存的 O(N*K) 时间内工作。如果您需要先对元素进行排序，时间将是 O(N*K*log(K))

基数树

我认为你不应该在这里使用基数树，因为你仍然需要查看每个向量的每个元素。之所以如此，是因为如果你在一棵树中没有这样的向量，你需要插入它的所有元素，如果你有这样的向量，你需要下到树的叶子上才能看到你以前真的插入过这样的向量。所以渐近线保持不变，但你需要自己实现树，这不是一个好主意:)

看起来很容易表明您至少需要阅读向量的所有元素。之所以如此，是因为在每一刻你都有两种可能性——你之前已经找到了当前向量，你需要读取它的所有元素到最后来识别它，或者你之前没有找到当前向量，你需要读取它的所有元素对它们进行排序和保存。然而，如果向量已经排序，您将只需要读取第一个不匹配的元素。但是让我们想象一下，前 30000 个向量是唯一的，那么无论您将使用什么算法或数据结构，您都需要读取所有其他向量到最后以确定它们不是唯一的。最后我们知道你需要阅读几乎所有的向量到最后:)

如果您的值确实在范围内 (-100, 100) 并且向量中只有 30 个值，您会注意到这样的向量可以保存为四个 64 位整数，因为其中只有8*30 = 240数据位。但这只是另一个想法，我认为使用它的任何实现都不会比散列+散列图更快。

score 0 · Accepted Answer

散列：......我看到的一个缺点是每个向量的每个组件都至少被触摸一次。这似乎已经太多了。

在最坏的情况下，你怎么能比较两个向量而不至少看一次呢？不，真的，如果你有 1,1,1 和 2,2,2 比较/匹配立即结束。但是如果你有 1,2,3 和 1,2,3 呢？

无论如何，这是解决问题的一种方法。实施肯定可以改进。

#include <iostream>
#include <map>
#include <vector>
#include <list>
#include <cstdint>
#include <cstdlib>
#include <ctime>

using namespace std;

const int TotalVectorCount = 1000000;
const int UniqueVectorCount = 30000;
const int VectorLength = 30;

typedef vector<int> Vector;

typedef unsigned long long uint64;

void GenerateRandomVector(Vector& v)
{
  v.reserve(VectorLength);
  // generate 30 values from -100 to +100
  for (int i = 0; i < VectorLength; i++)
    v.push_back(rand() % 201 - 100);
}

bool IdenticalVectors(const Vector& v1, const Vector& v2)
{
  for (int i = 0; i < VectorLength; i++)
    if (v1[i] != v2[i])
      return false;

  return true;
}

// this lets us do "cout << Vector"
ostream& operator<<(ostream& os, const Vector& v)
{
  for (int i = 0; i < VectorLength; i++)
    os << v[i] << ' ';

  return os;
}

uint64 HashVector(const Vector& v)
{
  // this is probably a bad hash function,
  // but it seems to work nonetheless
  uint64 h = 0x7FFFFFFFFFFFFFE7;
  for (int i = 0; i < VectorLength; i++)
    h = h * 0xFFFFFFFFFFFFFFC5 + v[i];
  return h & 0xFFFFFFFFFFFFFFFF;
}

Vector UniqueTestVectors[UniqueVectorCount];

void GenerateUniqueTestVectors()
{
  map<uint64,char> m;
  for (int i = 0; i < UniqueVectorCount; i++)
  {
    for (;;)
    {
      GenerateRandomVector(UniqueTestVectors[i]);
      uint64 h = HashVector(UniqueTestVectors[i]);

      map<uint64,char>::iterator it = m.find(h);

      if (it == m.end())
      {
        m[h] = 0;
        break;
      }
    }
  }
}

bool GetNextVector(Vector& v)
{
  static int count = 0;
  v = UniqueTestVectors[count % UniqueVectorCount];
  return ++count <= TotalVectorCount;
}

int main()
{
  srand(time(0));

  cout << "Generating " << UniqueVectorCount << " unique random vectors..."
       << endl;
  GenerateUniqueTestVectors();

#if 0
  for (int i = 0; i < UniqueVectorCount; i++)
    cout << UniqueTestVectors[i] << endl;
#endif

  cout << "Generating " << TotalVectorCount << " random vectors with only "
       << UniqueVectorCount << " unique..." << endl;

  map<uint64,list<Vector>> TheBigHashTable;

  int uniqCnt = 0;
  int totCnt = 0;

  Vector v;
  while (GetNextVector(v))
  {
    totCnt++;

    uint64 h = HashVector(v);

    map<uint64,list<Vector>>::iterator it = TheBigHashTable.find(h);

    if (it == TheBigHashTable.end())
    {
      // seeing vector with this hash (h) for the first time,
      // insert it into the hash table
      list<Vector> l;
      l.push_back(v);

      TheBigHashTable[h] = l;
      uniqCnt++;
    }
    else
    {
      // we've seen vectors with this hash (h) before,
      // let's see if we've already hashed this vector
      list<Vector>::iterator it;
      bool exists = false;

      for (it = TheBigHashTable[h].begin();
           it != TheBigHashTable[h].end();
           it++)
      {
        if (IdenticalVectors(*it, v))
        {
          // we've hashed this vector before
          exists = true;
          break;
        }
      }

      if (!exists)
      {
        // we haven't hashed this vector before,
        // let's do it now
        TheBigHashTable[h].push_back(v);
        uniqCnt++;
      }
    }
  }

#if 0
  cout << "Unique vectors found:" << endl;
  map<uint64,list<Vector>>::iterator it;
  for (it = TheBigHashTable.begin();
       it != TheBigHashTable.end();
       it++)
  {
    list<Vector>::iterator it2;
    for (it2 = it->second.begin();
         it2 != it->second.end();
         it2++)
      cout << *it2 << endl;
  }
#endif

  cout << "Hashed " << uniqCnt << " unique vectors out of " << totCnt << " total" << endl;

  return 0;
}

使用 12848 kB RAM 在 1.12 秒内输出 ( ideone )：

Generating 30000 unique random vectors...
Generating 1000000 random vectors with only 30000 unique...
Hashed 30000 unique vectors out of 1000000 total

现在，与更少和更短的唯一向量相同，因此可以在控制台中打印它们：

使用 3040 kB 的 RAM 在 0.14 秒内输出（ideone ）：

Generating 10 unique random vectors...
-45 75 1 -71 -83 97 10 -18 89 -10 
-11 60 18 -54 -90 77 19 -90 -7 -31 
-15 -65 -47 88 25 -56 4 39 -20 39 
-64 -14 -37 -13 15 -70 -66 -75 12 73 
-35 -99 32 83 98 -8 59 16 2 -98 
86 37 -63 -62 24 62 -68 78 -50 -38 
17 -64 48 80 -26 -87 61 8 -62 -28 
-70 -47 -27 62 86 -29 -97 44 37 -45 
-4 -28 92 -17 -40 -35 -56 -58 -57 -55 
5 10 -19 -48 -61 5 -35 100 -88 -47 
Generating 1000000 random vectors with only 10 unique...
Unique vectors found:
86 37 -63 -62 24 62 -68 78 -50 -38 
17 -64 48 80 -26 -87 61 8 -62 -28 
5 10 -19 -48 -61 5 -35 100 -88 -47 
-4 -28 92 -17 -40 -35 -56 -58 -57 -55 
-11 60 18 -54 -90 77 19 -90 -7 -31 
-15 -65 -47 88 25 -56 4 39 -20 39 
-35 -99 32 83 98 -8 59 16 2 -98 
-45 75 1 -71 -83 97 10 -18 89 -10 
-64 -14 -37 -13 15 -70 -66 -75 12 73 
-70 -47 -27 62 86 -29 -97 44 37 -45 
Hashed 10 unique vectors out of 1000000 total

c++ - 从整数向量列表中删除重复项的快速方法

5 回答 5

Related

Reference