c++ - Alternatives to a vector of references, to avoid copying large data

Question

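I spent some time looking for an answer but did not find anything satisfactory.

I am just interested in how more experienced C++ people solve this kind of problem, since I am now doing more production-related coding than prototyping.

Suppose you have a class that contains an unordered_map (hash map) holding a lot of data, say 500 MB. You want to write an accessor that returns some subset of that data efficiently.

Take the following, where BigData is some class that stores a moderate amount of data.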

class A
{
   private:
      unordered_map<string, BigData> m_map;   // lots of data

   public:

    vector<BigData>   get10BestItems()
    {
        vector<BigData>  results;
        for ( ........  // iterate over m_map and add 10 best items to results
        // ... 
       return results;
    }

};

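The accessor get10BestItems in this code is not very efficient, because it first copies the items into the results vector, and then copies the results vector again when the function returns (copying it off the function's stack).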
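One note on the second copy described above: in C++11 and later, a local vector returned by value is moved or elided, so its elements are generally not copied again on return. The sketch below counts copies to illustrate this. CountedData and takeFirst are hypothetical names introduced only for this illustration, and the "first n items" selection is a placeholder for whatever "best" means in the real code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for BigData that counts how often it is copy-constructed.
struct CountedData {
    static int copies;
    int value = 0;
    CountedData() = default;
    explicit CountedData(int v) : value(v) {}
    CountedData(const CountedData& other) : value(other.value) { ++copies; }
    CountedData(CountedData&& other) noexcept : value(other.value) {}
    CountedData& operator=(const CountedData&) = default;
    CountedData& operator=(CountedData&&) noexcept = default;
};
int CountedData::copies = 0;

// Copies each selected element once into the result, then returns it by value.
std::vector<CountedData> takeFirst(const std::vector<CountedData>& source, std::size_t n) {
    std::vector<CountedData> results;
    results.reserve(n);  // avoid reallocation while filling
    for (std::size_t i = 0; i < n && i < source.size(); ++i) {
        results.push_back(source[i]);  // one copy per selected element
    }
    return results;  // moved or elided; the elements are not copied again
}
```

The first copy (from the container into the result) is still there, which is the cost the pointer-based and reference-based options in this thread try to avoid.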

For various reasons you cannot have a vector of references in C++, which would otherwise be the obvious answer:

vector<BigData&> results;     // vector can't contain references.

You could create the results vector on the heap and return a reference to it:

vector<BigData>&   get10BestItems()    // returns a reference to the vector
    {
        vector<BigData>*  results = new vector<BigData>;   // allocate on the heap
        for ( ........  // iterate over m_map and add 10 best items to results
            // ... 
       return *results;   // the caller is now responsible for deleting the vector
    }

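However, you will run into memory leaks if you are not careful. It is also slow (heap memory) and still copies the data from the map into the vector.

So we could fall back on C-style coding and just use pointers: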

vector<BigData*>   get10BestItems()    // returns a vector of pointers
    {
        vector<BigData*>  results ; // vectors of pointers
        for ( ........  // iterate over m_map and add 10 best items to results
        // ... 
       return results;  
    }

But most sources say not to use pointers unless absolutely necessary. There are options such as smart pointers and Boost's ptr_vector, but I would rather avoid those if I can.

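I expect the map to be static, so I am not too worried about bad pointers. The only issue is that the calling code has to be written differently to handle pointers. Stylistically, this is not pleasant: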

const BigData&   getTheBestItem()    // returns a const reference
{
       string bestID;
       for ( ........  // iterate over m_map, find bestID
       // ... 
       return m_map[bestID] ; // return a reference to the best item
}


vector<BigData*>   get10BestItems()    // returns a vector of pointers
{    
        vector<BigData*>  results ; // vectors of pointers
        for_each ........  // iterate over m_map and add 10 best items to results
        // ... 
       return results;  
 };

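For example, if you want a single item, returning a reference is easy.

The final option is simply to make the hash map public and return a vector of keys (strings in this case):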

class A
{
      public:

         unordered_map<string, BigData> m_map;   // lots of data



    vector<string>   get10BestItemKeys()
    {
        vector<string>  results;
        for (........  // iterate over m_map and add 10 best KEYS to results
        // ... 
       return results;
    }

};



A aTest;
... // load data to map

vector <string> best10 =  aTest.get10BestItemKeys();
for ( .... // iterate over all KEYs in best10
{
    aTest.m_map.find(KEY);  // do something with item.
    // ...
}
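A variation on the key-based idea, sketched here as an assumption rather than something proposed in the thread: the map can stay private if the class offers a lookup accessor that takes a key. The names findItem, add, and the score field are hypothetical and exist only for this illustration.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical payload; the real BigData is not shown in the question.
struct BigData {
    double score = 0.0;
};

class A {
public:
    // Hypothetical loader used only to populate the example.
    void add(const std::string& key, double score) { m_map[key].score = score; }

    // Returns a pointer to the item stored under key, or nullptr if the key is absent.
    const BigData* findItem(const std::string& key) const {
        auto it = m_map.find(key);
        return it == m_map.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::string, BigData> m_map;  // lots of data, kept private
};
```

Each lookup still pays for one hash of the key, so this trades a little speed for keeping the container encapsulated.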

What is the best solution? Speed matters, but I also want ease of development and safe programming practices.

score 3 · Accepted Answer

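If the map is constant, I would just use a vector of pointers. If you want to prevent the data from being changed, you can always return pointers to const.

References are great when they work, but there is a reason we still have pointers (to me, this falls under the category of "necessary").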
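A minimal sketch of this suggestion follows, under stated assumptions: BigData has a hypothetical score field that defines "best", add is a hypothetical loader, and the accessor is generalized to take the count n. None of these appear in the original question.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical payload; the real BigData and its ranking are not shown in the question.
struct BigData {
    double score = 0.0;
};

class A {
public:
    // Hypothetical loader used only to populate the example.
    void add(const std::string& key, double score) { m_map[key].score = score; }

    // Returns up to n pointers to the highest-scoring items, best first.
    // The pointers stay valid only while the pointed-to entries remain in m_map.
    std::vector<const BigData*> getBestItems(std::size_t n) const {
        std::vector<const BigData*> results;
        results.reserve(m_map.size());
        for (const auto& entry : m_map) {
            results.push_back(&entry.second);  // store an address, not a copy
        }
        const std::size_t keep = std::min(n, results.size());
        std::partial_sort(results.begin(),
                          results.begin() + static_cast<std::ptrdiff_t>(keep),
                          results.end(),
                          [](const BigData* a, const BigData* b) { return a->score > b->score; });
        results.resize(keep);  // drop everything past the best n
        return results;
    }

private:
    std::unordered_map<std::string, BigData> m_map;  // lots of data
};
```

Only pointers are sorted and returned, so no BigData object is copied. Pointers to unordered_map elements survive rehashing and are invalidated only when the element itself is erased, which fits the "map is constant" condition above.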

score 2 · Accepted Answer

I would do something like the following:

class A
{
private:
    unordered_map<string, BigData> m_map;   // lots of data
    vector<BigData*> best10;

public:
    A()
        : best10(10)
    {
        // Other constructor stuff
    }

    const vector<BigData*>&   get10BestItems()
    {
       // Set best10[0] through best10[9] with the pointers to the best 10
       return best10;
    }

};

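A few things to note:

The vector is not reallocated every time, and it is returned as a const reference, so nothing is allocated or copied when you call get10BestItems.
Pointers are fine in this case. What you read about avoiding pointers probably relates to heap allocation, in which case std::unique_ptr or std::shared_ptr is now preferred.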

score 1 · Accepted Answer

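This sounds like a job for boost::ref to me. Just change your original code slightly:

// boost::ref is a function; the element type it produces is boost::reference_wrapper
typedef std::vector<boost::reference_wrapper<BigData> > BestItems;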

BestItems  get10BestItems()
    {
        BestItems  results;
        for ( ........  // iterate over m_map and add 10 best items to results
        // ... 
       return results;
    }

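Now you are, in effect, returning only a reference to each item in the returned vector, which makes it small and cheap to copy (in case the compiler cannot optimize away the return copy entirely).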
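For readers without Boost, the standard library has offered std::reference_wrapper, std::ref, and std::cref in <functional> since C++11. The sketch below swaps those in for the Boost facility. The score field, the free function bestItems, and the count parameter n are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical payload; the real BigData and its ranking are not shown in the thread.
struct BigData {
    double score = 0.0;
};

using BestItems = std::vector<std::reference_wrapper<const BigData>>;

// Returns up to n references to the highest-scoring values in map, best first.
// The references stay valid only while those entries remain in the map.
BestItems bestItems(const std::unordered_map<std::string, BigData>& map, std::size_t n) {
    BestItems results;
    results.reserve(map.size());
    for (const auto& entry : map) {
        results.push_back(std::cref(entry.second));  // wrap a reference, no copy
    }
    const std::size_t keep = std::min(n, results.size());
    const auto middle = results.begin() + static_cast<std::ptrdiff_t>(keep);
    std::partial_sort(results.begin(), middle, results.end(),
                      [](const BigData& a, const BigData& b) { return a.score > b.score; });
    results.erase(middle, results.end());  // resize() would need a default constructor
    return results;
}
```

A caller can iterate with `for (const BigData& item : best)`, because each wrapper converts implicitly to the reference it holds, or call `.get()` on an element explicitly.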

score 0 · Accepted Answer

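I usually use boost::range, and I have found it invaluable in many situations, especially the one you describe.

You can keep the range object around, iterate over it, and so on.

I should mention, though, that I do not know what happens if you add or remove objects between obtaining the range and using it, so you may want to check that before relying on it.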
