
(Yes, I know there is a question with almost the same title, but the answer was not satisfactory, see below)

EDIT Sorry, the original question didn't use compiler optimization. This is now fixed, but to avoid trivial optimization and to come closer to my actual use case, the test has been split into two compilation units.

The fact that the constructor of std::vector<> has linear complexity is a nuisance when it comes to performance-critical applications. Consider this simple code

// compilation unit 1:
void set_v0(type*x, size_t n)
{
  for(size_t i=0; i<n; ++i)
    x[i] = simple_function(i);
}

// compilation unit 2:
std::vector<type> x(n);                     // default initialisation is wasteful
set_v0(x.data(),n);                         // over-writes initial values

in which case a significant amount of time is wasted constructing x. The conventional way around this, as explored by this question, seems to be to merely reserve the storage and use push_back() to fill in the data:

// compilation unit 1:
void set_v1(std::vector<type>&x, size_t n)
{
  x.reserve(n);
  for(size_t i=0; i<n; ++i)
    x.push_back(simple_function(i));
}

// compilation unit 2:
std::vector<type> x; x.reserve(n);          // no initialisation
set_v1(x,n);                                // using push_back()

However, as indicated by my comment, push_back() is inherently slow, making this second approach actually slower than the first one for objects with sufficiently simple construction, such as size_t. For

simple_function = [](size_t i) { return i; };

I get the following timings (using gcc 4.8 with -O3; clang 3.2 produced ~10% slower code)

timing vector::vector(n) + set_v0();
n=10000 time: 3.9e-05 sec
n=100000 time: 0.00037 sec
n=1000000 time: 0.003678 sec
n=10000000 time: 0.03565 sec
n=100000000 time: 0.373275 sec

timing vector::vector() + vector::reserve(n) + set_v1();
n=10000 time: 1.9e-05 sec
n=100000 time: 0.00018 sec
n=1000000 time: 0.00177 sec
n=10000000 time: 0.020829 sec
n=100000000 time: 0.435393 sec

The speed-up that would be possible if one could elide the default construction of the elements can be estimated with the following cheating version

// compilation unit 2
std::vector<type> x; x.reserve(n);          // no initialisation
set_v0(x.data(),n);                         // error: write beyond end of vector
                                            // note: vector::size() == 0

which gives

timing vector::vector + vector::reserve(n) + set_v0();          (CHEATING)
n=10000 time: 8e-06 sec
n=100000 time: 7.2e-05 sec
n=1000000 time: 0.000776 sec
n=10000000 time: 0.01119 sec
n=100000000 time: 0.298024 sec

So, my first question: Is there any legal way to use a standard library container which would give these latter timings? Or do I have to resort to managing the memory myself?

Now, what I really want is to use multi-threading to fill in the container. The naive code (using OpenMP in this example for simplicity, which excludes clang for the moment)

// compilation unit 1
void set_v0(type*x, size_t n)
{
#pragma omp for                       // only difference to set_v0() from above 
  for(size_t i=0; i<n; ++i)
    x[i] = simple_function(i);
}

// compilation unit 2:
std::vector<type> x(n);               // default initialisation not multi-threaded
#pragma omp parallel
set_v0(x.data(),n);                   // over-writes initial values in parallel

now suffers from the fact that the default initialisation of all elements is not multi-threaded, resulting in a potentially serious performance degradation. Here are the timings for this OpenMP version of set_v0() and an equivalent cheating method (using my MacBook's Intel i7 chip with 4 cores, 8 hyperthreads):

timing std::vector::vector(n) + omp parallel set_v0()
n=10000 time: 0.000389 sec
n=100000 time: 0.000226 sec
n=1000000 time: 0.001406 sec
n=10000000 time: 0.019833 sec
n=100000000 time: 0.35531 sec

timing vector::vector + vector::reserve(n) + omp parallel set_v0(); (CHEATING)
n=10000 time: 0.000222 sec
n=100000 time: 0.000243 sec
n=1000000 time: 0.000793 sec
n=10000000 time: 0.008952 sec
n=100000000 time: 0.089619 sec

Note that the cheat version is ~3.3 times faster than the serial cheat version, roughly as expected, but the standard version is not.

So, my second question: Is there any legal way to use a standard library container which would give these latter timings in multi-threaded situations?

PS. I found this question, where std::vector is tricked into avoiding the default initialization by providing it with an uninitialized_allocator. This is no longer standard-compliant, but works very well for my test case (see my own answer below and this question for details).


4 Answers


Using g++ 4.5, by constructing directly from a generator I was able to realise a roughly 20% reduction in running time over v0 (from 1.0 s to 0.8 s), and a somewhat smaller reduction over v1 (from 0.95 s to 0.8 s):

#include <iterator>
#include <vector>

struct Generator : public std::iterator<std::forward_iterator_tag, int>
{
    explicit Generator(int start) : value_(start) { }
    void operator++() { ++value_; }
    int operator*() const { return value_; }

    bool operator!=(Generator other) const { return value_ != other.value_; }

    int value_;
};

int main()
{
    const int n = 100000000;
    std::vector<int> v(Generator(0), Generator(n));

    return 0;
}
answered 2013-04-11 15:48

Okay, here is what I have learned since asking this question.

Q1 (Is there any legal way to use a standard library container which would give these latter timings?): Yes, to some degree, as shown in the answers of Mark and Evgeny. Providing a generator to the constructor of std::vector elides the default construction of the elements.

Q2 (Is there any legal way to use a standard library container which would give these latter timings in multi-threaded situations?): No, I don't think so. The reason is that, at construction, any standard-compliant container must initialise its elements, in order to ensure that the calls to the elements' destructors (upon destruction or resizing of the container) are well-formed. As the std library containers do not support using multi-threading to construct their elements, the trick from Q1 cannot be replicated here, so the default construction cannot be elided.
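
To illustrate the point about destructors (this snippet is mine, not part of the original discussion): for a type with a non-trivial destructor, the container must be able to call ~T() on every element it owns, which is only well-formed if each element was actually constructed; for trivially destructible types both the construction and the destruction are no-ops and can safely be skipped, which is what option 3) below exploits.

#include <cstdlib>
#include <new>
#include <string>

void illustration()
{
    void* raw = std::malloc(sizeof(std::string));
    std::string* p = static_cast<std::string*>(raw);
    // p->~basic_string();             // undefined behaviour: *p was never constructed
    new (raw) std::string("valid");    // placement-new: now an object actually exists
    p->~basic_string();                // well-formed: destroys the constructed object
    std::free(raw);
}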

Thus, if we want to use C++ for high-performance computing, our options are somewhat limited when it comes to managing large amounts of data. We can

1) declare a container object and immediately fill it (concurrently) within the same compilation unit, and hope that the compiler optimises the initialisation at construction away;

2) resort to new[] and delete[] or even malloc() and free(), in which case all the memory management and, in the latter case, the construction of the elements are our responsibility, and the potential use of the C++ standard library is very limited;

3) trick a std::vector into not initialising its elements, by using a custom uninitialised_allocator that elides the default construction. Following the ideas of Jared Hoberock, such an allocator could look like this (see also here):

#include <memory>        // std::allocator, std::allocator_traits
#include <type_traits>   // std::is_same, std::enable_if, std::is_trivially_*

// based on a design by Jared Hoberock
// edited (Walter) 10-May-2013, 23-Apr-2014
template<typename T, typename base_allocator = std::allocator<T> >
struct uninitialised_allocator
  : base_allocator
{
  static_assert(std::is_same<T,typename base_allocator::value_type>::value,
                "allocator::value_type mismatch");

  template<typename U>
  using base_t =
    typename std::allocator_traits<base_allocator>::template rebind_alloc<U>;

  // rebind to base_t<U> for all U!=T: we won't leave other types uninitialised!
  template<typename U>
  struct rebind
  {
    typedef typename
    std::conditional<std::is_same<T,U>::value,
                     uninitialised_allocator, base_t<U> >::type other; 
  };

  // elide trivial default construction of objects of type T only
  template<typename U>
  typename std::enable_if<std::is_same<T,U>::value && 
                          std::is_trivially_default_constructible<U>::value>::type
  construct(U*) {}

  // elide trivial default destruction of objects of type T only
  template<typename U>
  typename std::enable_if<std::is_same<T,U>::value && 
                          std::is_trivially_destructible<U>::value>::type
  destroy(U*) {}

  // forward everything else to the base
  using base_allocator::construct;
  using base_allocator::destroy;
};

An uninitialised_vector<> template can then be defined like this:

template<typename T, typename base_allocator = std::allocator<T>>
using uninitialised_vector = std::vector<T,uninitialised_allocator<T,base_allocator>>;

We can still use nearly all of the standard library's functionality. Though it must be said that the uninitialised_allocator, and hence by implication the uninitialised_vector, is not standard-compliant, because its elements are not default-constructed (e.g. a vector<int> will not have all 0s after construction).
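
For illustration, the test case from the question can then be driven like this (a sketch of my own, assuming type is trivially default-constructible, e.g. size_t, and set_v0() as defined in the question):

uninitialised_vector<std::size_t> x(n);     // allocates n elements, but elides their construction
set_v0(x.data(),n);                         // fills them in; note x.size() == n
// the multi-threaded variant works the same way: construct, then fill in parallel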

When using this tool for my little test problem, I get excellent results:

timing vector::vector(n) + set_v0();
n=10000 time: 3.7e-05 sec
n=100000 time: 0.000334 sec
n=1000000 time: 0.002926 sec
n=10000000 time: 0.028649 sec
n=100000000 time: 0.293433 sec

timing vector::vector() + vector::reserve() + set_v1();
n=10000 time: 2e-05 sec
n=100000 time: 0.000178 sec
n=1000000 time: 0.001781 sec
n=10000000 time: 0.020922 sec
n=100000000 time: 0.428243 sec

timing vector::vector() + vector::reserve() + set_v0();
n=10000 time: 9e-06 sec
n=100000 time: 7.3e-05 sec
n=1000000 time: 0.000821 sec
n=10000000 time: 0.011685 sec
n=100000000 time: 0.291055 sec

timing vector::vector(n) + omp parallel set_v0();
n=10000 time: 0.00044 sec
n=100000 time: 0.000183 sec
n=1000000 time: 0.000793 sec
n=10000000 time: 0.00892 sec
n=100000000 time: 0.088051 sec

timing vector::vector() + vector::reserve() + omp parallel set_v0();
n=10000 time: 0.000192 sec
n=100000 time: 0.000202 sec
n=1000000 time: 0.00067 sec
n=10000000 time: 0.008596 sec
n=100000000 time: 0.088045 sec

where there is no longer any difference between the cheating and the "legal" versions.

answered 2013-04-12 08:41

boost::transformed

For the single-threaded version, you can use boost::transformed. It has:

Returned Range Category: the range category of rng.

This means that if you feed boost::transformed a Random Access Range, it will return a Random Access Range as well, which allows vector's constructor to pre-allocate the required amount of memory.

You can use it as follows:

const auto &gen = irange(0,1<<10) | transformed([](int x)
{
    return exp(Value{x});
});
vector<Value> v(begin(gen),end(gen));

Live demo:

#define BOOST_RESULT_OF_USE_DECLTYPE 
#include <boost/range/adaptor/transformed.hpp>
#include <boost/container/vector.hpp>
#include <boost/range/irange.hpp>
#include <boost/progress.hpp>
#include <boost/range.hpp>
#include <iterator>
#include <iostream>
#include <ostream>
#include <string>
#include <vector>
#include <array>


using namespace std;
using namespace boost;
using namespace adaptors;

#define let const auto&

template<typename T>
void dazzle_optimizer(T &t)
{
    auto volatile dummy = &t; (void)dummy;
}

// _______________________________________ //

using Value = array<int,1 << 16>;
using Vector = container::vector<Value>;

let transformer = [](int x)
{
    return Value{{x}};
};
let indicies = irange(0,1<<10);

// _______________________________________ //

void random_access()
{
    let gen = indicies | transformed(transformer);
    Vector v(boost::begin(gen), boost::end(gen));
    dazzle_optimizer(v);
}

template<bool reserve>
void single_pass()
{
    Vector v;
    if(reserve)
        v.reserve(size(indicies));
    for(let i : indicies)
        v.push_back(transformer(i));
    dazzle_optimizer(v);
}

void cheating()
{
    Vector v;
    v.reserve(size(indicies));
    for(let i : indicies)
        v[i]=transformer(i);
    dazzle_optimizer(v);
}

// _______________________________________ //

int main()
{
    struct
    {
        const char *name;
        void (*fun)();
    } const tests [] =
    {
        {"single_pass, no reserve",&single_pass<false>},
        {"single_pass, reserve",&single_pass<true>},
        {"cheating reserve",&cheating},
        {"random_access",&random_access}
    };
    for(let i : irange(0,3))
        for(let test : tests)
            progress_timer(), // LWS does not support auto_cpu_timer
                (void)i,
                test.fun(),
                cout << test.name << endl;

}
answered 2013-04-11 20:20

In this case I would actually suggest rolling your own container or looking for alternatives, because the way I see it, your inherent problem is not that the standard containers default-construct their elements. It is that you are trying to use a variable-capacity container for a case where the capacity can be determined at construction.

There is no case in which the standard library unnecessarily default-constructs elements. vector only does so for its fill constructor and for resize, both of which are conceptually required of a general-purpose container, since their purpose is to resize the container so that it contains valid elements. At the same time, it is simple enough to write this yourself:

T* mem = static_cast<T*>(malloc(num * sizeof(T)));
for (int j=0; j < num; ++j)
     new (mem + j) T(...); // meaningfully construct T
...
for (int j=0; j < num; ++j)
     mem[j].~T();         // destroy T
free(mem);

...and then build an exception-safe, RAII-conforming container around the code above. And that is what I would suggest in your case, because if the default construction is wasteful enough not to be negligible in the context of a fill constructor, to the point where the alternatives of reserve plus push_back or emplace_back are likewise inadequate, then chances are that even a container that treats its capacity and size as run-time variables carries non-negligible overhead, at which point you are more than justified in looking for something else, including rolling your own from the concept above.
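
For concreteness, a minimal sketch of such a wrapper could look like the following (the name fixed_buffer and its interface are my own illustration, not something proposed in this answer): it owns malloc'ed storage of a fixed capacity, constructs elements only on demand, and destroys exactly those elements that were constructed.

#include <cstdlib>      // std::malloc, std::free
#include <cstddef>      // std::size_t
#include <new>          // placement new, std::bad_alloc
#include <utility>      // std::forward

template<typename T>
class fixed_buffer
{
    T*          ptr_;    // raw, uninitialised storage
    std::size_t size_;   // number of elements constructed so far

public:
    explicit fixed_buffer(std::size_t capacity)
      : ptr_(static_cast<T*>(std::malloc(capacity * sizeof(T)))), size_(0)
    { if (capacity && !ptr_) throw std::bad_alloc(); }

    fixed_buffer(const fixed_buffer&) = delete;
    fixed_buffer& operator=(const fixed_buffer&) = delete;

    ~fixed_buffer()
    {
        while (size_) ptr_[--size_].~T();   // destroy only what was constructed
        std::free(ptr_);
    }

    // construct the next element in place; the caller must not exceed the capacity
    template<typename... Args>
    void emplace_back(Args&&... args)
    { ::new (static_cast<void*>(ptr_ + size_)) T(std::forward<Args>(args)...); ++size_; }

    T*          data()        { return ptr_; }
    const T*    data()  const { return ptr_; }
    std::size_t size()  const { return size_; }
};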

The standard library is extremely efficient, hard to beat in an apples-to-apples comparison, but what you need in this case is an orange, not an apple. And in such cases it becomes easier to reach straight for an orange than to try to turn an apple into an orange.

answered 2017-12-30 08:14