
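I've run into a situation where I have to batch-process a set of records into a database. I'd like to know how to do this with the standard algorithms.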

Given 10002 records, I want to split them into bins of 100 records for processing, with the remaining 2 records in a final bin.

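Hopefully the code below illustrates better what I'm trying to accomplish. I'm completely open to solutions involving iterators, lambdas, or any other modern C++ fun.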

#include <cassert>
#include <vector>
#include <algorithm>

template< typename T >
std::vector< std::vector< T > > chunk( std::vector<T> const& container, size_t chunk_size )
{
  return std::vector< std::vector< T > >();
}

int main()
{
  int i = 0;
  const size_t test_size = 11;
  std::vector<int> container(test_size);
  std::generate_n( std::begin(container), test_size, [&i](){ return ++i; } );

  auto chunks = chunk( container, 3 );

  assert( chunks.size() == 4 && "should be four chunks" );
  assert( chunks[0].size() == 3 && "first several chunks should have the ideal chunk size" );
  assert( chunks.back().size() == 2 && "last chunk should have the remaining 2 elements" );

  return 0;
}

3 Answers


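In the context of parallelization, this kind of range partitioning is common, and I've found it useful to define the concept of a range, i.e. a pair (from, to), as the smallest unit of work passed between the various routines and threads.

This is better than copying a whole sub-vector for each portion, because it needs far less memory. It is also more practical than maintaining only a single end iterator, because each range can be passed to a thread as is, with no special cases for the first or the last portion.

With that in mind, here is a simplified version of a routine I've found to work well, all in modern C++11: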

#include <algorithm>
#include <cassert>
#include <cstddef>   // std::ptrdiff_t, std::size_t
#include <iterator>  // std::distance, std::begin, std::end
#include <utility>
#include <vector>

template <typename It>
std::vector<std::pair<It,It>>
  chunk(It range_from, It range_to, const std::ptrdiff_t num)
{
  /* Aliases, to make the rest of the code more readable. */
  using std::vector;
  using std::pair;
  using std::make_pair;
  using std::distance;
  using diff_t = std::ptrdiff_t;

  /* Total item number and portion size. */
  const diff_t total
  { distance(range_from,range_to) };
  const diff_t portion
  { total / num };

  vector<pair<It,It>> chunks(num);

  It portion_end
  { range_from };

  /* Use the 'generate' algorithm to create portions. */    
  std::generate(begin(chunks),end(chunks),[&portion_end,portion]()
        {
          It portion_start
          { portion_end };

          portion_end += portion;
          return make_pair(portion_start,portion_end);
        });

  /* The last portion's end must always be 'range_to'. */    
  chunks.back().second = range_to;

  return chunks;
}

int main()
{
  using std::distance;

  int i = 0;
  const size_t test_size = 11;
  std::vector<int> container(test_size);
  std::generate_n( std::begin(container), test_size, [&i](){ return ++i; } );

  /* This is how it's used: */    
  auto chunks = chunk(begin(container),end(container),3);

  assert( chunks.size() == 3 && "should be three chunks" );
  assert( distance(chunks[0].first,chunks[0].second) == 3 && "first several chunks should have the ideal chunk size" );
  assert( distance(chunks[2].first,chunks[2].second) == 5 && "last chunk should have 5 elements" );

  return 0;
}

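It works slightly differently from what you proposed: the third argument is the number of portions, and the portion size is always rounded down, so in your example you get only 3 portions, with the last one slightly larger (rather than slightly smaller) than the others. This can easily be modified (I don't think it matters much, since the number of portions is usually far smaller than the total number of work items).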
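
As an illustration of that modification, here is a minimal sketch that takes the desired chunk size instead of the number of chunks, so the last range is the shorter one. The name chunk_by_size is hypothetical and not part of the routine above; it only assumes forward iterators.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical variant (an assumption, not the routine above):
// splits [range_from, range_to) into ranges of at most chunk_size elements.
template <typename It>
std::vector<std::pair<It, It>>
chunk_by_size(It range_from, It range_to, std::ptrdiff_t chunk_size)
{
  assert(chunk_size > 0 && "chunk size must be positive");

  const std::ptrdiff_t total = std::distance(range_from, range_to);

  std::vector<std::pair<It, It>> chunks;
  // Number of chunks, rounded up so the remainder gets its own range.
  chunks.reserve(static_cast<std::size_t>((total + chunk_size - 1) / chunk_size));

  std::ptrdiff_t remaining = total;
  while (remaining > 0)
  {
    // Never step past range_to: the last chunk may be shorter.
    const std::ptrdiff_t step = std::min(chunk_size, remaining);
    It chunk_end = std::next(range_from, step);
    chunks.emplace_back(range_from, chunk_end);
    range_from = chunk_end;
    remaining -= step;
  }

  return chunks;
}
```

With the 11-element vector from the question and a chunk size of 3, this yields four ranges, the last one holding the remaining 2 elements.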


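Remark. In my own use of range-related patterns, it quickly turned out that it is usually better to store integers (each representing a distance from .begin()) rather than iterators. The reason is that converting between these integers and actual iterators is a fast and harmless operation, whether you need an iterator or a const_iterator. When you store iterators, however, you have to decide once and for all whether to use iterator or const_iterator, which can be painful.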
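
To make the remark concrete, here is a rough sketch of the offset-based idea. The helper names chunk_offsets, sum_range and zero_range are made up for this illustration; the point is only that the same pair of offsets serves both a const and a non-const container.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <utility>
#include <vector>

using offset_range = std::pair<std::size_t, std::size_t>;

// Hypothetical helper: half-open offset ranges [first, second),
// each covering at most chunk_size elements of a container of size total.
inline std::vector<offset_range>
chunk_offsets(std::size_t total, std::size_t chunk_size)
{
  assert(chunk_size > 0 && "chunk size must be positive");

  std::vector<offset_range> ranges;
  for (std::size_t from = 0; from < total; from += chunk_size)
    ranges.emplace_back(from, std::min(from + chunk_size, total));
  return ranges;
}

// Read-only use: the offsets are turned into const_iterators on the spot.
inline long sum_range(const std::vector<int>& data, offset_range r)
{
  auto first = std::next(data.begin(), static_cast<std::ptrdiff_t>(r.first));
  auto last  = std::next(data.begin(), static_cast<std::ptrdiff_t>(r.second));
  return std::accumulate(first, last, 0L);
}

// Mutating use: the very same offsets are turned into mutable iterators.
inline void zero_range(std::vector<int>& data, offset_range r)
{
  auto first = std::next(data.begin(), static_cast<std::ptrdiff_t>(r.first));
  auto last  = std::next(data.begin(), static_cast<std::ptrdiff_t>(r.second));
  std::fill(first, last, 0);
}
```

The offsets also stay meaningful if the vector reallocates without changing size, which stored iterators would not.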

answered 2013-01-09T02:13:35.417

The problem seems to be a variation on std::for_each, where the "each" you want to operate on is an interval of your collection. Thus you would prefer to write a lambda (or function) that takes two iterators defining the start and end of each interval and pass that lambda/function to your algorithm.

Here's what I came up with...

#include <cstddef>     // size_t
#include <functional>  // std::function

template < typename Iterator >
void for_each_interval(
    Iterator begin
  , Iterator end
  , size_t interval_size
  , std::function<void( Iterator, Iterator )> operation )
{
  auto to = begin;

  while ( to != end )
  {
    auto from = to;

    auto counter = interval_size;
    while ( counter > 0 && to != end )
    {
      ++to;
      --counter;
    }

    operation( from, to );
  }
}

(I wish that std::advance would take care of the inner loop that uses counter to increment to, but unfortunately it blindly steps beyond the end [I'm tempted to write my own smart_advance template to encapsulate this]. If that would work, it would reduce the amount of code by about half!)
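
For what it's worth, here is a rough sketch of what such a helper could look like. The names bounded_advance and for_each_interval_sketch are hypothetical, and taking the callable as a template parameter instead of std::function is my own assumption to let lambdas deduce without spelling out the iterator type. (In C++20, std::ranges::advance has an overload that takes a bound and does much the same job.)

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: advance `it` by up to `n` steps, but never past `end`.
template <typename Iterator>
Iterator bounded_advance(Iterator it, Iterator end, std::size_t n)
{
  while (n > 0 && it != end)
  {
    ++it;
    --n;
  }
  return it;
}

// Sketch of the interval loop rewritten on top of bounded_advance.
// `operation` is any callable taking (Iterator from, Iterator to).
template <typename Iterator, typename Operation>
void for_each_interval_sketch(Iterator begin, Iterator end,
                              std::size_t interval_size, Operation operation)
{
  assert(interval_size > 0 && "interval size must be positive");

  while (begin != end)
  {
    Iterator to = bounded_advance(begin, end, interval_size);
    operation(begin, to);
    begin = to;
  }
}
```

The outer loop shrinks to three lines, and the only place that has to worry about running off the end is the helper.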

Now for some code to test it...

#include <algorithm>  // std::for_each
#include <iostream>
#include <vector>

int main( int argc, char* argv[] )
{
  // Some test data
  int foo[10] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
  std::vector<int> my_data( foo, foo + 10 );
  size_t const interval = 3;

  typedef decltype( my_data.begin() ) iter_t;
  for_each_interval<iter_t>( my_data.begin(), my_data.end(), interval,
    []( iter_t from, iter_t to )
    {
      std::cout << "Interval:";
      std::for_each( from, to,
        [&]( int val )
        {
          std::cout << " " << val;
        } );
      std::cout << std::endl;
    } );
}

This produces the following output, which I think represents what you want:

Interval: 0 1 2
Interval: 3 4 5
Interval: 6 7 8
Interval: 9
answered 2013-01-09T18:58:33.810

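A slightly different implementation, but one that still applies a range operation to pairs of iterators. I was also considering an implementation based on the std::partition function.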

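#include <algorithm>
#include <cassert>
#include <cstddef>     // std::ptrdiff_t, size_t
#include <functional>  // std::function
#include <iostream>
#include <iterator>    // std::distance, std::ostream_iterator
#include <vector>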

template< typename Iterator >
void sized_partition( Iterator from, Iterator to, std::ptrdiff_t partition_size, std::function<void(Iterator partition_begin, Iterator partition_end)> range_operation )
{
  auto partition_end = from;
  while( partition_end != to )
  {
    while( partition_end != to && std::distance( from, partition_end ) < partition_size )
      ++partition_end;

    range_operation( from, partition_end );
    from = partition_end;
  }
}

int main()
{
  int i = 0;
  const size_t test_size = 11;
  std::vector<int> container(test_size);
  typedef std::vector<int>::iterator int_iterator;
  std::generate_n( std::begin(container), test_size, [&i](){ return ++i; } );

  sized_partition<int_iterator>( container.begin(), container.end(), 3, []( int_iterator start_partition, int_iterator end_partition )
  {
    std::cout << "Begin: ";
    std::copy( start_partition, end_partition, std::ostream_iterator<int>(std::cout, ", ") );
    std::cout << " End" << std::endl;
  });

  return 0;
}
answered 2013-01-09T18:21:56.960