
I am starting a discussion that I hope will become the one place to compare loading data through mutators vs. loading from a flat file via LOAD DATA INFILE.

I have been baffled trying to get any reasonable performance out of mutators (batch size = 1000, 10000, 100K, etc. makes no substantial difference).

My project involves loading close to 400 million rows of social media data into Hypertable for real-time analytics. It took close to 3 days to load just 1 million rows (code sample below), and each row is approximately 32 bytes. To avoid spending 2-3 weeks loading at that rate, I prepared a flat file of rows and used the LOAD DATA INFILE method instead. The performance gain was amazing: the loading rate was 368,336 cells/sec.
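
In case it is useful to others: if I remember the format correctly, the flat file is just Hypertable's tab-delimited .tsv load format, an optional '#' header line followed by one row/column/value triple per line (the names below are made up, not my real schema):

#row	column	value
u12345	info:name	some-user
u12345	graph:follows	u67890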

See below for an actual snapshot of the loads:

hypertable> LOAD DATA INFILE "/data/tmp/users.dat" INTO TABLE users;


Loading 7,113,154,337 bytes of input data...                    

0%   10   20   30   40   50   60   70   80   90   100%          
|----|----|----|----|----|----|----|----|----|----|             
***************************************************             
Load complete.                                                  

 Elapsed time:  508.07 s                                       
 Avg key size:  8.92 bytes                                     
  Total cells:  218976067                                      
   Throughput:  430998.80 cells/s                              
      Resends:  2210404                                        


hypertable> LOAD DATA INFILE "/data/tmp/graph.dat" INTO TABLE graph;

Loading 12,693,476,187 bytes of input data...                    

0%   10   20   30   40   50   60   70   80   90   100%           
|----|----|----|----|----|----|----|----|----|----|
***************************************************              
Load complete.                                                   

 Elapsed time:  1189.71 s                                       
 Avg key size:  17.48 bytes                                     
  Total cells:  437952134                                       
   Throughput:  368118.13 cells/s                               
      Resends:  1483209 

Why is the performance difference between the two methods so vast, and what is the best way to improve mutator performance? Sample mutator code is below:

my $batch_size = 1000000; # 1000 or 10000 makes no substantial difference
my $ignore_unknown_cfs = 2;
my $ht = new Hypertable::ThriftClient($master, $port);
my $ns = $ht->namespace_open($namespace);
my $users_mutator = $ht->mutator_open($ns, 'users', $ignore_unknown_cfs, 10);
my $graph_mutator = $ht->mutator_open($ns, 'graph', $ignore_unknown_cfs, 10);

# for each parsed input row ('users' shown; 'graph' is handled the same way):
my $key = new Hypertable::ThriftGen::Key({ row => $row,
                                           column_family => $cf,
                                           column_qualifier => $cq });
my $cell = new Hypertable::ThriftGen::Cell({ key => $key, value => $val });
$ht->mutator_set_cell($users_mutator, $cell);
$ht->mutator_flush($users_mutator);

I would appreciate any input on this; I don't have a tremendous amount of Hypertable experience.

Thanks.


1 Answer


If it is taking three days to load a million rows, then you are probably calling flush() after every row insert, which is not the right way to do it. Before I describe how to fix that: your mutator_open() arguments are not quite right. You don't need to specify ignore_unknown_cfs, and you should supply 0 for the flush_interval, like this:

my $users_mutator = $ht->mutator_open($ns, 'users', 0, 0);
my $graph_mutator = $ht->mutator_open($ns, 'graph', 0, 0);

You should only call mutator_flush() if you want to checkpoint how much of the input data has been consumed. A successful call to mutator_flush() means that all of the data inserted into that mutator has made it durably into the database. If you are not keeping track of how much input has been consumed, there is no need to call mutator_flush() at all, because the mutator is flushed automatically when you close it.
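
For example, a minimal sketch of that checkpointing pattern (read_batch() and save_checkpoint() are hypothetical stand-ins for your own input reader and progress log):

my $batches = 0;
while (my $cells = read_batch()) {              # hypothetical: returns an arrayref of Cell objects, undef at EOF
    $ht->mutator_set_cells($users_mutator, $cells);
    if (++$batches % 100 == 0) {
        $ht->mutator_flush($users_mutator);     # everything inserted so far is now durable
        save_checkpoint($batches);              # hypothetical: record how much input was consumed
    }
}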

The next performance problem I see in this code is the use of mutator_set_cell(). You should use either mutator_set_cells() or mutator_set_cells_as_arrays(), because each method call is a round-trip to the ThriftBroker, which is expensive; with the mutator_set_cells_* methods that round-trip is amortized over many cells. The mutator_set_cells_as_arrays() method is more efficient in languages where object construction overhead is large compared to native datatypes (e.g. strings). I'm not sure about Perl, but you might want to try it and see if it improves performance.
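
In other words, buffer cells locally and send them in batches. A sketch, assuming an arbitrary batch size of 10,000 and reusing the names from your code:

# inside your per-row loop: build the cell, but don't send it yet
push @buffer, new Hypertable::ThriftGen::Cell({
    key   => new Hypertable::ThriftGen::Key({ row => $row,
                                              column_family => $cf,
                                              column_qualifier => $cq }),
    value => $val });
if (@buffer >= 10_000) {
    $ht->mutator_set_cells($users_mutator, \@buffer);  # one round-trip for the whole batch
    @buffer = ();
}

If mutator_set_cells_as_arrays() turns out to be faster in Perl, the buffer would hold plain arrayrefs instead of Cell objects (to the best of my recollection, [row, column, value], with the column written as 'family:qualifier').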

Also, be sure to call mutator_close() when you are finished with the mutator.
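
Putting it all together: open the mutator with mutator_open($ns, 'users', 0, 0), accumulate cells, call mutator_set_cells() once per batch, and finish with:

$ht->mutator_set_cells($users_mutator, \@buffer) if @buffer;  # send the final partial batch
$ht->mutator_close($users_mutator);  # flushes anything still buffered, then releases the mutator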

Answered 2013-07-31T18:52:29.750