
I am working on a Perl project that involves building a hash with roughly 17 million keys. This is too large to hold in memory (my laptop's RAM only holds about 10 million keys). I know the solution is to store the data on disk, but I am having trouble doing that in practice. Here is what I have tried:

DB_File

use strict;
use warnings;
use DB_File;

my $libfile = shift;
my %library;
# Tie the in-memory hash to an on-disk DB_File database.
tie %library, "DB_File", $libfile
    or die "Cannot tie $libfile: $!";

for (my $a = 1; $a < 17000000; $a++) {
    # Some code to generate key and value #
    $library{$key} = $value;
}

This gives me a segmentation fault: 11 partway through the loop, for reasons I do not understand.

BerkeleyDB

use strict;
use warnings;
use BerkeleyDB;

my $libfile = shift;
my $library = BerkeleyDB::Hash->new(
    -Filename => $libfile,
    -Flags    => DB_CREATE,
) or die "Cannot open $libfile: $BerkeleyDB::Error";

for (my $a = 1; $a < 17000000; $a++) {
    # Some code to generate key and value #
    $library->db_put($key, $value);
}

This seems to work fine for roughly the first 15 million keys, but then it slows down dramatically and finally freezes completely near the end of the loop. I don't think this is a memory problem: if I split the loop into four parts, put them in four separate programs, and run them sequentially (each adding about 4 million records to the database), the first three finish successfully, but the fourth hangs once the database reaches about 15 million keys. So it looks like BerkeleyDB may only be able to handle about 15 million hash keys???

DBM::Deep

use strict;
use warnings;
use DBM::Deep;

my $libfile = shift;
my $library = DBM::Deep->new($libfile);

for (my $a = 1; $a < 17000000; $a++) {
    # Some code to generate key and value #
    $library->put($key => $value);
}

From preliminary tests this seems to work correctly, but it is really slow: about 5 seconds per thousand keys, or roughly 22 hours to run the whole loop. I would rather avoid that if possible.

I'd be very grateful for suggestions on troubleshooting one of these packages, or ideas about other options for accomplishing the same thing.

UPDATE


2 Answers


Switching to a btree may improve performance for a HUGE BerkeleyDB that is accessed in "key sorted mode", i.e. with keys appended in sorted order. It reduces the number of disk I/O operations required.

Case study: in one case reported in news:comp.mail.sendmail, I recall that creation time for a HUGE BerkeleyDB dropped from a few hours for a hash to 20 minutes for a btree with "key sorted" appends. That was still too long, so the person switched to software capable of accessing the SQL database directly, avoiding the need to "dump" the SQL database into BerkeleyDB (virtusertable, sendmail->postfix).
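As a minimal sketch (not the poster's exact setup), the BerkeleyDB loop from the question could be switched to a btree like this; the -Cachesize figure is purely illustrative:

use strict;
use warnings;
use BerkeleyDB;

my $libfile = shift;

# Same open call as in the question, but with a btree instead of a hash.
# A larger cache (256 MB here, only an example value) also cuts disk I/O.
my $library = BerkeleyDB::Btree->new(
    -Filename  => $libfile,
    -Flags     => DB_CREATE,
    -Cachesize => 256 * 1024 * 1024,
) or die "Cannot open $libfile: $BerkeleyDB::Error";

for (my $a = 1; $a < 17000000; $a++) {
    # Some code to generate key and value #
    # The win comes when keys are generated in ascending (sorted)
    # order, so the btree appends stay close to sequential writes.
    $library->db_put($key, $value);
}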

answered 2014-02-14T17:22:59.513

You can try PostgreSQL.

First create a table with two columns, key and value; varchar will be fine.

Then, instead of inserting rows one at a time, use Pg::BulkCopy to COPY the data into the database.

I recommend copying no more than 100 MB at a time, because when a COPY command fails, PostgreSQL keeps the rows that were already written on disk, and it only reclaims that space if you VACUUM FULL the table. (One time I processed many 5 GB batches, a couple of them failed on a constraint near the end, and the disk space was never recovered on rollback.)

PS: you can also use DBD::Pg's COPY support directly: https://metacpan.org/pod/DBD::Pg#COPY-support
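For illustration only, a rough sketch of the DBD::Pg COPY route; the database name, table name, and connection details below are made-up placeholders:

use strict;
use warnings;
use DBI;

# Placeholder connection string; adjust dbname/user/password to your setup.
my $dbh = DBI->connect("dbi:Pg:dbname=library", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do("CREATE TABLE IF NOT EXISTS library (key varchar, value varchar)");

# Stream rows through COPY instead of issuing ~17 million INSERTs.
$dbh->do("COPY library (key, value) FROM STDIN");
for (my $a = 1; $a < 17000000; $a++) {
    # Some code to generate key and value #
    # COPY text format is tab-delimited; tabs, newlines and backslashes
    # inside the values would need escaping.
    $dbh->pg_putcopydata("$key\t$value\n");
}
$dbh->pg_putcopyend();

# Build the index after the bulk load, as suggested above.
$dbh->do("CREATE INDEX library_key_idx ON library (key)");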

After all the copies finish, create an index on key. If you need more speed, put Redis or memcached in front of it with an appropriate MAXMEMORY policy.
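If you add such a cache, a minimal sketch with the Redis CPAN module might look like this (the server address and the redis.conf settings are assumptions, not something prescribed above):

use strict;
use warnings;
use Redis;

# Assumes a local Redis configured roughly like:
#   maxmemory 2gb
#   maxmemory-policy allkeys-lru
# so the cache evicts old keys instead of growing without bound.
my $redis = Redis->new(server => '127.0.0.1:6379');

my ($key, $value) = ('some_key', 'some_value');   # placeholder data
$redis->set($key => $value);      # cache a freshly computed value
my $cached = $redis->get($key);   # undef on a cache miss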

answered 2014-02-14T18:36:58.587