
I often process text files, comparing one file against another in an "SQL way".

DBD::CSV is obviously a good candidate, since it lets me use the power of SQL syntax on text tables. However, I work with huge text files, and DBD::CSV is useless performance-wise.

So I started writing a module that converts a CSV file into an SQLite DB and then hands back a DBD::SQLite handle I can work with. The problem is that converting a text file into an SQLite table is not very efficient either, because I can't drive the sqlite command-line shell from Perl to bulk-load the CSV file quickly (with `.import`). Instead I have to build one huge `INSERT INTO` string from the text table and execute it (executing the inserts row by row is very inefficient performance-wise, so I prefer one big insert). I'd like to avoid that, and I'm looking for a one-liner to load a CSV into SQLite from Perl.
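For what it's worth, the sqlite3 command-line shell itself can bulk-load a CSV with its `.import` command, and Perl can shell out to it via `system()`. A minimal sketch, using made-up file and table names (assumes the `sqlite3` CLI is installed; note that in csv mode `.import` into a *nonexistent* table treats the first row as column headers, so the table is created explicitly here):

```shell
# Create a small sample CSV, then bulk-load it into an SQLite table.
printf 'foo,1\nbar,2\n' > data.csv
sqlite3 mydata.sqlite \
    "CREATE TABLE mytable (name TEXT, num INT)" \
    ".mode csv" \
    ".import data.csv mytable"
# From Perl this could be driven as:
#   system('sqlite3', 'mydata.sqlite', '.mode csv', '.import data.csv mytable');
sqlite3 mydata.sqlite "SELECT COUNT(*) FROM mytable"
```

This sidesteps Perl entirely for the load step, at the cost of depending on an external binary being on the PATH.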

One more thing: I execute and pretty-print SQL queries with the following functions:

sub sql_command {
    my ($self, $str) = @_;
    my $s = $self->{_db}->prepare($str) or die $self->{_db}->errstr;
    $s->execute() or die $s->errstr;

    my $table;
    push @$table, [ map { defined $_ ? $_ : "undef" } @{ $s->{NAME} } ];
    while (my $row = $s->fetch) {
        push @$table, [ map { defined $_ ? $_ : "undef" } @$row ];
    }
    return box_format($table);
}


use Text::Table;

sub box_format {
    my $table  = shift;
    my $n_cols = scalar @{ $table->[0] };

    my $tb = Text::Table->new(\'| ', '', (\' | ', '') x ($n_cols - 1), \' |+');
    $tb->load(@$table);
    my $rule = $tb->rule(qw/- +/);
    my @rows = $tb->body();
    return $rule, shift @rows, $rule, @rows, $rule
        if @rows;
}

The sql_command subroutine takes about a minute to run (on a 6.5 MB file), which seems far longer than it should. Does anyone have a more efficient solution?

Thanks!


1 Answer


Text::CSV_XS is extremely fast; using it to handle the CSV should take care of that side of the performance problem.

There should be no need for special bulk insert code to make DBD::SQLite performant. An insert statement with bind parameters is very fast. The main trick is to turn off AutoCommit in DBI and do all the inserts in a single transaction.

use v5.10;
use strict;
use warnings;
use autodie;

use Text::CSV_XS;
use DBI;

my $dbh = DBI->connect(
    "dbi:SQLite:dbname=csvtest.sqlite", "", "",
    {
        RaiseError => 1, AutoCommit => 0
    }
);

$dbh->do("DROP TABLE IF EXISTS test");

$dbh->do(<<'SQL');
CREATE TABLE test (
    name        VARCHAR,
    num1        INT,
    num2        INT,
    thing       VARCHAR,
    num3        INT,
    stuff       VARCHAR
)
SQL

# Using bind parameters avoids having to recompile the statement every time
my $sth = $dbh->prepare(<<'SQL');
INSERT INTO test
       (name, num1, num2, thing, num3, stuff)
VALUES (?,    ?,    ?,    ?,     ?,    ?    )
SQL

my $csv = Text::CSV_XS->new or die;
open my $fh, "<", "test.csv";
while(my $row = $csv->getline($fh)) {
    $sth->execute(@$row);
}
$csv->eof;
close $fh;

$sth->finish;    
$dbh->commit;

This ran through a 5.7M CSV file in 1.5 seconds on my Macbook. The file was filled with 70,000 lines of...

"foo",23,42,"waelkadjflkajdlfj aldkfjal dfjl",99,"wakljdlakfjl adfkjlakdjflakjdlfkj"

It might be possible to make it a little faster using bind columns, but in my testing it slowed things down.
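For reference, "bind columns" here refers to Text::CSV_XS's bind_columns: you bind scalar references once, and each getline() call then fills those scalars in place instead of allocating a fresh array reference per row. A minimal sketch (the column names and the in-memory filehandle are illustrative only):

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new or die;

# Bind scalars to the three CSV columns; getline() now fills these
# in place rather than returning a new array reference per row.
$csv->bind_columns(\my ($name, $num1, $num2));

open my $fh, "<", \"foo,1,2\nbar,3,4\n" or die $!;  # in-memory handle for demo
my @out;
while ($csv->getline($fh)) {
    # In the real script the insert would go here:
    #   $sth->execute($name, $num1, $num2);
    push @out, "$name $num1 $num2";
}
close $fh;

print "$_\n" for @out;
```

Whether this beats the plain `@$row` loop depends on the workload, which matches the observation above that it can even slow things down.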

Answered 2013-03-11T11:24:34.997