
I have a set of strings that is modified inside a loop of 25k iterations. It starts out empty, and in each cycle 0-200 strings are randomly added to or removed from it. At the end, the set contains about 80k strings.
I want to make the loop resumable: the set should be saved to disk after each cycle and loaded again on resume.
What library can I use? The raw data amounts to ~16M, but the changes per cycle are usually small, so I don't want the whole store rewritten on each iteration.

Since the strings are paths, I'm thinking of storing them in a log file like this:

+a
+b
commit
-b
+d
commit

On startup the file is loaded into a hash and then compacted. If the file does not end with a commit line, the last (incomplete) block is ignored.
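The loading rule described above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `load_log` and the temporary demo file are hypothetical, and changes are buffered until a `commit` line is seen so that an interrupted trailing block never reaches the set.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Replay a change log and return the committed set as a hash reference.
# Lines are "+path", "-path" or "commit"; anything after the last
# "commit" is treated as an interrupted cycle and ignored.
sub load_log {
    my ($file) = @_;
    my %set;
    my @pending;    # operations seen since the last commit
    open(my $fh, '<', $file) or die "cannot open $file: $!";
    while (my $line = <$fh>) {
        last unless $line =~ /\n\z/;    # torn final line: stop reading
        chomp $line;
        if ($line eq 'commit') {
            # Apply the buffered operations in the order they were logged.
            for my $op (@pending) {
                my ($sign, $path) = @$op;
                if ($sign eq '+') { $set{$path} = 1 }
                else              { delete $set{$path} }
            }
            @pending = ();
        }
        elsif ($line =~ /\A([+-])(.*)\z/s) {
            push @pending, [ $1, $2 ];
        }
    }
    close($fh);
    return \%set;
}

# Demo: the log from the question plus an uncommitted "+e".
my ($tmp, $name) = tempfile(UNLINK => 1);
print {$tmp} "+a\n+b\ncommit\n-b\n+d\ncommit\n+e\n";
close($tmp) or die "cannot close $name: $!";
my $set = load_log($name);
print join(",", sort keys %$set), "\n";    # prints "a,d"
```

Buffering until `commit` means no separate rollback step is needed on load; the cost is holding one cycle's worth of operations in memory, which is at most a few hundred entries here.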


2 Answers


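The Storable package brings persistence to your Perl data structures (SCALAR, ARRAY, HASH or REF objects), i.e. anything that can be conveniently stored to disk and retrieved at a later time.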
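A minimal sketch of what that might look like for a set kept as hash keys. The file name `paths.sto` and the sample paths are made up for illustration. Note that `nstore` serialises the entire hash on every call, so this approach does rewrite the whole store each cycle.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/paths.sto";    # hypothetical file name

# The set is represented as the keys of a hash.
my %paths = map { $_ => 1 } qw(/tmp/a /tmp/b);

# Save: writes the complete structure in network byte order.
nstore(\%paths, $file);

# Resume: retrieve() returns a reference to the stored hash.
my $restored = retrieve($file);
print join(",", sort keys %$restored), "\n";    # prints "/tmp/a,/tmp/b"
```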

answered 2013-05-02T08:51:59.710

I decided to put away the heavy artillery and write something simple:

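package LoL::IMadeADb;

use strict;
use warnings;
use File::Basename qw(dirname);   # dirname() is used by save()
use File::Copy     qw(move);      # move() replaces the log after compaction
use File::Temp     qw(tempfile);  # tempfile() creates the compaction target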

sub new {
  my $self;
  ( my $class, $self->{dbname} )  = @_;
  # open for reading and appending; create the file if it does not exist
  #msg "open $self->{dbname}";
  open(my $fd, "+>>", $self->{dbname}) or die "cannot open $self->{dbname}: $!";
  seek($fd, 0, 0);
  $self->{fd} = $fd;
  #msg "opened";
  $self->{paths} = {};
  my $href = $self->{paths};

  $self->{nlines} = 0;
  my $lastcommit = 0;
  my ( $c, $rest );
  while(defined($c = getc($fd)) && substr(($rest = <$fd>), -1) eq "\n") {
    $self->{nlines}++;
    chomp($rest);
    if ($c eq "c") {
      $lastcommit = tell($fd);
      #msg "lastcommit: " . $lastcommit;
    } elsif ($c eq "+") {
      $href->{$rest} = undef;
    } elsif ($c eq "-") {
      delete $href->{$rest};
    }
    #msg "line: '" . $c . $rest . "'";
  }
  if ($lastcommit < tell($fd)) {
    print STDERR "rolling back incomplete file: " . $self->{dbname} . "\n";
    seek($fd, $lastcommit, 0);
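    # Collect the complete uncommitted lines first, then undo them newest-first,
    # so that "+x" followed by "-x" in the same block restores the original state.
    my @pending;
    while(defined($c = getc($fd)) && substr(($rest = <$fd>), -1) eq "\n") {
      $self->{nlines}--;
      chomp($rest);
      push @pending, [ $c, $rest ];
    }
    for my $op (reverse @pending) {
      if ($op->[0] eq "+") {
        delete $href->{ $op->[1] };
      } else {
        $href->{ $op->[1] } = undef;
      }
    }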
    truncate($fd, $lastcommit) or die "cannot truncate $self->{dbname}: $!";
    print STDERR "rolling back incomplete file; done\n";
  }
  #msg "entries = " . (keys( %{ $href })+0) . ", nlines = " . $self->{nlines} . "\n";
  bless $self, $class
}

sub add {
  my ( $self , $path ) = @_;
  if (!exists $self->{paths}{$path}) {
    $self->{paths}{$path} = undef;
    print { $self->{fd} } "+" . $path . "\n";
    $self->{nlines}++;
    $self->{changed} = 1;
  }
  undef
}

sub remove {
  my ( $self , $path ) = @_;
  if (exists $self->{paths}{$path}) {
    delete $self->{paths}{$path};
    print { $self->{fd} } "-" . $path . "\n";
    $self->{nlines}++;
    $self->{changed} = 1;
  }
  undef
}

sub save {
  my ( $self ) = @_;
  return undef unless $self->{changed};
  my $fd = $self->{fd};
  my @keys = keys %{$self->{paths}};
  if ( $self->{nlines} - @keys > 5000 ) {
    #msg "compacting";
    close($fd);
    my $bkpdir = dirname($self->{dbname});
    ($fd, my $bkpname) = tempfile(DIR => $bkpdir , SUFFIX => ".tmp" ) or die "cannot create backup file in: $bkpdir: $!";
    $self->{nlines} = 1;
    for (@keys) {
      print { $fd } "+" . $_ . "\n" or die "cannot write backup file: $!";
      $self->{nlines}++;
    }
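    print { $fd } "c\n" or die "cannot write backup file: $!";
    # buffered write errors only surface on close, so check it
    close($fd) or die "cannot close backup file: $!";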
    move($bkpname, $self->{dbname})
      or die "cannot rename " . $bkpname . " => " . $self->{dbname} . ": $!";
    open($self->{fd}, ">>", $self->{dbname}) or die "cannot reopen $self->{dbname}: $!";
  } else {
    print { $fd } "c\n";
    $self->{nlines}++;

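    # flush: making $| true on the selected handle forces an immediate flush;
    # decrementing it afterwards restores normal buffering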
    my $previous_default = select($fd);
    $| ++;
    $| --;
    select($previous_default);
  }
  $self->{changed} = 0;
  #print "entries = " . (@keys+0) . ", nlines = " . $self->{nlines} . "\n";
  undef
}
1;
answered 2013-05-02T17:09:35.423