perl - Perl read, seek, tell and text files: reading too many bytes. Layers and newline handling

Question

I have a Perl script that analyzes a text file (which may have UNIX or Windows line endings) and stores the file offset whenever it finds something interesting.

open(my $fh, $filename);
my $groups;
my %hash;
while(<$fh>) {
   if($_ =~ /interesting/ ) {
      $hash{$groups++}{offset} = tell($fh);
   }
}
close $fh;

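Later in the script I want to generate "n" copies of the text file, each with additional content at one of the "interesting" regions. To do this, I loop over the hash of offsets: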

foreach my $group (keys %hash) {
   my $href = $hash{$group};
   my $offset = $href->{offset};

   my $top;
   open( $fh, $file);
   read( $fh, $top, $offset);
   my $bottom = do{local $/; <$fh>};
   close $fh;

   $href->{modified} = $top . "Hello World\n" . $bottom;
}

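The problem is that the read command is reading too many bytes. I suspect this is a line-ending issue, because the number of extra bytes (characters?) is the same as the line number. Checking in Notepad++, tell() returns the true offset of the point of interest, but calling read() with that offset value returns characters past the point of interest.

I tried adding binmode($fh) after the open() and before the read(). That does find the correct position in the text file, but then I get (CR + CRLF) in the output, and the text file is full of double carriage returns.

I have played with the :crlf and :bytes layers, but saw no improvement.

I'm a bit stuck!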

score 0 · Accepted Answer

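A hash whose keys are a consecutive range of integers should be an array.
You are storing a copy of the entire file for every occurrence of /interesting/.

It sounds like what you need to do is this: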

open(my $fh, '<', $filename) or die "Can't open $filename: $!";
while (<$fh>) {
  print;
  print "Hello World\n" if /interesting/;
}
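
The loop above prints to standard output. As a minimal sketch of the same streaming idea writing to a second filehandle, the version below wraps it in a helper. The name copy_with_insertions, its parameters and the sample data are assumptions made for illustration, and in-memory filehandles stand in for real files.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out, adding $extra after each matching line.
# Returns the number of insertions made.
sub copy_with_insertions {
    my ($in, $out, $pattern, $extra) = @_;
    my $count = 0;
    while (my $line = <$in>) {
        print {$out} $line;
        if ($line =~ $pattern) {
            print {$out} $extra;
            $count++;
        }
    }
    return $count;
}

# In-memory filehandles stand in for real files (sample data is made up).
my $input  = "alpha\ninteresting one\nbeta\ninteresting two\n";
my $output = '';
open my $in,  '<', \$input  or die "Can't open input string: $!";
open my $out, '>', \$output or die "Can't open output string: $!";
my $count = copy_with_insertions($in, $out, qr/interesting/, "Hello World\n");
close $in;
close $out;
```

Nothing is held in memory beyond the current line, so this avoids storing a full copy of the file per match.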

score 0 · Accepted Answer

From perldoc -f read:

read FILEHANDLE,SCALAR,LENGTH,OFFSET
read FILEHANDLE,SCALAR,LENGTH

So, when you do:

read( $fh, $top, $offset);

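your $offset is actually a length: it decides how many characters are read. read does not stop at line endings; it reads the amount you ask for. Note that perldoc describes LENGTH as a count of characters as seen through the filehandle's I/O layers, so under a :crlf layer each CR+LF pair is read as a single character, while tell reports a position in bytes.

If you want to read a line, don't use read; use: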

seek($fh, $offset, 0);
$top = <$fh>;

Is your file really full of double newlines, or are you adding one yourself with a print statement?
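
One way to make the position from tell and the length passed to read agree is to use binary mode for both passes, so that no CRLF translation happens and one character is one byte. The following is a minimal sketch under that assumption. The temporary file and its contents are made up for the example, and reusing the file's own line ending for the inserted text is a design choice of this sketch, not something the question specifies.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small sample file with Windows line endings (made-up data).
my ($tmp, $filename) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "first\r\ninteresting\r\nlast\r\n";
close $tmp or die "Can't close $filename: $!";

# Pass 1: record the byte offset just after each matching line.
open my $in, '<', $filename or die "Can't open $filename: $!";
binmode $in;                      # no CRLF translation: 1 character == 1 byte
my @offsets;
while (my $line = <$in>) {
    push @offsets, tell($in) if $line =~ /interesting/;
}

# Pass 2: split the file at the first recorded offset.
seek($in, 0, 0) or die "Can't seek $filename: $!";
my $got = read($in, my $top, $offsets[0]);
defined $got or die "Can't read $filename: $!";
my $bottom = do { local $/; <$in> };
close $in;

# Reuse the file's own line ending for the inserted text.
my $eol      = $top =~ /\r\n/ ? "\r\n" : "\n";
my $modified = $top . "Hello World" . $eol . $bottom;
```

When $modified is written out, the output handle would need binmode as well. Otherwise, on Windows, the default :crlf layer turns each "\r\n" into "\r\r\n", which is the CR + CRLF output described in the question.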

score 0 · Accepted Answer

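When the input files aren't very large, my standard way of dealing with this is to slurp the file and normalize the line endings, storing each line as an array element. I sometimes have to handle Windows (CR+LF), UNIX (LF only) and Mac (CR only) line endings in the same batch of files. The same script also needs to run correctly on all three platforms.

I usually take a cautious, belt-and-braces approach when dealing with this kind of thing. One method that should work: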

sub read_file_into_array
{
    my $file = shift;
    my ($len, $cnt, $data, @file);

    open my $fh, "<", $file         or die "Can't read $file: $!";
    binmode $fh;                    # raw bytes: keeps \r intact and makes $cnt match $len
    seek $fh, 0, 2                  or die "Can't seek $file: $!";
    $len = tell $fh;
    seek $fh, 0, 0                  or die "Can't seek $file: $!";

    $cnt = read $fh, $data, $len;
    close $fh;

    $cnt == $len or die "Attempted to read $len bytes; got $cnt";

    $data =~ s/\r\n/\n/g;       # Convert DOS line endings to UNIX
    $data =~ s/\r/\n/g;         # Convert Mac line endings to UNIX

    @file = split /\n/, $data;  # Split on UNIX line endings

    return \@file;
}

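Then process the lines in @file. For your "interesting" tags, you would store array indices rather than file offsets. An array index is essentially the line number in the original file, counting from 0 rather than 1.

To actually augment the file, rather than looping over the hash keys, why not construct a hash of line-number => thing-to-append pairs and generate the augmented file like this: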
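
As a rough sketch of storing array indices instead of file offsets, the snippet below normalizes a made-up string with mixed line endings and collects the 0-based indices of matching lines. The sample data and variable names are assumptions, and the single substitution s/\r\n?/\n/g is a compact stand-in for the two-step conversion used above.

```perl
use strict;
use warnings;

# Made-up sample data containing DOS, UNIX and old Mac line endings.
my $data = "one\r\ninteresting two\nthree\rinteresting four\n";

# Normalize CR+LF and lone CR to LF in one pass, then split into lines.
(my $normalized = $data) =~ s/\r\n?/\n/g;
my @lines = split /\n/, $normalized;

# Indices (0-based line numbers) of the lines that match.
my @interesting = grep { $lines[$_] =~ /interesting/ } 0 .. $#lines;
```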

sub generate_augmented_file
{
    my $file   = shift @_;   # array ref
    my $extras = shift @_;   # hash ref of line => extra pairs
    my $text = '';

    foreach my $line ( 0 .. $#{$file} )   # $file is an array ref, so use its last index
    {
        $text .= $file->[$line];
        $text .= $extras->{$line} if defined $extras->{$line};
        $text .= "\n";
    }

    return $text;
}
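
The question asks for "n" copies, one per interesting region. A possible sketch of that, working on an array of already-normalized lines, is shown below. Note that it uses a different technique from the function above: it inserts the extra text as a new line with splice, rather than appending to the matching line. The sample lines and indices are made up for the example.

```perl
use strict;
use warnings;

# Made-up input: normalized lines and the 0-based indices of matching lines.
my @lines       = ('one', 'interesting two', 'three', 'interesting four');
my @interesting = (1, 3);

# Build one augmented copy per interesting line.
my @copies;
for my $index (@interesting) {
    my @copy = @lines;
    splice @copy, $index + 1, 0, 'Hello World';   # insert after the match
    push @copies, join("\n", @copy) . "\n";
}
```

Each element of @copies is then a complete file body with UNIX line endings, ready to be written out.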
