regex - Perl script to split a file, process the parts, then concatenate them based on size and a string

Question

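Can someone help me implement the following logic in Perl? I am using Windows 7.

C:\script>perl split_concatenate.pl large_file a or b (the input will be the large file and the value a or b, used later for processing).

Check whether the file is larger than 40KB (or some such size). If it is not, and the choice is "a", run the command

command -i large_file.txt -o large_file_new -a

else if the choice is b

command -i large_file.txt -o large_file_new -b

else

If it is larger than 40KB, split the file at around every 40KB (the first piece will be part1) and append the first "specific string" from file_part2 to part1. If there are multiple "specific strings", save them for processing and then create the subsequent files, each of which should end with the next "specific string" found in the following part. (The "specific string" always starts with a certain string but ends with different values.) So the script should check whether there are more "specific strings" around part 2 and append the first one available. If only one is available, nothing extra is needed, just split, since the file should always end with a specific string.

Then process the parts with the same command

command -i filepart1.txt -o filepart1.dat -a command -i filepart2.txt -o filepart2.dat (if needed) -a

or

command -i filepart1.txt -o filepart1.dat -b command -i filepart2.txt -o filepart2.dat (if needed) -b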
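One way to run the per-part commands from Perl is `system` with a list of arguments, which sidesteps shell quoting. This is only a sketch: `command` is the placeholder name used in the question, `build_command` and `process_parts` are hypothetical helper names, and the `.txt` to `.dat` naming simply follows the example above.

```perl
use strict;
use warnings;

# 'command' is the placeholder from the question, not a real program.
my $TOOL = 'command';

# Build the argument list for one part, e.g. filepart1.txt -> filepart1.dat.
sub build_command {
    my ($input, $mode) = @_;
    my $output = $input;
    # Swap a trailing .txt for .dat, or just add .dat if there is none.
    $output =~ s/\.txt\z/.dat/ or $output .= '.dat';
    return ($TOOL, '-i', $input, '-o', $output, "-$mode");
}

# Run the tool on every part in order, stopping at the first failure.
sub process_parts {
    my ($mode, @parts) = @_;
    for my $part (@parts) {
        my @cmd = build_command($part, $mode);
        system(@cmd) == 0
            or die "'@cmd' failed (status $?)\n";
    }
    return;
}

# Show the command lines that would be run, without executing anything.
print join(' ', build_command($_, 'a')), "\n"
    for ('filepart1.txt', 'filepart2.txt');
```

Calling `process_parts('a', 'filepart1.txt', 'filepart2.txt')` would then run the real tool once per part.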

After this, the results need to be concatenated.

Concatenate filepart1.dat + filepart2.dat + filepartN = large_file.dat
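The concatenation step could look roughly like this. It is a sketch rather than a finished tool: `concatenate_files` is a hypothetical name, and `binmode` is used on the assumption that the `.dat` files should be copied byte for byte (on Windows, text mode would otherwise translate line endings).

```perl
use strict;
use warnings;

# Append each source file, in the order given, to $target.
sub concatenate_files {
    my ($target, @sources) = @_;
    open(my $out, '>', $target) or die "Cannot write '$target': $!\n";
    binmode $out;
    for my $source (@sources) {
        open(my $in, '<', $source) or die "Cannot read '$source': $!\n";
        binmode $in;
        my $buffer;
        # Copy in 64 KB chunks so large parts are never held in memory whole.
        while (read($in, $buffer, 64 * 1024)) {
            print {$out} $buffer;
        }
        close $in;
    }
    close $out or die "Cannot close '$target': $!\n";
    return;
}

# Hypothetical invocation:
#   perl concat.pl large_file.dat filepart1.dat filepart2.dat
concatenate_files(@ARGV) if @ARGV >= 2;
```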

I started with the code below, to find the size first:

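    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::stat;

    # File::stat's stat() returns undef if the file cannot be examined.
    my $st = stat("Full_File.txt")
        or die "Cannot stat Full_File.txt: $!\n";
    my $filesize = $st->size;

    print "Size: $filesize\n";

    exit 0;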
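A minimal sketch of how the size check and the a/b argument might be tied together. It assumes that 40KB means 40 * 1024 bytes; `choose_action` and `valid_mode` are hypothetical helpers, not part of any module.

```perl
use strict;
use warnings;

# Assumed threshold: 40 KB taken as 40 * 1024 bytes.
my $MAX_BYTES = 40 * 1024;

# 'direct' means run the command on the whole file,
# 'split' means the file has to be split first.
sub choose_action {
    my ($size, $max) = @_;
    return $size > $max ? 'split' : 'direct';
}

# Only the values 'a' and 'b' are accepted as the second argument.
sub valid_mode {
    my ($mode) = @_;
    return defined $mode && $mode =~ /\A[ab]\z/;
}

if (@ARGV) {
    my ($file, $mode) = @ARGV;
    die "Usage: $0 large_file a|b\n" unless valid_mode($mode);
    # Built-in stat returns an empty list on failure.
    my @st = stat $file
        or die "Cannot stat '$file': $!\n";
    my $action = choose_action($st[7], $MAX_BYTES);   # element 7 is the size in bytes
    print "$file: $st[7] bytes, action=$action, mode=-$mode\n";
}
```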

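It would be great if someone could help me learn. If this is not possible: the file reaches 40KB at about every 500 lines, so I think it would be easier to append the next available "specific string" after every 500 lines. If the file has fewer than 1000 lines, split it into just 2 parts and process them with the commands above; nothing needs to be appended to part2, because it already has one at its EOF. Maybe that is easier?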
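The line-count variant described above might be sketched as follows. `split_by_lines` is a hypothetical name, the lines are held in memory for simplicity, and the marker pattern is an assumption based on the sample data (`var:value_var_v` at the start of a line, possibly indented).

```perl
use strict;
use warnings;

# Split @$lines into chunks of $chunk lines. Every chunk except the last
# gets a copy of the first marker line found after its boundary.
sub split_by_lines {
    my ($lines, $chunk, $marker_re) = @_;
    my @parts;
    for (my $start = 0; $start < @$lines; $start += $chunk) {
        my $end = $start + $chunk - 1;
        $end = $#$lines if $end > $#$lines;
        my @part = @{$lines}[$start .. $end];
        if ($end < $#$lines) {
            # Look ahead for the next marker beyond this chunk.
            for my $i ($end + 1 .. $#$lines) {
                if ($lines->[$i] =~ $marker_re) {
                    push @part, $lines->[$i];
                    last;
                }
            }
        }
        push @parts, \@part;
    }
    return @parts;
}

# Tiny demonstration with a chunk size of 4 instead of 500.
my @lines = map { "$_\n" }
    ('a', 'b', 'var:value_var_v(1)', 'c', 'd', 'var:value_var_v(2)');
my $n = 1;
for my $part (split_by_lines(\@lines, 4, qr/^\s*var:value_var_v/)) {
    print "--- part", $n++, " ---\n", @$part;
}
```

The last chunk is left alone because, as noted above, the file already ends with a marker.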

A better explanation:

large_file.txt

xxx
xxxx
xxxx
xxxxx
var:value_var_v(1234)
xxxxxx
xxxxx
xxxxx
var:value_var_v(4567)
xxxxxxx
xxxxxx
var:value_var_v(abcd)
xxxxxxx  // first split happens here as here assume it is 40kb
xxxxxx
xxxxxxx
var:value_var_v(efgh)

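If it is too large, then split at line 5, say into large_files_part1. Its end should contain var:value_var_v(1234). After line 5 it should split again at line 9, become large_files_part2, and have var:value_var_v(4567) at the end. Part 3 would run up to line 12 and contain var:value_var_v(abcd) at the end, and so on. If there is only one var:value_var_v after the first split, then two parts are fine, as long as both parts have around 500 lines. If the main file has 1300 lines, three splits are needed. Each file should have the next available "string" at its end, so line 1001 would be the first var:value_var_v(1234) after line 1000. The string always starts with var:value_var_v and ends with anything. I hope this is clear.

Expected output, first case: if the string occurs only once after the split, _part1.txt will be around 40,000 in size: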

xxx
xxxx
xxxx
xxxxx
var:value_var_v(1234)
xxxxxx
xxxxx
xxxxx
var:value_var_v(4567)
xxxxxxx
xxxxxx
var:value_var_v(abcd)
xxxxxxx  // split happened here
var:value_var_v(efgh)

_part2.txt
xxxxxxx
xxxxx
var:value_var_v(efgh)

After I do some processing on these files (part 1 and part 2), I concatenate them again

_part1 + _part2 = large_file

The final large file after concatenation:

    xxx
    xxxx
    xxxx
    xxxxx
    var:value_var_v(1234)
    xxxxxx
    xxxxx
    xxxxx
    var:value_var_v(4567)
    xxxxxxx
    xxxxxx
    var:value_var_v(abcd)
    xxxxxxx  // split happened here
    var:value_var_v(efgh)
    ****
  xxxxxxx
  xxxxx
  var:value_var_v(efgh)

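Second split-and-concatenate case:

If the file is much larger, say 80KB, and there are many "var:value_var_v()" strings after the first split at 40KB, make further splits: the script sees that the next string is again "var:value_var_v()" and splits based on the string, otherwise based on size. Every file part should contain the next available var:value_var_v().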
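One possible reading of this rule, as a sketch: a part is closed once its size passes the limit and receives a copy of the next marker line, and the following part then runs up to the real occurrence of that marker before size-based splitting resumes. `split_by_size` is a hypothetical name, the lines are held in memory for simplicity, and the part that follows a size-based split is not itself size-limited here.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Split @$lines into parts (array refs), following the reading above.
sub split_by_size {
    my ($lines, $max_bytes, $marker_re) = @_;
    my (@parts, @current);
    my $size = 0;
    my $close_at;           # index of the marker that ends the current part
    my $markers_left = 1;   # false once look-ahead finds no further marker

    for my $i (0 .. $#$lines) {
        push @current, $lines->[$i];
        $size += length $lines->[$i];

        if (defined $close_at) {
            next if $i < $close_at;
            # Reached the marker that was copied onto the previous part.
            push @parts, [@current];
            @current = ();
            $size = 0;
            undef $close_at;
            next;
        }

        next unless $size > $max_bytes && $markers_left;
        my $next = first { $lines->[$_] =~ $marker_re } $i + 1 .. $#$lines;
        if (!defined $next) {
            $markers_left = 0;   # nothing left to append; keep the tail together
            next;
        }
        push @parts, [@current, $lines->[$next]];
        @current = ();
        $size = 0;
        $close_at = $next;
    }
    push @parts, [@current] if @current;
    return @parts;
}

# Tiny demonstration with a 10-byte limit instead of 40KB.
my @lines = map { "$_\n" } ('xxx', 'xxxx', 'var:value_var_v(1)', 'xxxxx',
                            'var:value_var_v(2)', 'xx', 'var:value_var_v(3)');
my $n = 1;
for my $part (split_by_size(\@lines, 10, qr/^\s*var:value_var_v/)) {
    print "--- part", $n++, " ---\n", @$part;
}
```

Under this reading, two consecutive parts end with the same marker line, which is the behaviour the question describes for part1 and part2 in this case.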

Original file:

    xxx
    xxxx
    xxxx
    xxxxx
    var:value_var_v(1234)
    xxxxxx
    xxxxx
    xxxxx
    var:value_var_v(4567)  
    xxxxxxx 
    // assume the split happens here, at about 40KB. Two more strings starting with var:value_var_v follow, so copy var:value_var_v(abcd) to the end of the previous part and split again after var:value_var_v(abcd). The final part then ends with var:value_var_v(efgh) and is kept as it is.
    xxxxxx
    xxxxxx
    xxxxx
    var:value_var_v(abcd)
    xxxxxxx  
    xxxxxx
    xxxxxxx
    var:value_var_v(efgh)



part1.txt 
   xxx
    xxxx
    xxxx
    xxxxx
    var:value_var_v(1234)
    xxxxxx
    xxxxx
    xxxxx
    var:value_var_v(4567)  
    xxxxxxx - - // split happens here as here assume it is 40kb
    var:value_var_v(abcd) - //prints next available string which is var:value_var_v(abcd)
    _part2.txt
    xxxxxx
    xxxxxx
    xxxxx
    var:value_var_v(abcd)  

// Here part1 and part2 end with the same string.

    _part3.txt
    xxxxxxx  
    xxxxxx
    xxxxxxx
    var:value_var_v(efgh) - This is last part and size should be below 40KB

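Process all of these parts (part1, part2, part3) and then concatenate them into one large file.

The final file after concatenation

The complete file looks like this.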

xxx
    xxxx
    xxxx
    xxxxx
    var:value_var_v(1234)
    xxxxxx
    xxxxx
    xxxxx
    var:value_var_v(4567)  
    xxxxxxx // split happened here in the first split assumed 40KB 
    var:value_var_v(abcd) 
      ******
    xxxxxx
   xxxxxx
    xxxxx
    var:value_var_v(abcd)  
     ******
   xxxxxxx  
    xxxxxx
    xxxxxxx
    var:value_var_v(efgh)
   ******

PS: Once each part has been processed, I get a unique value at the end of each part, and I keep it during concatenation.

score 1 · Accepted Answer

I think this is very close to what you want.

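use strict;
use warnings;

# The file names are assumed to come from the command line.
my ($input_filename, $output_filename) = @ARGV;
die "Usage: $0 input_file output_prefix\n" unless defined $output_filename;

my $part_string  = 'var:value_var_v';
my $file_count   = 1;
my $total_length = 0;
my $max_length   = 40_000;   # approximate size of each part, in characters

open (my $input, '<', $input_filename)
  or die "Cannot open '$input_filename': $!\n";
open (my $output, '>', "${output_filename}_part_${file_count}")
  or die "Cannot create part $file_count: $!\n";
while (my $line = <$input>) {
  print $output $line;
  $total_length += length($line);
  if ($total_length > $max_length) {
    # Close the current part with a marker line and start the next one.
    print $output "$part_string$file_count\n";
    close $output;
    $file_count++;
    open ($output, '>', "${output_filename}_part_${file_count}")
      or die "Cannot create part $file_count: $!\n";
    $total_length = 0;
  }
}
# Terminate the last part if it received any lines.
if ($total_length > 0) {
  print $output "$part_string$file_count\n";
}
close $output;
close $input;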
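The code above ends each part with a generated line (`$part_string` plus the part number). If the part should instead end with the next marker that actually occurs in the input, as the question asks, one option is to look ahead on the same filehandle and then restore the read position. This is a sketch: `peek_next_marker` is a hypothetical helper, and it assumes the handle is seekable (a regular file, not a pipe).

```perl
use strict;
use warnings;

# Return the next line on $fh that matches $marker_re, or undef if there
# is none. The read position is restored before returning.
sub peek_next_marker {
    my ($fh, $marker_re) = @_;
    my $pos = tell $fh;
    my $found;
    while (my $line = <$fh>) {
        if ($line =~ $marker_re) {
            $found = $line;
            last;
        }
    }
    seek($fh, $pos, 0) or die "Cannot seek: $!\n";
    return $found;
}

# Demonstration on an in-memory filehandle.
my $text = "xxx\nxxxx\nvar:value_var_v(abcd)\nxxxxx\n";
open(my $demo, '<', \$text) or die "Cannot open in-memory file: $!\n";
my $first  = <$demo>;                                          # consume "xxx"
my $marker = peek_next_marker($demo, qr/^\s*var:value_var_v/);
my $second = <$demo>;                                          # reading resumes at "xxxx"
print "next marker: $marker";
print "next line:   $second";
close $demo;
```

Inside the `if ($total_length > $max_length)` branch, the result of `peek_next_marker($input, qr/^\s*var:value_var_v/)` could be printed in place of `"$part_string$file_count\n"` when it is defined.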
