Is there a way to split a text file into pieces/chunks after every Nth occurrence of a delimiter?
Example: the delimiter below is "+".
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
...
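There are several million entries, so splitting on every single occurrence of the delimiter "+" is a bad idea. I want to split on, say, every 50,000th instance of the delimiter "+".
The Unix commands "split" and "csplit" don't seem to do this...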
With awk you could do:
awk '/^\+$/ { delim++ } { file = sprintf("chunk%s.txt", int(delim / 50000)); print >> file; }' < input.txt
Update:
To leave out the delimiter that ends each chunk, try this:
awk '/^\+$/ { if(++delim % 50000 == 0) { next } } { file = sprintf("chunk%s.txt", int(delim / 50000)); print > file; }' < input.txt
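The next keyword makes awk stop processing rules for the current record and move on to the next one (the next line). I also changed >> to >, because if you run this more than once you probably don't want to append to the old chunk files.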
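To try the behaviour on a small scale, here is a minimal sketch that runs the same awk rules with the chunk size passed in as a variable. The variable name n, the temporary directory, and the four-entry input are illustrative assumptions, not part of the original command.

```shell
#!/bin/sh
# Sketch: split after every n-th "+" line, dropping only the "+" that ends a chunk.
# n=2, the temporary directory and the sample input are illustrative choices.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Four small entries, each terminated by a line holding only "+"
printf '%s\n' 'entry 1' + 'entry 2' + 'entry 3' + 'entry 4' + > input.txt

awk -v n=2 '
    /^\+$/ { if (++delim % n == 0) next }   # skip the "+" that closes a chunk
    { file = sprintf("chunk%d.txt", int(delim / n)); print > file }
' input.txt

# Show what landed in each chunk
for f in chunk*.txt; do
    echo "== $f"
    cat "$f"
done
```

With n=2, only every second "+" is dropped; the "+" between two entries inside the same chunk is kept. For the real data you would pass -v n=50000.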
If no suitable alternative turns up, it's not very hard to do in Perl (and it should perform pretty well):
#!/usr/bin/env perl
use strict;
use warnings;
# Configuration items - could be set by argument handling
my $prefix = "rs."; # File prefix
my $number = 1; # First file number
my $width = 4; # Number of digits to use in file name
my $rx = qr/^\+$/; # Match regex
my $limit = 3; # 50,000 in real case
my $quiet = 0; # Set to 1 to suppress file names
sub next_file
{
    my $name = sprintf("%s%.*d", $prefix, $width, $number++);
    open my $fh, '>', $name or die "Failed to open $name for writing";
    print "$name\n" unless $quiet;
    return $fh;
}

my $fh = next_file;  # Output file handle
my $counter = 0;     # Match counter
while (<>)
{
    print $fh $_;
    $counter++ if (m/$rx/);
    if ($counter >= $limit)
    {
        close $fh;
        $fh = next_file;
        $counter = 0;
    }
}
close $fh;
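This is far from being a one-liner; I'm not sure whether that's a merit or not. The items that should be configurable are grouped together, and could be set via command-line options, for example. You might end up with an empty last file (this happens when the input ends right after the match that completes a chunk); you could spot that and remove it if need be. You'd need a second counter: the existing one is a "match counter", but you'd also need a line counter, and if the line counter was zero at the end, you'd remove the last file. You'd also need to keep the file's name to be able to remove it... it would be fiddly, but not hard.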
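If you would rather not add a second counter to the script, one alternative is to clean up afterwards from the shell. The sketch below is that swapped-in approach, not the in-script line counter described above; the rs.* names mirror the script's default prefix, and the two files are stand-ins created only for the demonstration.

```shell
#!/bin/sh
# Sketch: delete zero-length chunk files left behind by a split.
# The rs.* names mirror the script's default prefix; the files are stand-ins.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

printf 'entry 1\n+\n' > rs.0001   # a chunk with content
: > rs.0002                       # an empty trailing chunk

for f in rs.*; do
    # -s is true only when the file exists and its size is greater than zero
    if [ ! -s "$f" ]; then
        echo "removing empty $f"
        rm -- "$f"
    fi
done

ls
```

Since only the last file can be empty, checking just that one would also work; the loop is simply easier to reuse.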
Given this input (basically two copies of your sample data), the output from repsplit.pl (repeat split) is as shown:
$ perl repsplit.pl data
rs.0001
rs.0002
rs.0003
$ cat data
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
$ cat rs.0001
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
$ cat rs.0002
entry 4
some more
+
entry 1
some more
+
entry 2
some more
even more
+
$ cat rs.0003
entry 3
some more
+
entry 4
some more
+
$
As a concise "one-liner", using perl with + as the input record separator:
If you want each chunk written to newprefix.part.$c, as described in the comments:
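$ limit=50000 perl -053 -Mautodie -lne '
    BEGIN { $\ = ""; $c = 0 }           # no output separator; $c numbers the chunks
    if ($count++ % $ENV{limit} == 0) {  # first record of each group of $limit records
        close $fh if $fh;
        open $fh, ">", "newprefix.part." . $c++;
    }
    print $fh $_;                       # the "+" itself has been chomped off by -l
' file.txt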
$ ls -l newprefix.part.*