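A hash is a good way to index a file. You say:
"I need all lines from a.txt to go to c.txt. But while picking lines from b.txt for c.txt, I first need to look in a.txt. If that line is already present in a.txt, then I should not consider that b.txt line while writing to c.txt [the output]."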
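This means you only really need to index a.txt. You also don't mention how you plan to merge a.txt into c.txt. Do you read one line from each file in turn? Should the final output be sorted?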
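Also, what do you mean by a matching line? Do you mean the entire line matches, or only the first field (everything before the first semicolon)?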
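The two interpretations can give different results on the same data. As a small sketch (the sample lines and the names `first_field`, `%seen_line` and `%seen_key` are made up for illustration and are not from your files), here is how they diverge:

```perl
#! /usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample data; the real files may look different.
my @a_lines = ( 'id1;apple;red', 'id2;banana;yellow' );
my @b_lines = ( 'id1;apple;green', 'id2;banana;yellow', 'id3;cherry;red' );

# Return the text before the first semicolon, or the whole line
# if the line contains no semicolon at all.
sub first_field {
    my ($line) = @_;
    return $line =~ /^([^;]*);/ ? $1 : $line;
}

# Index the a.txt lines two ways: by whole line and by first field.
my ( %seen_line, %seen_key );
for my $line (@a_lines) {
    $seen_line{$line} = 1;
    $seen_key{ first_field($line) } = 1;
}

# Filter the b.txt lines under each interpretation.
my @kept_whole = grep { !$seen_line{$_} } @b_lines;
my @kept_key   = grep { !$seen_key{ first_field($_) } } @b_lines;

print "Whole-line match keeps:  @kept_whole\n";
print "First-field match keeps: @kept_key\n";
```

With whole-line matching, `id1;apple;green` survives because it differs from the a.txt line after the first field; with first-field matching it is dropped. The program in this answer assumes first-field matching.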
To keep things flexible, I'll read all of the lines into separate arrays and let you sort it out from there.
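- We'll read a.txt into an array, and also into a hash for indexing.
- We'll read b.txt into another array, but we'll skip any line whose first field is already in the hash index.
- We'll read c.txt into a third array.
- You can then do whatever you want with these arrays and merge the lines in whatever way you like.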
Here's the program:
#! /usr/bin/env perl
# Preliminary stuff. The first two are always a must
use strict;
use warnings;
use autodie; # No need to test on read/write or open/close failures
# This is how Perl defines constants. It's not great
# And unlike variables, they don't easily interpolate in
# strings. But, this is what is native to Perl. There are
# optional modules like "Readonly" that do a better job.
use constant {
    FILE_A => 'a.txt',
    FILE_B => 'b.txt',
    FILE_C => 'c.txt',
};
# I'll use this for indexing
my %a_hash;
# I'll put the file contents in these three arrays
my @a_array;
my @b_array;
my @c_array;
open my $a_fh, "<", FILE_A;
# I'm reading each line of FILE_A. As I read it,
# I'll get the first field and put that as an index
# to my hash
while ( my $line = <$a_fh> ) {
    chomp $line;
    # Capture the first field (everything before the first semicolon).
    # Lines without a semicolon are kept but not indexed.
    if ( $line =~ /^(.+?);/ ) {
        $a_hash{$1} = 1;    # Use the first field as the index into my hash
    }
    push @a_array, $line;   # This adds the line to the end of the array
}
close $a_fh;
# I'll do the same for FILE_B as I did for FILE_A
# I'll go through line by line and push them into @b_array.
# One slight difference. I'll pull out the first field in
# my line, and see if it exists in my %a_hash where I indexed
# the lines in FILE_A. If that line does not exist in my %a_hash
# index, I'll push it into my @b_array
open my $b_fh, "<", FILE_B;
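while ( my $line = <$b_fh> ) {
    chomp $line;    # Strip the newline, as was done for FILE_A
    # Skip the line if its first field is already indexed in %a_hash.
    # Lines without a semicolon have no first field to compare, so keep them.
    if ( $line =~ /^(.+?);/ and exists $a_hash{$1} ) {
        next;
    }
    push @b_array, $line;
}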
close $b_fh;
# Now, I'll toss all the lines in FILE_C into @c_array
# I can do a bit of a shortcut because I don't process
# the lines. I'll just put the whole file into @c_array
# in one fell swoop. I can use "chomp" to remove the NL
# from the end of each item of @c_array in a single line.
open my $c_fh, "<", FILE_C;
@c_array = <$c_fh>;
chomp @c_array;
close $c_fh;
# At this point, @a_array contains the entire contents of FILE_A
# in the order of that file. @c_array also contains all the lines in
# FILE_C in the order of that file. @b_array is a bit different: it
# contains all of the lines in FILE_B except for those lines
# whose first column was already in FILE_A.
#
# You don't specify exactly what you want done at this point. Do
# you want to combine @a_array with @b_array? Here's how we can do
# that:
my @combined_array = sort (@a_array, @b_array);
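Now you have three arrays representing your three files, each in the same order as its file. @a_array and @c_array contain all of the lines in a.txt and c.txt respectively. @b_array contains all of the lines in b.txt whose first field does not appear in a.txt.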
From here, you can take these three arrays and merge them however you want.
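For instance, if what you want is every line of a.txt followed by the surviving lines of b.txt, in file order rather than sorted, a minimal sketch might look like this. The array contents below are sample data standing in for what the program builds, and `@merged` is a name introduced only for this illustration:

```perl
#! /usr/bin/env perl
use strict;
use warnings;

# Stand-ins for the arrays built by the program (sample data only).
my @a_array = ( 'id2;banana', 'id1;apple' );
my @b_array = ( 'id3;cherry' );    # already filtered against a.txt

# Keep file order: all of a.txt first, then the remaining b.txt lines.
# No sort is applied, unlike the @combined_array example.
my @merged = ( @a_array, @b_array );

# Print one line per entry; redirect STDOUT to a file to save the result.
print "$_\n" for @merged;
```

Whether you sort, interleave, or simply concatenate depends on what the output is supposed to look like, which the question doesn't specify.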