python - Extract lines from a file only once using shell/python/perl

Question

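I have a big file with numbers, for example:

Every day I extract some of the numbers from the big file and save that day's numbers in a second file. Every day new numbers are added to the source data in my big file. I need a filter for the extraction job that ensures I never extract a number I have already extracted. How can I do this with a bash or python script?

Note: I can't delete numbers from the source data ("big file"). It has to stay intact, because once I'm done extracting numbers from the file I need the original plus the updated data for the next day's job. If I make a copy of the file and delete the numbers from the copy, the newly added numbers are not taken into account.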

score 2 · Accepted Answer

Read all the numbers from the big file into a set, then test the new numbers against it:

with open('bigfile.txt') as bigfile:
    existing_numbers = {n.strip() for n in bigfile}

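# Open bigfile in append mode ('a'); 'w' would truncate it before writing.
with open('newfile.txt') as newfile, open('bigfile.txt', 'a') as bigfile:
    for number in newfile:
        number = number.strip()
        if number and number not in existing_numbers:
            bigfile.write(number + '\n')
            # Remember it so a repeat inside newfile is not written twice.
            existing_numbers.add(number)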

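This appends to the end of bigfile only the numbers that are not already in it. Each membership test against the set takes constant time on average, so the cost is one pass over each file.

If bigfile grows too large for this to run efficiently, you may want to use a database instead.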
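As a rough illustration of the database suggestion, here is a minimal sketch using Python's built-in sqlite3 module. The table name `extracted` and the helper `extract_new` are assumptions made for this example, not something from the answer above.

```python
import sqlite3


def extract_new(conn, candidates):
    """Return the candidates not seen before and record them as extracted.

    `conn` is an open sqlite3 connection; the `extracted` table is a
    hypothetical name chosen for this sketch.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted (number TEXT PRIMARY KEY)"
    )
    fresh = []
    for number in candidates:
        seen = conn.execute(
            "SELECT 1 FROM extracted WHERE number = ?", (number,)
        ).fetchone()
        if seen is None:
            # First time we meet this number: record it and hand it out.
            conn.execute(
                "INSERT INTO extracted (number) VALUES (?)", (number,)
            )
            fresh.append(number)
    conn.commit()
    return fresh


if __name__ == "__main__":
    # An in-memory database keeps the demo free of side effects; a real job
    # would pass a file path so the table survives between daily runs.
    demo = sqlite3.connect(":memory:")
    print(extract_new(demo, ["111", "222"]))
    print(extract_new(demo, ["222", "333"]))
```

The primary key gives the lookup an index, so the set of already extracted numbers no longer has to fit in memory.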

score 1 · Accepted Answer

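You could save sorted versions of your source file and of the extracted data to temporary files and use comm, a standard POSIX tool, to show the lines/records they have in common. Those records then become the basis of the "filter" you use in subsequent extraction jobs. If you are using $SHELL commands to extract records from the source.txt file, then something like grep -v [list of common lines] would be part of your script, along with whatever other criteria you use for extracting records. For best results the source.txt and extracted.txt files should be sorted.

Here is a quick cut and paste of typical comm output. The sequence shows the "big file", the extracted data, and then the final comm command, which shows the lines unique to source.txt (see man comm(1) for how comm works). After that comes an example of searching for an arbitrary pattern with grep, using the file of common lines as a "filter" to exclude them.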

% cat source.txt                           
3120987654
3106982658
3420787642
3210957659
3320987654
3520987654
3520987754
3520987954
3520988654
3520987444

% cat extracted.txt 
3120987654
3106982658
3420787642
3210957659
3320987654

% comm -2 -3 source.txt extracted.txt  # show lines only in source.txt
3520987654
3520987754
3520987954
3520988654
3520987444
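For readers who prefer Python, the comm -2 -3 step can be approximated with a streaming merge over two sorted inputs. This is only a sketch: the function name `only_in_first` is made up for the example, and like comm it assumes both inputs are sorted in the same order.

```python
def only_in_first(first, second):
    """Yield lines that appear in `first` but not in `second`.

    Both arguments are iterables of lines sorted in plain string order,
    the same precondition comm places on its input files.
    """
    second = iter(second)
    current = next(second, None)
    for line in first:
        # Skip lines of `second` that sort before the current line.
        while current is not None and current < line:
            current = next(second, None)
        if current == line:
            # Matched pair: consume it from `second` and print nothing.
            current = next(second, None)
        else:
            yield line


if __name__ == "__main__":
    source = sorted(["3120987654", "3520987444", "3520987754"])
    extracted = sorted(["3120987654"])
    for number in only_in_first(source, extracted):
        print(number)
```

Because it walks both inputs once, it never holds either file fully in memory, which is the main reason to sort first.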

comm selects or rejects lines common to two files. The utility conforms to IEEE Std 1003.2-1992 ("POSIX.2"). We can save its output for use with grep:

% comm -1 -2 source.txt extracted.txt | sort > common.txt
% grep -v -x -F -f common.txt source.txt | grep -E ".*444$"  # -x -F: match whole lines literally

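This greps the file source.txt while excluding the lines common to source.txt and extracted.txt, then pipes (|) those "filtered" results into a second grep to extract a new record (in this case, the line or lines ending in "444"). If the files are very large, or if you want to preserve the order of the numbers in the original file and in the extracted data, the problem is more complex and the response needs to be more elaborate.

See my other answer for a starting point using perl.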
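The order-preserving case mentioned above is one where sorting for comm gets in the way. A minimal Python sketch, assuming the extracted numbers fit in memory as a set (the name `select_new` is hypothetical):

```python
import re


def select_new(source_lines, extracted_lines, pattern):
    """Return numbers from source_lines, in their original order, that
    match `pattern` and do not occur in extracted_lines."""
    already = {line.strip() for line in extracted_lines}
    regex = re.compile(pattern)
    picked = []
    for line in source_lines:
        number = line.strip()
        if number and number not in already and regex.search(number):
            picked.append(number)
            already.add(number)  # guard against repeats within the source
    return picked


if __name__ == "__main__":
    source = ["3120987654", "3520987754", "3520987444"]
    extracted = ["3120987654"]
    print(select_new(source, extracted, r"444$"))
```

Neither input has to be sorted, and the result keeps the order of the source file.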

score 0 · Accepted Answer

A lazy perl approach.

Just write your own selection() subroutine to replace grep {/.*444$/} ;-)

#!/usr/bin/env perl  
use strict; use warnings; use autodie;                      
use 5.16.0 ; 

use Tie::File;        
use Array::Utils qw(:all); 

tie my @source, 'Tie::File', 'source.txt' ;               
tie my @extracted, 'Tie::File', 'extracted.txt' ;

# Find the intersection                                                   
my @common = intersect(@source, @extracted);                      

say "Numbers already extracted"; 
say for @common;

untie @source;
untie @extracted;

After the file source.txt has been updated, you can select from it:

#!/usr/bin/env perl  
use strict; use warnings; use autodie;              
use 5.16.0 ; 

use Tie::File;        
use Array::Utils qw(:all); 

tie my @source, 'Tie::File', 'source.txt' ;               
tie my @extracted, 'Tie::File', 'extracted.txt' ;

# Find the intersection                                                   
my @common = intersect(@source, @extracted);                      

# Select from source.txt excluding numbers already selected:
my @newselect = array_minus(@source, @common);
say "new selection:";
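# grep returns a list; the "()" around $selection give list context,
# so $selection receives the first match (or undef when nothing matches).
my ($selection) = grep {/.*444$/} @newselect;
push @extracted, $selection if defined $selection;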
say "updated extracted.txt" ; 

untie @source;
untie @extracted;

This uses two modules ... succinct and idiomatic versions are welcome!
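For comparison, roughly the same job can be sketched in Python with only the standard library. The function name `extract_once` and the `keep` callback (standing in for the selection() idea) are assumptions for this example. Unlike the perl script, which pushes only the first match, this version appends every match.

```python
from pathlib import Path


def extract_once(source_path, extracted_path, keep):
    """Append to extracted_path the source numbers that pass keep() and
    were not extracted before; return them in source order."""
    extracted_file = Path(extracted_path)
    old_text = extracted_file.read_text() if extracted_file.exists() else ""
    done = set(old_text.split())
    source = Path(source_path).read_text().split()
    # dict.fromkeys drops repeated numbers while keeping first-seen order.
    fresh = [n for n in dict.fromkeys(source) if n not in done and keep(n)]
    with extracted_file.open("a") as out:
        if old_text and not old_text.endswith("\n"):
            out.write("\n")  # avoid gluing onto an unterminated last line
        out.writelines(n + "\n" for n in fresh)
    return fresh
```

A call such as `extract_once("source.txt", "extracted.txt", lambda n: n.endswith("444"))` mirrors the grep {/.*444$/} selection; running it a second time on unchanged files returns an empty list.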

score 0 · Accepted Answer

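I think you are not asking for unique values, but rather want all the new values added since the last time you looked at the file?

Suppose BigFile is getting new data all the time.

We want DailyFilemm_dd_yy to contain the new numbers received during the previous 24 hours.

This script will do what you want. Run it once a day.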

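BigFile=bigfile
DailyFile=dailyfile
today=$(date +"%m_%d_%Y")
# Get the month, day, year for yesterday.
# Note: -j and -f are BSD/macOS date options; GNU date would use: date -d yesterday +"%m_%d_%Y"
yesterday=$(date -jf "%s" $(($(date +"%s") - 86400)) +"%m_%d_%Y")

# Snapshot today's big file, then list the lines missing from yesterday's snapshot.
# comm expects both files to be sorted.
cp "$BigFile" "$BigFile$today"
comm -23 "$BigFile" "$BigFile$yesterday" > "$DailyFile$today"
rm "$BigFile$yesterday"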

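With -23, comm shows the lines that appear only in the first file, that is, the lines that are not in both files.

comm example: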

#values added to big file
echo '111
222
333' > big

cp big yesterday

# New values added to big file over the day
echo '444
555' >> big

# Find out what values were added.
comm -23 big yesterday > today
cat today

Output

444
555
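If the big file is only ever appended to, as the example above assumes, a stored byte offset can stand in for yesterday's snapshot copy. This is a different technique from the comm script, sketched here in Python under that append-only assumption; `read_appended` and the offset file are hypothetical names.

```python
def read_appended(big_path, offset_path):
    """Return the numbers appended to big_path since the last call.

    The byte offset reached so far is kept in offset_path (a small text
    file); a missing offset file means "start from the beginning".
    """
    try:
        with open(offset_path) as f:
            offset = int(f.read().strip() or 0)
    except FileNotFoundError:
        offset = 0
    with open(big_path, "rb") as big:
        big.seek(offset)          # jump past everything already handled
        data = big.read()
        new_offset = big.tell()
    with open(offset_path, "w") as f:
        f.write(str(new_offset))  # remember where the next run resumes
    return data.decode().split()
```

This avoids copying and sorting the big file each day, but it breaks if existing lines are edited or removed, or if the job runs while a line is only partly written.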
