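I think you need to preprocess the file to impose an ordering, sort it using the imposed ordering, and then remove the ordering information. The awk script below has had its spaces removed so that it fits on one line on SO without a horizontal scroll bar; removing the spaces hurts readability.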
awk -F, '{if(order[$1]==0)order[$1]=++counter;print order[$1]","$0;}' infile.csv |
sort -t, -s -k1,1n |
sed 's/^[^,]*,//'
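This looks to see whether the flight number (field $1) has been seen before. If not, it assigns the flight a new sequence number. It then prints the record with the sequence number at the start of the line, followed by a comma and the original record. The output is then sorted stably by sort (the awk could be modified so that you get a stable result even if your sort does not support stable sorting). Finally, the sequence number is stripped from the sorted output; I chose to use sed, but cut or awk or a number of other programs could be used instead.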
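The parenthetical about sorts without a stable option can be made concrete. This is only a sketch under assumptions: the five-record sample is made up, and adding awk's NR (the input line number) as a second sort key is one way to do it, not necessarily the only one. Because each (sequence number, NR) pair is unique, sort never has to break a tie, so -s is not needed; cut then removes both added fields.

```shell
# Hypothetical sample: key in field 1, original position in field 2.
decorated=$(printf '%s\n' 'B,1' 'A,2' 'B,3' 'C,4' 'A,5' |
    awk -F, '{ if (order[$1] == 0) order[$1] = ++counter
               print order[$1] "," NR "," $0 }')
printf '%s\n' "$decorated"

# Both keys are numeric and the (group, NR) pair is unique, so no -s is needed.
grouped=$(printf '%s\n' "$decorated" |
    sort -t, -k1,1n -k2,2n |
    cut -d, -f3-)
printf '%s\n' "$grouped"
```

The first printf shows the decorated records (for example 1,1,B,1 and 2,2,A,2); the second shows the records grouped as B, B, A, A, C, with the original order preserved inside each group.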
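I tested with the input file shown below. It is based on the data in the question, but I reordered it so that flights '587 and '588 appear in the input before '572 (which makes it a more stringent test of the ordering). I also made each line unique by setting the last digit of the fourth track column to a different value for each record; those values are in descending order, so if the sort were looking at the data rather than just the sequence number, the records would come out in a different order.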
20110117559515, , , , , , , , ,2446,6720,370,42
20110117559587, , , , , , , , ,2385,6273,390,54
20110117559588, , , , , , , , ,2816,6847,250,32
20110117559572, , , , , , , , ,2390,6274,410,54
20110117559574, , , , , , , , ,2391,6284,390,54
20110117559515, , , , , , , , ,24xx,67xx,3xx,49
20110117559515, , , , , , , , ,24xx,67xx,3xx,48
20110117559572, , , , , , , , ,2390,6274,410,59
20110117559515, , , , , , , , ,2446,6720,370,47
20110117559515, , , , , , , , ,24xx,67xx,3xx,46
20110117559515, , , , , , , , ,24xx,67xx,3xx,45
20110117559572, , , , , , , , ,23xx,62xx,4xx,58
20110117559572, , , , , , , , ,23xx,62xx,4xx,57
The output looks correct to me:
20110117559515, , , , , , , , ,2446,6720,370,42
20110117559515, , , , , , , , ,24xx,67xx,3xx,49
20110117559515, , , , , , , , ,24xx,67xx,3xx,48
20110117559515, , , , , , , , ,2446,6720,370,47
20110117559515, , , , , , , , ,24xx,67xx,3xx,46
20110117559515, , , , , , , , ,24xx,67xx,3xx,45
20110117559587, , , , , , , , ,2385,6273,390,54
20110117559588, , , , , , , , ,2816,6847,250,32
20110117559572, , , , , , , , ,2390,6274,410,54
20110117559572, , , , , , , , ,2390,6274,410,59
20110117559572, , , , , , , , ,23xx,62xx,4xx,58
20110117559572, , , , , , , , ,23xx,62xx,4xx,57
20110117559574, , , , , , , , ,2391,6284,390,54
You could also do the whole thing quite quickly in Perl. On a machine with a GiB or more of RAM, 66 MiB is hardly an excessive amount of data.
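No Perl version is shown here, so the following is only a rough sketch of the same single-pass, in-memory idea, written in awk to stay with the tools used above. It assumes the whole file fits in memory and that your awk copes with long strings; the function name group_in_memory and the array names slot and buf are made up for illustration.

```shell
# group_in_memory is a hypothetical helper name; it reads CSV on stdin.
group_in_memory() {
    awk -F, '
        # First time a key is seen, give it the next slot number.
        !($1 in slot) { slot[$1] = ++n }
        # Append the record to its slot, keeping input order within the key.
        { buf[slot[$1]] = buf[slot[$1]] $0 "\n" }
        # Emit the slots in order of first appearance.
        END { for (i = 1; i <= n; i++) printf "%s", buf[i] }
    '
}

# Small made-up sample: prints B,1 B,3 A,2 A,5 C,4 on separate lines.
grouped_mem=$(printf '%s\n' 'B,1' 'A,2' 'B,3' 'C,4' 'A,5' | group_in_memory)
printf '%s\n' "$grouped_mem"
```

On a real file this would be run as group_in_memory < infile.csv. It avoids the external sort entirely, at the cost of holding every record in memory until the end.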
This Perl script (genflights.pl) creates about 69 MB (roughly 66 MiB) of data:
#!/usr/bin/env perl
use strict;
use warnings;

# The sequence number makes every generated record unique.
my $seq = 1000000;
for my $time (0..1500)
{
    for my $flight (0..1000)
    {
        # Random flight ID in the range 20110117559000..20110117559999.
        my $r0 = int(rand(1000)) + 20110117559000;
        my $r1 = int(rand(10000));
        my $r2 = int(rand(10000));
        my $r3 = int(rand(1000));
        my $r4 = int(rand(100));
        printf "%s, , ,%07d, ,%04d,%04d,%03d,%02d\n", $r0, ++$seq, $r1, $r2, $r3, $r4;
    }
}
The first few lines of output from one run were:
20110117559486, , ,1000001, ,2670,6847,792,91
20110117559489, , ,1000002, ,0278,1929,972,25
20110117559845, , ,1000003, ,9169,4915,145,21
20110117559356, , ,1000004, ,3519,1660,106,97
20110117559976, , ,1000005, ,8988,7830,884,64
20110117559446, , ,1000006, ,7459,7458,791,93
20110117559442, , ,1000007, ,7265,5853,012,41
20110117559686, , ,1000008, ,4624,0682,859,32
20110117559081, , ,1000009, ,3624,0264,017,06
20110117559336, , ,1000010, ,6501,9033,329,33
20110117559869, , ,1000011, ,5020,3008,919,96
20110117559047, , ,1000012, ,5747,4140,693,83
20110117559531, , ,1000013, ,0591,1866,482,68
20110117559355, , ,1000014, ,2254,2731,946,99
20110117559952, , ,1000015, ,0941,0531,743,85
Generating the 69 MB file (flights) took about 3 seconds. I then ran the script above under 'time' (with the output redirected to the file flights.out) and got this output:
+ awk -F, '{if(order[$1]==0)order[$1]=++counter;print order[$1]","$0;}' flights
+ sort -t, -s -k1,1n
+ sed 's/^[^,]*,//'
real 0m8.658s
user 0m7.881s
sys 0m0.441s
So the time taken to process 69 MB was under 10 seconds.
-rw-r--r-- 1 jleffler staff 69115046 Mar 13 09:04 flights
-rw-r--r-- 1 jleffler staff 69115046 Mar 13 09:06 flights.out
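Matching byte counts are consistent with nothing being lost, but they don't prove it. A minimal sketch of a stricter check follows, assuming mktemp is available and that temporary files are acceptable; the helper name same_lines is made up here. It sorts both files fully and compares the results byte for byte, so it succeeds only when the two files hold exactly the same records in some order.

```shell
# Hypothetical helper: succeeds when two files hold the same lines in any order.
same_lines() {
    a=$(mktemp) && b=$(mktemp) || return 2
    LC_ALL=C sort "$1" > "$a"
    LC_ALL=C sort "$2" > "$b"
    if cmp -s "$a" "$b"; then status=0; else status=1; fi
    rm -f "$a" "$b"
    return "$status"
}

# Made-up sample standing in for flights and flights.out.
src=$(mktemp); dst=$(mktemp)
printf '%s\n' 'B,1' 'A,2' 'B,3' > "$src"
printf '%s\n' 'B,1' 'B,3' 'A,2' > "$dst"
if same_lines "$src" "$dst"; then check="same records"; else check="records differ"; fi
printf '%s\n' "$check"
rm -f "$src" "$dst"
```

For the run above, the equivalent call would be same_lines flights flights.out. This only confirms that no record was lost or altered; it says nothing about whether the grouping order is right.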
The output file starts:
20110117559486, , ,1000001, ,2670,6847,792,91
20110117559486, , ,1001621, ,2274,5287,188,57
20110117559486, , ,1001642, ,2716,6983,778,49
20110117559486, , ,1002791, ,1704,9426,430,05
...
20110117559486, , ,2501369, ,4900,8239,048,70
20110117559486, , ,2501850, ,7114,8721,684,40
20110117559489, , ,1000002, ,0278,1929,972,25
20110117559489, , ,1000090, ,0114,7462,862,55
20110117559489, , ,1000904, ,7780,8559,121,47
20110117559489, , ,1001499, ,9320,8459,592,01
...
20110117559489, , ,2499635, ,5199,8313,668,30
20110117559489, , ,2499955, ,3386,6280,102,19
20110117559489, , ,2500748, ,5740,6370,594,15
20110117559489, , ,2501534, ,1222,9866,714,24
20110117559845, , ,1000003, ,9169,4915,145,21
20110117559845, , ,1000220, ,5341,8347,724,25
20110117559845, , ,1000295, ,5722,4031,045,11
...
This was run on a 2.3 GHz Intel Core i7 MacBook Pro with 16 GiB RAM, running Mac OS X 10.7.5.