linux - Splitting a large file into chunks based on a regular expression (Linux)

Question

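I have a large text file that I want to split into smaller files according to the distinct values of one column. The columns are comma-separated (it is a CSV file), and there are many distinct values:

For example: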

1012739937,2006-11-28,d_02245211
1012739937,2006-11-28,d_02238545
1012739937,2006-11-28,d_02236564
1012739937,2006-11-28,d_01918338
1012739937,2006-11-28,d_02148765
1012739937,2006-11-28,d_00868949
1012739937,2006-11-28,d_01908448
1012740478,1998-06-26,d_01913689
1012740478,1998-06-26,i_4869
1012740478,1998-06-26,d_02174766

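I want to split the file into smaller files so that each file contains the records belonging to one year (one for the 2006 records, one for the 1998 records, and so on).

(Here the number of years is probably limited, but I would like to do the same thing with a column that has many more distinct values.)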

score 4 · Accepted Answer

You can use awk:

awk -F, '{split($2,d,"-");print > d[1]}' file

Explanation:

-F,              tells awk that input fields are separated by ','

split($2,d,"-")  splits the second column (the date) by '-'
                 and puts the bits into the array 'd'

print > d[1]     prints the whole input line into a file named after the year
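When the key column has many distinct values, some awk implementations can run out of simultaneously open output files. A minimal sketch of one way around that, assuming a hypothetical input file `sample.csv` and a hypothetical `year_<key>.csv` naming scheme (neither comes from the answer above): truncate each output file the first time its key is seen, then append and close after every line.

```shell
# Sample input (hypothetical file name), using rows from the question.
cat > sample.csv <<'EOF'
1012739937,2006-11-28,d_02245211
1012739937,2006-11-28,d_02238545
1012740478,1998-06-26,d_01913689
1012740478,1998-06-26,i_4869
1012740478,1998-06-26,d_02174766
EOF

awk -F, '
{
    split($2, d, "-")              # d[1] is the year
    out = "year_" d[1] ".csv"      # hypothetical naming scheme
    if (!(out in seen)) {          # first time this key appears: start from an empty file
        printf "" > out
        close(out)
        seen[out] = 1
    }
    print >> out                   # append, so reopening the file never truncates it
    close(out)                     # keep at most one output file open at a time
}' sample.csv
```

Closing after every line is safe but slow on very large inputs. If the input is already grouped by key, closing only when the key changes should be noticeably cheaper.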

score 3 · Accepted Answer

A quick awk solution, if slightly fragile (it assumes that the second column, when present, always starts with yyyy):

awk -F, '$2{print > (substr($2,1,4) ".csv")}' test.in

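It splits the input into files named yyyy.csv; make sure files with those names do not already exist in your current directory, or they will be overwritten.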
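The fragility can be reduced by checking the shape of the second column before using it as a file name. The following is only a sketch: the two malformed rows and the `rejects.csv` name are made up for illustration. Note that awk string positions start at 1.

```shell
# Sample input: two rows from the question plus two hypothetical malformed rows.
cat > test.in <<'EOF'
1012739937,2006-11-28,d_02245211
1012740478,1998-06-26,i_4869
1012740479,,d_00000001
1012740480,n/a,d_00000002
EOF

awk -F, '
# Second column starts with four digits and a hyphen: route the row by year.
$2 ~ /^[0-9][0-9][0-9][0-9]-/ { print > (substr($2, 1, 4) ".csv"); next }
# Anything else goes to a separate file (hypothetical name) instead of being dropped.
{ print > "rejects.csv" }
' test.in
```

Rows with an empty or unexpected date end up in `rejects.csv`, where they can be inspected, rather than producing oddly named output files.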

score 3 · Accepted Answer

A different awk approach, using a slightly more complex field separator:

awk -F '[,-]' '{print > $2}' file
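To see why `$2` is the year here, it may help to print the fields that the separator `[,-]` produces for one line from the question. This is just an illustrative sketch; the variable name `fields` is arbitrary.

```shell
# Split one sample row on both commas and hyphens and list the resulting fields.
fields=$(printf '%s\n' '1012739937,2006-11-28,d_02245211' |
    awk -F '[,-]' '{ for (i = 1; i <= NF; i++) printf "$%d=%s\n", i, $i }')
printf '%s\n' "$fields"
# $1=1012739937
# $2=2006
# $3=11
# $4=28
# $5=d_02245211
```

The date is spread over fields 2 to 4, so the year lands in `$2`. This relies on the first column never containing a hyphen; if it did, the year would shift to a later field.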

answered 2013-06-22T18:38:25.913
