python - How to split a file

Question

I have a text file in this format:

10900   PART1   3211034
10900   PART2   3400458
10900   PART4   3183857
10900   PART3   4152115
10900   PART5   3366650
10900   PART6   1548868
10920   PART3   4154075
10920   PART2   3404018
10920   PART1   3207571
10920   PART4   3178505
10920   PART6   1882924
10920   PART5   3363267
10940   PART6   2183534
10940   PART3   4153924
10940   PART4   3178554
10940   PART1   3207436
10940   PART5   3363585
10940   PART2   3404220

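I want to split the file: first, by the first column; second, so that the sum of column 3 in each output file is no greater than 10000000.

Here is my code, which splits the file by the first column only: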

# Read every row of the input file as a list of whitespace-separated fields.
with open('Normal_All_TracNum_mod') as file1:
    data = [line.split() for line in file1]

# Output files are named after the value in column 1.
RCV_check = data[0][0]
filewrite = open(RCV_check, "w")

for row in data:
    if row[0] != RCV_check:
        # Column 1 changed: close the current file and start the next one.
        RCV_check = row[0]
        filewrite.close()
        filewrite = open(RCV_check, "w")
    filewrite.write(row[0] + "          " + row[1] + '\n')
filewrite.close()

The output I want:

File 1
 10900  PART1   3211034
 10900  PART2   3400458
 10900  PART4   3183857
File 2
 10900  PART3   4152115
 10900  PART5   3366650
 10900  PART6   1548868
...etc

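Each file must contain a single column 1 value, and the sum of its column 3 values (for example 3211034 + 3400458 + 3183857) must be no greater than 10000000. The same applies to all the other files...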

score 2 · Accepted Answer

Here is one way using awk:

awk '{ s+=$3 } s>10000000 || $1!=x { s=$3; c++ } { print > ("File" c); x=$1 }' file

This creates 7 files. Here is the output of grep . File*, which shows the contents of each file:

File1:10900   PART1   3211034
File1:10900   PART2   3400458
File1:10900   PART4   3183857
File2:10900   PART3   4152115
File2:10900   PART5   3366650
File2:10900   PART6   1548868
File3:10920   PART3   4154075
File3:10920   PART2   3404018
File4:10920   PART1   3207571
File4:10920   PART4   3178505
File4:10920   PART6   1882924
File5:10920   PART5   3363267
File6:10940   PART6   2183534
File6:10940   PART3   4153924
File6:10940   PART4   3178554
File7:10940   PART1   3207436
File7:10940   PART5   3363585
File7:10940   PART2   3404220

score 0 · Accepted Answer

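I don't understand what you want to do with the first column. However, here is some Python that respects the constraint on the sum of the third column: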

import itertools

fileID = itertools.count(1)   # yields 1, 2, 3, ... for the output file names
threshold = 10000000
total = 0                     # running sum of the last column in the current file

with open('path/to/file') as infile:
    outfile = open("file%d" % next(fileID), 'w')
    for line in infile:
        val = int(line.split()[-1])
        if total + val > threshold:
            # This line would push the sum past the threshold: start a new file.
            outfile.close()
            total = 0
            outfile = open("file%d" % next(fileID), 'w')
        outfile.write(line)
        total += val
    outfile.close()

Hope this helps.

score 0 · Accepted Answer

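If I understood your specification correctly, the following might work for you. Basically, it checks whether the third field is greater than 10000000; if so, it prints the line to filec (where c is a counter), then resets the running sum of the third field and increments the file counter. Otherwise it prints the line, adds the third field to the sum, and moves on to the next file once the sum exceeds 10000000.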

awk 'BEGIN {c=1}
     $3>10000000 {print $0 > ("file" c) ; c++ ; sum=0 } 
     $3<=10000000 {print $0 > ("file" c) ; sum+=$3 ; if (sum> 10000000) {sum=0;c++}}' INPUTFILE

If you want to split on both the first column and the sum of the third column:

awk 'NR==1 {f=$1; c=1 ; fname=f c ; s=0}
     NR>1  {if ($1 != f) {f=$1 ; c=1 ; fname=f c; s=0 } }
           {if (s+$3 <= 10000000) {print $0 > (fname); s+=$3} else {c++;fname=f c;s=$3; print $0 > (fname)} }' INPUTFILE

Yes, I know this could be shortened...
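
For comparison, the same per-key naming idea (column 1 value followed by a counter that restarts for each key) could be sketched in Python with itertools.groupby. This is an illustration under assumptions rather than a tested replacement. The function name plan_files is hypothetical, rows are re-joined with three spaces, and the input is assumed to be already grouped by column 1, as in the question's sample.

```python
from itertools import groupby


def plan_files(lines, limit=10_000_000):
    """Map output file names ('<column 1><counter>') to the lines they should hold."""
    rows = [line.split() for line in lines if line.strip()]
    plan = {}
    # groupby only merges adjacent rows, so the input is assumed to be
    # already grouped by column 1.
    for key, group in groupby(rows, key=lambda row: row[0]):
        counter, running = 1, 0
        for row in group:
            value = int(row[2])
            # Move to the next file for this key if the cap would be exceeded.
            if running and running + value > limit:
                counter += 1
                running = 0
            plan.setdefault(f"{key}{counter}", []).append("   ".join(row))
            running += value
    return plan
```

Each entry of the returned dictionary can then be written out with open(name, "w"). Building the plan in memory first keeps the grouping logic separate from file I/O, which makes it easier to check before anything is written.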
