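After counting the lines, rewind to the beginning of the file and start copying lines to danish.train.part-01. Whenever the line number is a multiple of the test-set size (10% of the total line count), open a new file for the next part.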
#!/usr/bin/env python2.7

trainFile = open("danish.train")

# first pass: count the lines
numberOfLines = 0
for line in trainFile:
    numberOfLines += 1

# size of one part: 10% of the file (integer division)
lengthTest = numberOfLines // 10

# rewind file to beginning
trainFile.seek(0)

numberOfLines = 0
file_number = 0
for line in trainFile:
    # every lengthTest lines, start the next part file
    if numberOfLines % lengthTest == 0:
        file_number += 1
        output = open('danish.train.part-%02d' % file_number, 'w')
    numberOfLines += 1
    output.write(line)
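The same two-pass idea can also be wrapped in a function. This is only a minimal sketch, assuming Python 3; the name `split_into_parts` and its `parts` parameter are made up for illustration and are not part of the script above. It closes each part file before opening the next, and it clamps the part length to at least 1 so that a file with fewer than 10 lines does not trigger a division by zero in the modulo test. As in the script, a line count that is not a multiple of 10 produces extra part files beyond the tenth.

```python
def split_into_parts(path, parts=10):
    """Split the file at `path` into files named `<path>.part-NN`.

    Returns the list of part file names that were written.
    (Hypothetical helper, not part of the original script.)
    """
    # First pass: count the lines.
    with open(path) as source:
        total = sum(1 for _ in source)

    # Lines per part (integer division); at least 1, so the modulo
    # below never divides by zero on a very short file.
    part_length = max(1, total // parts)

    written = []
    output = None
    try:
        # Second pass: copy lines, switching files every part_length lines.
        with open(path) as source:
            for index, line in enumerate(source):
                if index % part_length == 0:
                    if output is not None:
                        output.close()
                    name = "%s.part-%02d" % (path, len(written) + 1)
                    output = open(name, "w")
                    written.append(name)
                output.write(line)
    finally:
        # Close the last part file even if copying failed midway.
        if output is not None:
            output.close()
    return written
```

Calling `split_into_parts("danish.train")` would then write danish.train.part-01, danish.train.part-02, and so on, and return their names.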
Running the script on this input file (sorry, I don't speak Danish!):
one
two
three
four
five
six
seven
eight
nine
ten
eleven
twelve
thirteen
fourteen
fifteen
sixteen
seventeen
eighteen
nineteen
twenty
twenty-one
twenty-two
twenty-three
twenty-four
twenty-five
twenty-six
twenty-seven
twenty-eight
twenty-nine
thirty
this creates the files
danish.train.part-01
danish.train.part-02
danish.train.part-03
danish.train.part-04
danish.train.part-05
danish.train.part-06
danish.train.part-07
danish.train.part-08
danish.train.part-09
danish.train.part-10
where, for example, part 5 contains:
thirteen
fourteen
fifteen
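To see why part 5 ends up with exactly those three lines: the 30-line input gives a part length of 30 // 10 = 3, so part k holds lines (k - 1) * 3 + 1 through k * 3. A small sketch of that arithmetic follows; `part_line_range` is a hypothetical helper written only for this illustration, and it assumes the file has at least as many lines as parts.

```python
def part_line_range(total_lines, part, parts=10):
    """Return the (first, last) 1-based line numbers that land in `part`."""
    # Same part length the script computes as lengthTest.
    part_length = total_lines // parts
    first = (part - 1) * part_length + 1
    last = part * part_length
    return first, last


print(part_line_range(30, 5))  # (13, 15)
```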