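I have a problem that I suspect has a simple solution. I'm building a model and want to test its accuracy with 10-fold cross-validation. To do that, I have to split my training corpus 90%/10% into a training part and a test part, then train my model on the 90% and test it on the 10%. I want to do this ten times, each time with a different 90%/10% split, so that every part of the corpus ends up being used as test data exactly once. I will then average the results of the ten 10% tests.

I tried to write a script that extracts 10% of the training corpus and writes it to a new file, but so far I haven't got it to work. What I have done is count the total number of lines in the file and divide that number by 10 to find the size of each of the ten test sets I want to extract.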

trainFile = open("danish.train")
numberOfLines = 0

for line in trainFile:
    numberOfLines += 1

lengthTest = numberOfLines // 10  # integer division, so this is a whole number of lines

My own training file has 3638 lines, so each test set should contain roughly 363 lines.

How can I write lines 1-363, lines 364-726, and so on to different test files?

3 Answers

Untested, but here's the basic idea:

def getNthSeg(fpath, n, segSize):
    """Get the nth (0-based) segment of segSize many lines"""
    answer = []
    with open(fpath) as f:
        for i, line in enumerate(f):
            # segment n covers line indices [segSize*n, segSize*(n+1))
            if segSize*n <= i < segSize*(n+1):
                answer.append(line)
    return answer

def getFolds(fpath, k):
    """In your case, k is 10. Leftover lines (numLines % k) are dropped."""
    with open(fpath) as f:
        numLines = len(f.readlines())
    segSize = numLines // k  # integer division
    answer = []
    for n in range(k):
        fold = getNthSeg(fpath, n, segSize)
        answer.append(fold)
    return answer
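
To get from the folds to the ten train/test rounds, each fold can take a turn as the test set while the others are concatenated for training. A minimal sketch, assuming `folds` is a list of lists of lines like the one `getFolds` returns; `train_test_rounds`, `cross_validate` and the `evaluate` callback are hypothetical names standing in for whatever trains and scores your model:

```python
def train_test_rounds(folds):
    """Yield (training_lines, test_lines), using each fold once as the test set."""
    for i, test in enumerate(folds):
        # Flatten every fold except fold i into one training list.
        training = [line for j, fold in enumerate(folds) if j != i for line in fold]
        yield training, test


def cross_validate(folds, evaluate):
    """Average the score over all rounds.

    `evaluate(training, test)` is a hypothetical callback: it should train
    on `training`, test on `test`, and return a number such as accuracy.
    """
    scores = [evaluate(training, test) for training, test in train_test_rounds(folds)]
    return sum(scores) / len(scores)
```

With k = 10 this gives the ten 90%/10% splits from the question, and the returned value is the average of the ten test results.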
answered 2013-02-05T18:50:27.517

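After counting the lines, rewind to the beginning of the file and start copying lines to danish.train.part-01. Whenever the line number is a multiple of the 10% test-set size, open a new file for the next part.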

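#!/usr/bin/env python3

NUM_PARTS = 10

with open("danish.train") as trainFile:
    # first pass: count the lines
    numberOfLines = sum(1 for line in trainFile)
    # integer division; at least 1 so the modulo below never divides by zero
    lengthTest = max(1, numberOfLines // NUM_PARTS)

    # rewind file to beginning
    trainFile.seek(0)

    lineNumber = 0
    file_number = 0
    output = None
    for line in trainFile:
        # start a new part every lengthTest lines, but never more than
        # NUM_PARTS parts: leftover lines go into the last part
        if lineNumber % lengthTest == 0 and file_number < NUM_PARTS:
            if output is not None:
                output.close()
            file_number += 1
            output = open('danish.train.part-%02d' % file_number, 'w')

        lineNumber += 1
        output.write(line)

    if output is not None:
        output.close()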

On this input file (sorry, I don't speak Danish!):

one
two
three
four
five
six
seven
eight
nine
ten
eleven
twelve
thirteen
fourteen
fifteen
sixteen
seventeen
eighteen
nineteen
twenty
twenty-one
twenty-two
twenty-three
twenty-four
twenty-five
twenty-six
twenty-seven
twenty-eight
twenty-nine
thirty

this creates the files

danish.train.part-01
danish.train.part-02
danish.train.part-03
danish.train.part-04
danish.train.part-05
danish.train.part-06
danish.train.part-07
danish.train.part-08
danish.train.part-09
danish.train.part-10

For example, part 5 contains:

thirteen
fourteen
fifteen
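
For each cross-validation round, the nine parts that are not the test set still have to be joined back into one training file. A rough sketch, assuming the part files are named as above; `write_training_file`, its parameters and the output file name are made up for illustration:

```python
def write_training_file(test_part, out_path, prefix="danish.train.part-", num_parts=10):
    """Concatenate every part except `test_part` (1-based) into `out_path`."""
    with open(out_path, "w") as out:
        for part in range(1, num_parts + 1):
            if part == test_part:
                continue  # this part is held out as the test set
            with open("%s%02d" % (prefix, part)) as f:
                for line in f:
                    out.write(line)


# Example (hypothetical output name): hold out part 5, train on the other nine.
# write_training_file(5, "danish.train.without-05")
```

Calling it once for each part number from 1 to 10 would give the ten training files, each paired with the held-out part as its test file.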
answered 2013-02-05T18:50:50.823

If your file isn't very large, you can split it 90/10 like this:

with open("danish.train") as trainFile:
    lines = list(trainFile)
N = len(lines)
testing = lines[:N // 10]   # first 10% of the lines
training = lines[N // 10:]  # remaining 90%
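
The same slicing extends to all ten rounds by moving the test window along the list. A sketch, assuming the whole file fits in memory; `fold_bounds` and `split_round` are illustrative names, and giving the leftover lines to the first few folds is just one reasonable choice:

```python
def fold_bounds(n_lines, k=10):
    """Return k (start, end) index pairs that cover range(n_lines) exactly.

    The first n_lines % k folds get one extra line, so no line is dropped.
    """
    size, extra = divmod(n_lines, k)
    bounds = []
    start = 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def split_round(lines, start, end):
    """Use lines[start:end] for testing and everything else for training."""
    return lines[:start] + lines[end:], lines[start:end]
```

For a 3638-line file, `fold_bounds(3638)` gives eight folds of 364 lines followed by two of 363, so every line lands in exactly one test set.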
answered 2013-02-05T18:54:04.240