
I have some data and I want to split it into smaller groups that maintain a common ratio. I wrote a function that takes two arrays as input, calculates the size ratio, and then tells me the options for how many groups I can split them into (if all the groups are the same size). Here is the function:

def cross_validation_group(train_data, test_data):
    import numpy as np
    from calculator import factors
    test_length = len(test_data)
    train_length = len(train_data)
    total_length = test_length + train_length
    ratio = test_length/float(total_length)
    possibilities = factors(total_length)
    print possibilities
    print possibilities[-1] * ratio
    super_count = 0
    for i in possibilities:
        if i < len(possibilities)/2:
            pass
        else: 
            attempt = float(i * ratio)
            if attempt.is_integer():
                print str(i) + " is an option for total size with " +  str(attempt) + " as test size and " + str(i - attempt) + " as train size! This is with " + str(total_length/i) + " folds."
            else:
                pass
    folds = int(raw_input("So how many folds would you like to use? If no possibilities were given that would be sufficient, type 0: "))
    if folds != 0:
        total_size = total_length/folds
        test_size = int(total_size * ratio)  # cast to int so it can be used as an array dimension
        train_size = total_size - test_size
        columns = len(train_data[0])
        groups = np.empty((folds, test_size + train_size, columns))
        i = 0
        a = 0
        b = 0
        for j in range (0,folds):
            test_size_new = test_size * (j + 1)
            train_size_new = train_size * j
            total_size_new = (train_size + test_size) * (j + 1)
            cut_off = total_size_new - train_size
            p = 0
            while i < total_size_new:
                if i < cut_off:
                    groups[j,p] = test_data[a]
                    a += 1
                else:
                    groups[j,p] = train_data[b]
                    b += 1
                i += 1
                p += 1
        return groups
    else:
        print "This method cannot be used because the ratio cannot be maintained with equal group sizes other than for the options you were given."

So my question is: how can I add a third input to the function for the number of folds, and change the function so that, rather than iterating to make sure every group is the same size with the right ratio, it just keeps the right ratio but allows varying group sizes?
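For what it's worth, here is a minimal sketch of that behaviour (fold_sizes is a hypothetical helper, not part of the function above): each fold gets roughly 1/folds of each array, the remainders are spread one row at a time over the leading folds, so the group sizes vary by at most one and the ratio is only approximately preserved.

```python
def fold_sizes(test_length, train_length, folds):
    # Split each length into `folds` near-equal parts, handing the
    # remainder out one row at a time to the leading folds.
    def sizes(n):
        base, extra = divmod(n, folds)
        return [base + (1 if k < extra else 0) for k in range(folds)]
    return list(zip(sizes(test_length), sizes(train_length)))

print(fold_sizes(143, 357, 9))
# → six (16, 40) folds, then (16, 39), (16, 39), (15, 39)
```

Every row is used exactly once, which sidesteps the "144th row does not exist" problem below, at the cost of the ratio drifting slightly between folds.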

Addition for @JamesHolderness

So your method is almost perfect, but here is one issue:

With lengths 357 and 143 and 9 folds, this is the returned list:

[(39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16)]

Now when you add up the columns, you get 351 and 144.

The 351 is fine because it's less than 357, but the 144 doesn't work because it is greater than 143! The reason for this is that 357 and 143 are the lengths of the arrays, so the 144th row of that array does not exist...


2 Answers


Here is an algorithm I think might work for you.

You divide test_length and train_length by their GCD to get the ratio as a simple fraction. You add the numerator and denominator together, and that is the size factor for your groups.

For example, if the ratio is 3:2, then the size of every group must be a multiple of 5.
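As a quick sketch of that step (the 200/300 lengths are made up for illustration):

```python
from math import gcd  # fractions.gcd on Python 2

test_length, train_length = 200, 300
divisor = gcd(test_length, train_length)         # 100
test_multiple = test_length // divisor           # 2
train_multiple = train_length // divisor         # 3
total_multiple = test_multiple + train_multiple
print(total_multiple)  # → 5, so every group size must be a multiple of 5
```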

Then you divide the total length by the number of folds to get the ideal size of the first group, which will most likely be a floating point number. You find the largest multiple of 5 less than or equal to that, and that is your first group.

Subtract that from your total, then divide by folds-1 to get the ideal size of the next group. Again find the largest multiple of 5, subtract it from the total, and keep going until you have calculated all the groups.

Some example code:

from fractions import gcd  # use math.gcd on Python 3

total_length = test_length + train_length
divisor = gcd(test_length,train_length)
test_multiple = test_length/divisor
train_multiple = train_length/divisor
total_multiple = test_multiple + train_multiple 

# Adjust the ratio if there isn't enough data for the requested folds
if total_length/total_multiple < folds:
  total_multiple = total_length/folds
  test_multiple = int(round(float(test_length)*total_multiple/total_length))
  train_multiple = total_multiple - test_multiple

groups = []
for i in range(folds,0,-1):
  float_size = float(total_length)/i
  int_size = int(float_size/total_multiple)*total_multiple
  test_size = int_size*test_multiple/total_multiple
  train_size = int_size*train_multiple/total_multiple
  test_length -= test_size    # keep track of the test data used
  train_length -= train_size  # keep track of the train data used
  total_length -= int_size
  groups.append((test_size,train_size))

# If the test_length or train_length are negative, we need to adjust the groups
# to "give back" some of the data.
distribute_overrun(groups,test_length,0)
distribute_overrun(groups,train_length,1)

This has been updated to keep track of the size used by each group (test and train), without worrying if we initially use too much.

Then at the end, if there is any overrun (i.e. test_length or train_length has gone negative), we distribute that overrun back into the groups by decrementing the appropriate side of the ratio in as many entries as necessary to bring the overrun back to zero.

The distribute_overrun function is included below.

def distribute_overrun(groups, overrun, part):
    i = 0
    while overrun < 0:
        group = list(groups[i])
        group[part] -= 1
        groups[i] = tuple(group)
        overrun += 1
        i += 1

Finally, groups will be a list of tuples containing the test_size and train_size for each group.
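Putting the pieces together, here is one way the snippets above could be wrapped into a single function and run on the 357/143, 9-fold case from the question. This is a Python 3 sketch (math.gcd in place of fractions.gcd, and // making the Python 2 integer divisions explicit); split_groups is just a hypothetical name for the wrapper:

```python
from math import gcd

def distribute_overrun(groups, overrun, part):
    # Decrement one side of the ratio in successive groups
    # until the overrun is back up to zero.
    i = 0
    while overrun < 0:
        group = list(groups[i])
        group[part] -= 1
        groups[i] = tuple(group)
        overrun += 1
        i += 1

def split_groups(test_length, train_length, folds):
    total_length = test_length + train_length
    divisor = gcd(test_length, train_length)
    test_multiple = test_length // divisor
    train_multiple = train_length // divisor
    total_multiple = test_multiple + train_multiple

    # Adjust the ratio if there isn't enough data for the requested folds
    if total_length // total_multiple < folds:
        total_multiple = total_length // folds
        test_multiple = int(round(test_length * total_multiple / total_length))
        train_multiple = total_multiple - test_multiple

    groups = []
    for i in range(folds, 0, -1):
        float_size = total_length / i
        int_size = int(float_size / total_multiple) * total_multiple
        test_size = int_size * test_multiple // total_multiple
        train_size = int_size * train_multiple // total_multiple
        test_length -= test_size    # keep track of the test data used
        train_length -= train_size  # keep track of the train data used
        total_length -= int_size
        groups.append((test_size, train_size))

    # Give back any rows that were claimed but don't actually exist
    distribute_overrun(groups, test_length, 0)
    distribute_overrun(groups, train_length, 1)
    return groups

# The case from the question: 143 test rows, 357 train rows, 9 folds
print(split_groups(143, 357, 9))
# → [(15, 39)] followed by eight (16, 39) folds: 143 test rows, 351 train rows
```

Note that the first fold gives back the one-row overrun the question ran into, so the test column now sums to exactly 143 rather than 144.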

If this sounds like the sort of thing you want, but you need me to expand on the code sample, just let me know.

Answered 2013-05-10T05:07:20.513

In another question, the author wanted to do a cross-validation similar to yours; take a look at this answer. Applied to your problem, the answer looks like:

import numpy as np
# in train_data the first line is used for the cross-validation,
# and the other lines will follow, so you can add as many lines as you want
test_data = np.array([ 0.,  1.,  2.,  3.,  4.,  5.])
train_data = np.array([[ 0.09,  1.9,  1.1,  1.5,  4.2,  3.1,  5.1],
                       [    3,    4,  3.1,   10,   20,    2,    3]])

def cross_validation_group(test_data, train_data):
    om1, om2 = np.meshgrid(test_data, train_data[0])
    dist = (om1 - om2)**2
    indexes = np.argsort(dist, axis=0)
    return train_data[:, indexes[0]]

print cross_validation_group(test_data, train_data)
# array([[  0.09,   1.1 ,   1.9 ,   3.1 ,   4.2 ,   5.1 ],
#        [     3 ,  3.1 ,     4 ,     2 ,    20 ,     3 ]])

You will end up with the values of train_data corresponding to the intervals defined in test_data.
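For reference, here is the same lookup as a self-contained Python 3 run (only the print changes); each test value selects the train_data column whose first row is nearest:

```python
import numpy as np

def cross_validation_group(test_data, train_data):
    # For each test value, find the column of train_data whose first
    # row is closest, then return train_data reordered to match.
    om1, om2 = np.meshgrid(test_data, train_data[0])
    dist = (om1 - om2) ** 2
    indexes = np.argsort(dist, axis=0)
    return train_data[:, indexes[0]]

test_data = np.array([0., 1., 2., 3., 4., 5.])
train_data = np.array([[0.09, 1.9, 1.1, 1.5, 4.2, 3.1, 5.1],
                       [3, 4, 3.1, 10, 20, 2, 3]])
result = cross_validation_group(test_data, train_data)
print(result[0].tolist())  # → [0.09, 1.1, 1.9, 3.1, 4.2, 5.1]
```

This reproduces the output shown in the comment of the answer's code block.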

Answered 2013-05-03T10:15:45.183