I have some data that I want to split into smaller groups that all maintain a common ratio. I wrote a function that takes two arrays as input, calculates their size ratio, and then tells me how many groups I can split the data into (if all the groups are the same size). Here is the function:

    def cross_validation_group(train_data, test_data):
        import numpy as np
        from calculator import factors
        test_length = len(test_data)
        train_length = len(train_data)
        total_length = test_length + train_length
        # Fraction of all rows that belong to the test set
        ratio = test_length/float(total_length)
        possibilities = factors(total_length)
        print possibilities
        print possibilities[len(possibilities)-1] * ratio
        super_count = 0
        # Report every group size that keeps the ratio with whole-number sizes
        for i in possibilities:
            if i < len(possibilities)/2:
                pass
            else:
                attempt = float(i * ratio)
                if attempt.is_integer():
                    print str(i) + " is an option for total size with " + str(attempt) + " as test size and " + str(i - attempt) + " as train size! This is with " + str(total_length/i) + " folds."
                else:
                    pass
        folds = int(raw_input("So how many folds would you like to use? If no possibilities were given that would be sufficient, type 0: "))
        if folds != 0:
            total_size = total_length/folds
            test_size = float(total_size * ratio)
            train_size = total_size - test_size
            columns = train_data[0]
            columns = len(columns)
            groups = np.empty((folds,(test_size + train_size),columns))
            i = 0  # running row index across all folds
            a = 0  # next unused row of test_data
            b = 0  # next unused row of train_data
            # Fill each fold with its test rows first, then its train rows
            for j in range(0, folds):
                test_size_new = test_size * (j + 1)
                train_size_new = train_size * j
                total_size_new = (train_size + test_size) * (j + 1)
                cut_off = total_size_new - train_size
                p = 0  # row index inside the current fold
                while i < total_size_new:
                    if i < cut_off:
                        groups[j,p] = test_data[a]
                        a += 1
                    else:
                        groups[j,p] = train_data[b]
                        b += 1
                    i += 1
                    p += 1
            return groups
        else:
            print "This method cannot be used because the ratio cannot be maintained with equal group sizes other than for the options you were given"

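To show why the equal-size approach is so restrictive, here is a minimal sketch of just the "which group sizes keep the ratio" step. It is written for Python 3, uses a hypothetical `divisors` helper as a stand-in for my `factors` import, checks the condition with integer arithmetic instead of `is_integer()`, and leaves out the `len(possibilities)/2` filter, so treat it as an illustration rather than the function above.

```python
def divisors(n):
    """Return all positive divisors of n in ascending order.

    Hypothetical stand-in for `factors` from my `calculator` module.
    """
    return [d for d in range(1, n + 1) if n % d == 0]


def equal_group_options(train_length, test_length):
    """List (group_size, test_size, train_size, folds) options that keep the ratio exactly."""
    total_length = train_length + test_length
    options = []
    for group_size in divisors(total_length):
        # group_size * test_length / total_length must be a whole number;
        # integer arithmetic avoids floating-point rounding surprises.
        if (group_size * test_length) % total_length == 0:
            test_size = group_size * test_length // total_length
            options.append((group_size, test_size, group_size - test_size,
                            total_length // group_size))
    return options


print(equal_group_options(357, 143))  # [(500, 143, 357, 1)]
print(len(equal_group_options(300, 100)))  # 9
```

With lengths 357 and 143 the only exact option is a single group of all 500 rows, because 143 and 500 share no common factor. That is the situation where I need groups of varying sizes instead.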
So my question is: how can I add a third input to the function, the number of folds, and change the function so that, rather than iterating to make sure each group has the same size with the right ratio, each group just has the right ratio but the group sizes can vary?
Addition for @JamesHolderness
Your method is almost perfect, but there is one issue.
With lengths 357 and 143 and 9 folds, this is the returned list:
[(39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16)]
When you add up the columns, you get 351 and 144.
The 351 is fine because it's less than 357, but the 144 doesn't work because it is greater than 143! 357 and 143 are the lengths of the arrays, so the 144th row of the second array does not exist...
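To make the problem concrete, here is a small check (Python 3, with the lengths and the list copied from above), followed by a rough sketch of one way the overshoot might be avoided: compute cumulative cut points with integer division, so the per-fold sizes always add up to exactly the array lengths. `split_evenly` and `fold_sizes` are hypothetical names of mine and not part of your method; this is only a sketch under that assumption.

```python
def split_evenly(length, folds):
    """Split `length` rows into `folds` chunk sizes that differ by at most 1."""
    # Cut points k * length // folds never decrease and end exactly at `length`,
    # so the chunk sizes cannot add up to more rows than exist.
    cuts = [k * length // folds for k in range(folds + 1)]
    return [cuts[k + 1] - cuts[k] for k in range(folds)]


def fold_sizes(train_length, test_length, folds):
    """Return one (train_size, test_size) pair per fold."""
    return list(zip(split_evenly(train_length, folds),
                    split_evenly(test_length, folds)))


# The list from above: the second column overshoots the 143 available rows.
returned = [(39, 16)] * 9
print(sum(a for a, _ in returned), sum(b for _, b in returned))  # 351 144

# Sizes from the cumulative cut points use every row exactly once.
sizes = fold_sizes(357, 143, 9)
print(sizes)
print(sum(a for a, _ in sizes), sum(b for _, b in sizes))  # 357 143
```

The trade-off is that each fold's ratio is only approximately 143/500, since the sizes differ by one row here and there, but no fold ever asks for a row that does not exist.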