I want to compute a matching, or a marriage. I have, say, a basket of men in table2 and a basket of women in table1. For each row of table1, I want to evaluate a distance between that particular individual idx1 and all rows of table2. Then I select the argmin idx2 and record somewhere that idx1 is matched with idx2. The algorithm is not finished at that point, because I also want to remove idx2 from the men basket (table2) before moving on to the next idx1. The distance is a function of variables from table1 and table2, typically score_str = '(table2[age] - table1[age][idx1])^2 + (table2[diploma] - table1[diploma][idx1])^2' (table1[varname][idx1] becomes temp[varname] in the code below).
Since I use pandas, I wrote the following code, but it takes 15 seconds to match around 2,000 men with 2,000 women. I'm not sure pandas is an advantage here; I may have to change tools. Computing time matters in this case because I will eventually match much bigger databases (around 2 million rows).
Edit: the first comments are right, it's a quadratic algorithm, so it will take time whatever I do, and size = 2,000,000 will surely remain a dream. A later stage will be to split the big dataset into smaller chunks (that split will be done from an economist's point of view; a rough sketch of the idea follows). The faster the algorithm, the bigger the chunks can be, so improving computing time is still important to me.
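Here is a minimal sketch of that chunking stage, assuming both tables are numpy arrays and each chunk of women is matched against a chunk of men of the same size; match_by_chunks and match_chunk are my own hypothetical names, and the split here is by row order only, not by the economic criteria I actually have in mind:

def match_by_chunks(table1, table2, chunk_size, match_chunk):
    # match_chunk(t1, t2) is any function implementing the greedy
    # matching below on one pair of chunks (hypothetical callback)
    matches = []
    for start in range(0, len(table1), chunk_size):
        t1 = table1[start:start + chunk_size]
        t2 = table2[start:start + chunk_size]
        matches.append(match_chunk(t1, t2))
    return matches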
Thanks for any help.
import pandas as pd
import numpy as np

size = 5000
score_str = "(table2['age'] - temp['age'])**2 + 5*(table2['diploma'] - temp['diploma'])"
table2 = pd.DataFrame(np.random.randn(size, 2), columns=['age', 'diploma'])
table1 = pd.DataFrame(np.random.randn(size, 2), columns=['age', 'diploma'])
match = pd.Series(0, index=table1.index)
index2 = pd.Series(True, index=table2.index)  # True = still in the basket
k_max = min(len(table2), len(table1))

def matching():
    for k in range(k_max):
        temp = table1.iloc[k]
        score = eval(score_str)[index2]  # distance from woman k to each remaining man
        idx2 = score.idxmin()            # closest man
        match.iloc[k] = idx2
        index2[idx2] = False             # remove him from the basket
    return match

matching()
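As an aside, calling eval() at every iteration has noticeable overhead. A minimal alternative, assuming the score formula is known ahead of time (score_fn is my own name, not in the code above), is to write it once as a plain function:

def score_fn(table2, temp):
    # same formula as score_str, evaluated without eval()
    return (table2['age'] - temp['age'])**2 + 5*(table2['diploma'] - temp['diploma'])

Inside the loop, score = score_fn(table2, temp)[index2] then replaces the eval call.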
Edit: rather than following RussW's idea, I translated my code from pandas to numpy. It's a first small step toward a lower-level language, isn't it? That way my simulation runs four times faster. With n = 2,000,000 the computation lasts seven hours. In my world (microeconomics) that looks like a reasonable time.
import time
import numpy as np

def run_time_np(n):
    table2 = np.random.randint(0, 100, [n, 2])
    table1 = np.random.randint(0, 100, [n, 2])
    # keep each man's original index in a third column, so it can be
    # reported even after rows have been deleted from table2
    idx2 = np.array([np.arange(n)])
    table2 = np.concatenate((table2, idx2.T), axis=1)
    match = np.empty(n, dtype=int)
    k_max = min(len(table2), len(table1))
    score_str = "(table2[:,0] - temp[0])**2 + 5*(table2[:,1] - temp[1])"
    start = time.perf_counter()
    for k in range(k_max):
        temp = table1[k]
        score = eval(score_str)             # distance to every remaining man
        idx = score.argmin()                # position of the closest one
        match[k] = table2[idx, 2]           # his original index
        table2 = np.delete(table2, idx, 0)  # remove him from the basket
    print('size:', n, '; computing time:', time.perf_counter() - start)
    return match
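A quick usage example on a small sample:

match = run_time_np(2000)

If memory allows, one more variant might be worth timing: precompute the whole score matrix with broadcasting and mask used rows instead of calling np.delete (which copies the array at every step). The matrix is O(n^2) in memory, so it will never reach n = 2,000,000; greedy_match_dense is my own untested sketch, not part of the timings above:

import numpy as np

def greedy_match_dense(table1, table2):
    # (n1, n2) matrix of all scores, computed once via broadcasting
    score = ((table2[None, :, 0] - table1[:, None, 0])**2
             + 5 * (table2[None, :, 1] - table1[:, None, 1]))
    available = np.ones(len(table2), dtype=bool)  # True = still in the basket
    match = np.empty(min(len(table1), len(table2)), dtype=int)
    for k in range(len(match)):
        row = np.where(available, score[k], np.inf)  # hide men already taken
        idx = row.argmin()
        match[k] = idx
        available[idx] = False
    return match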