I want to compute a matching, or a marriage. I have, say, a basket of men in table2 and a basket of women in table1. For each row of table1, I want to evaluate a distance between that particular individual idx1 and all rows of table2. Then I select the argmin idx2 and record somewhere that idx1 is matched with idx2. The algorithm is not finished at that point, because I also want to remove idx2 from the men basket (table2) before moving on to the next idx1. The distance is a function of variables from table1 and table2, typically score_str = '(table2[age] - table1[age][idx1])^2 + (table2[diploma] - table1[diploma][idx1])^2' (table1[varname][idx1] becomes temp[varname] in the code below).
Since I use pandas, I wrote the following code, but it takes 15 seconds to match around 2,000 men with 2,000 women. I'm not sure pandas is an advantage here; I may have to change tools. Computing time matters in this case because I will eventually match much bigger databases (around 2 million rows).
Edit: the first comments are right, it's a quadratic algorithm, so it will take time whatever I do, and size = 2,000,000 will surely remain a dream. A later stage will be to split the big dataset into smaller chunks (that split will be done from an economist's point of view; a rough sketch of the idea follows). The faster the algorithm, the bigger the chunks can be, so improving computing time is still important to me.
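Here is a minimal sketch of that chunking stage, assuming both tables are numpy arrays and each chunk of women is matched against a chunk of men of the same size; match_by_chunks and match_chunk are my own hypothetical names, and the split here is by row order only, not by the economic criteria I actually have in mind:

def match_by_chunks(table1, table2, chunk_size, match_chunk):
    # match_chunk(t1, t2) is any function implementing the greedy
    # matching below on one pair of chunks (hypothetical callback)
    matches = []
    for start in range(0, len(table1), chunk_size):
        t1 = table1[start:start + chunk_size]
        t2 = table2[start:start + chunk_size]
        matches.append(match_chunk(t1, t2))
    return matches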
Thanks for any help.
import pandas as pd
import numpy as np

size = 5000
score_str = "(table2['age'] - temp['age'])**2 + 5*(table2['diploma'] - temp['diploma'])"
table2 = pd.DataFrame(np.random.randn(size, 2), columns=['age', 'diploma'])
table1 = pd.DataFrame(np.random.randn(size, 2), columns=['age', 'diploma'])
match = pd.Series(0, index=table1.index)
index2 = pd.Series(True, index=table2.index)  # True = still in the basket
k_max = min(len(table2), len(table1))

def matching():
    for k in range(k_max):
        temp = table1.iloc[k]
        score = eval(score_str)[index2]  # distance from woman k to each remaining man
        idx2 = score.idxmin()            # closest man
        match.iloc[k] = idx2
        index2[idx2] = False             # remove him from the basket
    return match

matching()
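As an aside, calling eval() at every iteration has noticeable overhead. A minimal alternative, assuming the score formula is known ahead of time (score_fn is my own name, not in the code above), is to write it once as a plain function:

def score_fn(table2, temp):
    # same formula as score_str, evaluated without eval()
    return (table2['age'] - temp['age'])**2 + 5*(table2['diploma'] - temp['diploma'])

Inside the loop, score = score_fn(table2, temp)[index2] then replaces the eval call.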
Edit: rather than following RussW's idea, I translated my code from pandas to numpy. It's a first small step toward a lower-level language, isn't it? That way my simulation runs four times faster. With n = 2,000,000 the computation lasts seven hours. In my world (microeconomics) that looks like a reasonable time.
import time
import numpy as np

def run_time_np(n):
    table2 = np.random.randint(0, 100, [n, 2])
    table1 = np.random.randint(0, 100, [n, 2])
    # keep each man's original index in a third column, so it can be
    # reported even after rows have been deleted from table2
    idx2 = np.array([np.arange(n)])
    table2 = np.concatenate((table2, idx2.T), axis=1)
    match = np.empty(n, dtype=int)
    k_max = min(len(table2), len(table1))
    score_str = "(table2[:,0] - temp[0])**2 + 5*(table2[:,1] - temp[1])"
    start = time.perf_counter()
    for k in range(k_max):
        temp = table1[k]
        score = eval(score_str)             # distance to every remaining man
        idx = score.argmin()                # position of the closest one
        match[k] = table2[idx, 2]           # his original index
        table2 = np.delete(table2, idx, 0)  # remove him from the basket
    print('size:', n, '; computing time:', time.perf_counter() - start)
    return match
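A quick usage example on a small sample:

match = run_time_np(2000)

If memory allows, one more variant might be worth timing: precompute the whole score matrix with broadcasting and mask used rows instead of calling np.delete (which copies the array at every step). The matrix is O(n^2) in memory, so it will never reach n = 2,000,000; greedy_match_dense is my own untested sketch, not part of the timings above:

import numpy as np

def greedy_match_dense(table1, table2):
    # (n1, n2) matrix of all scores, computed once via broadcasting
    score = ((table2[None, :, 0] - table1[:, None, 0])**2
             + 5 * (table2[None, :, 1] - table1[:, None, 1]))
    available = np.ones(len(table2), dtype=bool)  # True = still in the basket
    match = np.empty(min(len(table1), len(table2)), dtype=int)
    for k in range(len(match)):
        row = np.where(available, score[k], np.inf)  # hide men already taken
        idx = row.argmin()
        match[k] = idx
        available[idx] = False
    return match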