python - Modin vs. threading for a pandas DataFrame

Question

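I have a DataFrame with 343,500 records and a predefined get_zipcode function.

To speed things up, I split the data into four parts and used the threading module to run apply on each part in its own thread: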

import threading

import numpy as np

# Tag each row with a subsection number 1..4 (cycled) so the work splits four ways.
df['subsections'] = np.resize([1, 2, 3, 4], len(df))

def apply_zipcode(section, results):
    # Look up the zip code for every EMPTY row in this subsection.
    rows = df.loc[(df['EMPTY'] == True) & (df['subsections'] == section)]
    results[section] = rows.apply(lambda x: get_zipcode(lat=x['LATITUDE'], lon=x['LONGITUDE']), axis=1)

if __name__ == '__main__':
    results = {}
    # target must be a callable; passing the result of .apply(...) would run it
    # eagerly in the main thread before any thread starts.
    threads = [threading.Thread(target=apply_zipcode, args=(s, results)) for s in (1, 2, 3, 4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

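This seems to work fairly well. However, I have since found the modin module, which (from my understanding of the documentation) also uses multithreading.

In this case, where I am essentially applying a function across the entire DataFrame, is there an advantage to using threading vs. modin?

In a broader sense, is there any benefit to not using modin, according to the documentation?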
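
For comparison, the four-way split described in the question can be sketched with a thread pool from the standard library. This is only a minimal sketch under stated assumptions: get_zipcode here is a hypothetical stand-in for the question's predefined function, and a plain list of (lat, lon) pairs replaces the DataFrame so the example stays self-contained. Whether threads give any speedup depends on what get_zipcode does; in CPython they mostly help when the function waits on I/O rather than computing.

```python
from concurrent.futures import ThreadPoolExecutor


def get_zipcode(lat, lon):
    # Hypothetical stand-in for the question's predefined function.
    return f"{lat:.1f},{lon:.1f}"


def lookup_chunk(chunk):
    # Runs inside a worker thread: one lookup per (lat, lon) pair.
    return [get_zipcode(lat=lat, lon=lon) for lat, lon in chunk]


def parallel_zipcodes(coords, n_chunks=4):
    # Interleaved split, mirroring np.resize([1, 2, 3, 4], len(df)).
    chunks = [coords[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # map() receives the function itself, not the result of calling it.
        chunk_results = list(pool.map(lookup_chunk, chunks))
    # Undo the interleaving so results line up with the input order.
    results = [None] * len(coords)
    for i, chunk_result in enumerate(chunk_results):
        results[i::n_chunks] = chunk_result
    return results


if __name__ == "__main__":
    coords = [(40.0 + i, -75.0 - i) for i in range(10)]
    print(parallel_zipcodes(coords))
```

With a DataFrame, each chunk would be a slice of rows instead of a list slice, and the per-chunk results could be concatenated back into one Series.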
