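- I have a Pandas DataFrame with 2 million entries
- Each entry is a point in a 100-dimensional space
- I want to compute the Euclidean distance between the last N points and all the other points to find the nearest neighbours (to keep things simple, say I want the top-1 nearest neighbour for each of the last 5 points)
- I have written the code below for a small dataset, but it is rather slow and I am looking for ideas for improvement (especially speed improvements!)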
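The logic is as follows:
- Split the DataFrame into targets (the points whose nearest neighbour we want to find) and compare (all the other points, among which we look for the neighbour)
- Iterate over the targets
- Compute the squared Euclidean distance between each point in df_compare and the current target
- Take the top-1 row of the compare DataFrame and save its ID in the target DataFrame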
import pandas as pd
import numpy as np

data = {'Name': ['Ly','Gr','Er','Ca','Cy','Sc','Cr','Cn','Le','Cs','An','Ta','Sa','Ly','Az','Sx','Ud','Lr','Si','Au','Co','Ck','Mj','wa'],
        'dim0': [33,-9,18,-50,39,-23,-19,89,-74,81,8,23,-63,-62,-14,45,39,-46,74,19,7,97,-29,71,],
        'dim1': [-7,75,77,-93,-89,4,-96,-64,41,-27,-87,23,-69,-77,-92,18,21,27,-76,-57,-44,20,15,-76,],
        'dim2': [-31,54,-14,-93,72,-14,65,44,-88,19,48,-51,-25,36,-46,98,8,0,53,-47,-29,95,65,-3,],
        'dim3': [-12,-86,10,93,-79,-55,-6,-79,-12,66,-81,-14,44,84,9,-19,-69,29,-50,-59,35,-28,90,-73,],
        }
df = pd.DataFrame(data)

# The last 5 rows are the targets; .copy() avoids SettingWithCopyWarning
df_target = df.tail(5).copy()
df_target['closest_neighbour'] = None  # object column that will hold a Name
# Every other row is a candidate neighbour
df_compare = df.drop(df.tail(5).index).copy()

for i, target_row in df_target.iterrows():
    # Accumulate the squared Euclidean distance one dimension at a time
    df_compare['distance'] = 0
    for dim in df_target.columns:
        if dim.startswith('dim'):
            df_compare['distance'] = df_compare['distance'] + (target_row[dim] - df_compare[dim])**2
    # Sort so that the closest candidate comes first, then record its Name
    df_compare.sort_values(by=['distance'], ascending=True, inplace=True)
    closest_neighbor = df_compare.head(1)
    df_target.loc[df_target.index==i, 'closest_neighbour'] = closest_neighbor['Name'].iloc[0]

print(df_target)
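
One direction that might be worth trying is to drop the per-row and per-column loops and let NumPy broadcasting compute all squared distances at once. The following is only a rough sketch under assumptions: the function name `nearest_neighbours`, its `n_targets` and `chunk_size` parameters, and the extra `distance` column are hypothetical and not part of the code above. It processes the candidates in chunks so that the intermediate array stays small, and it uses `argmin` rather than a full sort because only the top-1 neighbour is needed. It has not been benchmarked on the 2-million-row data.

```python
import numpy as np
import pandas as pd


def nearest_neighbours(df, n_targets=5, chunk_size=10_000):
    """Hypothetical helper: for each of the last `n_targets` rows (n_targets >= 1),
    find the Name of the closest row among all earlier rows."""
    dim_cols = [c for c in df.columns if c.startswith('dim')]
    points = df[dim_cols].to_numpy(dtype=np.float64)
    names = df['Name'].to_numpy()

    targets = points[-n_targets:]      # shape (n_targets, n_dims)
    candidates = points[:-n_targets]   # shape (n_candidates, n_dims)

    best_dist = np.full(n_targets, np.inf)
    best_idx = np.zeros(n_targets, dtype=np.int64)

    # Walk over the candidates in chunks to bound memory use
    for start in range(0, len(candidates), chunk_size):
        chunk = candidates[start:start + chunk_size]
        # Broadcasting: (n_targets, 1, n_dims) - (1, chunk_len, n_dims)
        diff = targets[:, None, :] - chunk[None, :, :]
        # Sum of squares over the last axis gives squared distances
        dist = np.einsum('ijk,ijk->ij', diff, diff)
        local_idx = dist.argmin(axis=1)
        local_dist = dist[np.arange(n_targets), local_idx]
        # Keep a chunk's best candidate only if it beats the best seen so far
        improved = local_dist < best_dist
        best_dist[improved] = local_dist[improved]
        best_idx[improved] = start + local_idx[improved]

    result = df.tail(n_targets).copy()
    result['closest_neighbour'] = names[best_idx]
    result['distance'] = best_dist     # squared Euclidean distance
    return result


# Small demo on random data (made up purely for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(-100, 100, size=(1000, 4)),
                    columns=[f'dim{i}' for i in range(4)])
demo.insert(0, 'Name', [f'p{i}' for i in range(len(demo))])
print(nearest_neighbours(demo, n_targets=5, chunk_size=256))
```

With 5 targets, 100 dimensions and the default `chunk_size`, the `diff` array holds about 5 million floats (roughly 40 MB), so the chunk size could be tuned to the available memory. Ranking by squared distance gives the same nearest neighbour as ranking by the true distance, so no square root is taken.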
Any suggestions for improving the logic or the code are welcome! Cheers