1
  • 我有一个包含 200 万个条目的 Pandas DataFrame
  • 每个条目是 100 维空间中的一个点

在此处输入图像描述

  • 我想计算最后 N 个点与所有其他点之间的欧几里得距离以找到最近的邻居(为了简化,假设为最后 5 个点找到 top#1 最近的邻居)
  • 我已经为一个小数据集完成了下面的代码,但是它相当慢,我正在寻找改进的想法(尤其是速度改进!)

逻辑如下:

  • 在我们想要找到最近邻居的 目标之间拆分数据帧并进行比较:我们将在其中寻找邻居的所有其他目标
  • 遍历目标
  • 计算每个 df_compare 点 VS 目标的平方欧几里得距离
  • 选择比较 df 的 top#1 值并将其 ID 保存在目标数据框中
import pandas as pd
import numpy as np

data = {'Name': ['Ly','Gr','Er','Ca','Cy','Sc','Cr','Cn','Le','Cs','An','Ta','Sa','Ly','Az','Sx','Ud','Lr','Si','Au','Co','Ck','Mj','wa'],
        'dim0': [33,-9,18,-50,39,-23,-19,89,-74,81,8,23,-63,-62,-14,45,39,-46,74,19,7,97,-29,71,],
        'dim1': [-7,75,77,-93,-89,4,-96,-64,41,-27,-87,23,-69,-77,-92,18,21,27,-76,-57,-44,20,15,-76,],
        'dim2': [-31,54,-14,-93,72,-14,65,44,-88,19,48,-51,-25,36,-46,98,8,0,53,-47,-29,95,65,-3,],
        'dim3': [-12,-86,10,93,-79,-55,-6,-79,-12,66,-81,-14,44,84,9,-19,-69,29,-50,-59,35,-28,90,-73,],
        }
df = pd.DataFrame(data)

df_target = df.tail(5)
df_target['closest_neighbour'] = np.nan
df_compare= df.drop(df.tail(5).index)

for i, target_row in df_target.iterrows():  
    df_compare['distance'] = 0
    for dim in df_target.columns:
        if dim.startswith('dim'):
            df_compare['distance'] = df_compare['distance'] + (target_row[dim] - df_compare[dim])**2

    df_compare.sort_values(by=['distance'], ascending=True, inplace=True)
    closest_neighbor=df_compare.head(1)
    df_target.loc[df_target.index==i,'closest_neighbour']= closest_neighbor['Name'].iloc[0]

print(df_target)

欢迎任何改进逻辑或代码的建议!干杯

4

4 回答 4

3

构造数据框,drop_duplicates('Name')因为我发现至少Ly出现两次。

import pandas as pd
import numpy as np

data = {'Name': ['Ly','Gr','Er','Ca','Cy','Sc','Cr','Cn','Le','Cs','An','Ta','Sa','Ly','Az','Sx','Ud','Lr','Si','Au','Co','Ck','Mj','wa'],
        'dim0': [33,-9,18,-50,39,-23,-19,89,-74,81,8,23,-63,-62,-14,45,39,-46,74,19,7,97,-29,71,],
        'dim1': [-7,75,77,-93,-89,4,-96,-64,41,-27,-87,23,-69,-77,-92,18,21,27,-76,-57,-44,20,15,-76,],
        'dim2': [-31,54,-14,-93,72,-14,65,44,-88,19,48,-51,-25,36,-46,98,8,0,53,-47,-29,95,65,-3,],
        'dim3': [-12,-86,10,93,-79,-55,-6,-79,-12,66,-81,-14,44,84,9,-19,-69,29,-50,-59,35,-28,90,-73,],
        }
df = pd.DataFrame(data).drop_duplicates('Name')

df_target = df.tail(5)
df_compare= df.drop(df.tail(5).index)

一种方法是apply逐行操作

df_target = df_target.set_index('Name')
df_compare = df_compare.set_index('Name')

df_compare.apply(lambda r: ((r - df_target)**2).sum(axis=1), axis=1).idxmin()

结果:

Name
Au    Ly
Co    Az
Ck    Sx
Mj    Lr
wa    Cn
dtype: object

另一个,这是我的偏好,是在 numpy 数组中进行:

pd.DataFrame(
    ((df_target.values[:, np.newaxis] - df_compare.values)**2).sum(axis=2),
    columns = df_compare.index,
    index = df_target.index,
).idxmin(axis=1)

这将产生相同的结果

于 2022-02-19T12:49:24.060 回答
2

如果您从数据框中获取 numpy 数组,则可以使用 Numba 快速计算它们:

from timeit import default_timer as timer
from numba import jit

import math
import numpy as np

data = {'Name': [""]*1000,
        'dim0': [0]*1000,
        'dim1': [0]*1000,
        'dim2': [0]*1000,
        'dim3': [0]*1000,
        }

# add rest of dims
for i in range(96):
    data["dim"+str(i+4)]=[0]*1000

dims=len(data)-1
N=len(data["Name"])
print(dims)
print(N)


def distance(selected,selected2):
    dif=selected-selected2
    dif*=dif
    dist=0.0
    for val in dif:
        dist += val
    return math.sqrt(dist)

@jit(nopython=True)
def distanceNumba(selected,selected2):
    dif=selected-selected2
    dif*=dif
    dist=0.0
    for val in dif:
        dist += val
    return math.sqrt(dist)

cache=[];

for i in range(N):
    cache.append(np.zeros(dims))

cache=np.asarray(cache)

for i in range(dims):
    for j in range(N):
        cache[j][i]=data["dim"+str(i)][j]

selected=cache[0]

out = np.zeros(N)


def distances(selected,cache,out):
    i=0
    for selected2 in cache:
        out[i]=distance(selected,selected2)
        i=i+1

@jit(nopython=True)
def distancesNumba(selected,cache,out):
    i=0
    for selected2 in cache:
        out[i]=distanceNumba(selected,selected2)
        i=i+1

for i in range(10):
    t1 = timer()
    distances(selected,cache,out)
    t2 = timer()
    print(t2-t1," seconds without numba")

    t1 = timer()
    distancesNumba(selected,cache,out)
    t2 = timer()
    print(t2-t1," seconds with numba")

print(out)

输出:

100
1000
0.05954742600079044  seconds without numba
1.2770428250005352  seconds with numba
0.0608673019996786  seconds without numba
0.0009444430033909157  seconds with numba
0.060780961997807026  seconds without numba
0.0009288470027968287  seconds with numba
0.06079873300041072  seconds without numba
0.0009262589992431458  seconds with numba
0.06075674100065953  seconds without numba
0.0009228080016328022  seconds with numba
0.06063023499882547  seconds without numba
0.0009203600020555314  seconds with numba
0.0607308569997258  seconds without numba
0.0009688009995443281  seconds with numba
0.06056219499805593  seconds without numba
0.000959154000156559  seconds with numba
0.060522257001139224  seconds without numba
0.0009322579971922096  seconds with numba
0.06055161300173495  seconds without numba
0.000933799001359148  seconds with numba

如果你有 CUDA 显卡,那么你也可以使用它:

from timeit import default_timer as timer

from numba import jit
from numba import cuda
import numba
import math
import numpy as np

data = {'Name': [""]*1000,
        'dim0': [0]*1000,
        'dim1': [0]*1000,
        'dim2': [0]*1000,
        'dim3': [0]*1000,
        }

# add rest of dims
for i in range(96):
    data["dim"+str(i+4)]=[0]*1000

dims=len(data)-1
N=len(data["Name"])
print(dims)
print(N)


def distance(selected,selected2):
    dif=selected-selected2
    dif*=dif
    dist=0.0
    for val in dif:
        dist += val
    return math.sqrt(dist)

cache=[];

for i in range(N):
    cache.append(np.zeros(dims))

cache=np.asarray(cache,dtype=np.float32)

for i in range(dims):
    for j in range(N):
        cache[j][i]=data["dim"+str(i)][j]

selected=cache[0]

out = np.zeros(N)


def distances(selected,cache,out):
    i=0
    for selected2 in cache:
        out[i]=distance(selected,selected2)
        i=i+1

@cuda.jit
def distancesNumba(selected,cache,out):
    i=0
    for selected2 in cache:
        dif=cuda.local.array(shape=dims, dtype=numba.float32) 
        dif[cuda.threadIdx.x]=selected[cuda.threadIdx.x]-selected2[cuda.threadIdx.x]
        dif[cuda.threadIdx.x]*=dif[cuda.threadIdx.x]
        dist=0.0
        numba.cuda.syncthreads()
        if cuda.threadIdx.x==0:
            for j in range(dims):
                dist += dif[j]
            out[i]=math.sqrt(dist)
            i=i+1
        
        

for i in range(10):
    t1 = timer()
    distances(selected,cache,out)
    t2 = timer()
    print(t2-t1," seconds without numba")

    t1 = timer()
    # 1=blocks, 100=threads
    distancesNumba[1,100](selected,cache,out)
    t2 = timer()
    print(t2-t1," seconds with numba")

print(out)

输出:

100
1000
0.09203408300163574  seconds without numba
1.3188036539977475  seconds with numba
0.09138548299961258  seconds without numba
0.00905935400078306  seconds with numba
0.09278915700269863  seconds without numba
0.011017041997547494  seconds with numba
0.09266194800147787  seconds without numba
0.009332213001471246  seconds with numba
0.09425603599811438  seconds without numba
0.008965768000052776  seconds with numba
0.09545346899903961  seconds without numba
0.009094572000321932  seconds with numba
0.09400588900098228  seconds without numba
0.008994818999781273  seconds with numba
0.09595919099956518  seconds without numba
0.009412939001776977  seconds with numba
0.09670166400246671  seconds without numba
0.009072380998986773  seconds with numba
0.09463289600171265  seconds without numba
0.009022546000778675  seconds with numba

此性能仅由 100 个 CUDA 线程实现,并且没有对距离进行缩减优化(在计算平方差后只需通过 1 个 cuda 线程添加 100 个值)。您可以调整内核的线程数,通过减少距离求和进行优化并获得更高的性能。

然后,您可以简单地对“out”数组进行排序以获得最短/最长距离。

编辑:优化更多 CUDA 线程(每行 1 个块,每块暗淡线程)+ 减少内核

@cuda.jit
def distancesNumba(selected,cache,out):
    i=0
    selected2=cuda.blockIdx.x
    dif=cuda.local.array(shape=dims, dtype=numba.float32) 
    dif[cuda.threadIdx.x]=selected[cuda.threadIdx.x]-cache[selected2][cuda.threadIdx.x]
    dif[cuda.threadIdx.x]*=dif[cuda.threadIdx.x]
    dist=0.0
    numba.cuda.syncthreads()

    reduction=128
    while reduction>0:
        if reduction+cuda.threadIdx.x < dims:
            dif[reduction] += dif[reduction+cuda.threadIdx.x]
        numba.cuda.syncthreads()
        reduction>>=2
    if cuda.threadIdx.x==0:
        out[selected2]=math.sqrt(dif[0])

并通过以下方式启动它:

# compute distances of selected line to N lines (dims dimensional)
distancesNumba[N,dims](selected,cache,out)

性能变为:

100 <-- dimensions
100000 <-- N
8.610873253001046  seconds without numba
1.8519197440000426  seconds with numba
8.927837033999822  seconds without numba
0.20861789699847577  seconds with numba
8.859914688000572  seconds without numba
0.20821334699940053  seconds with numba
8.980341544000112  seconds without numba
0.21734917999856407  seconds with numba
8.818792854999629  seconds without numba
0.22718691900081467  seconds with numba
9.003342153999256  seconds without numba
0.2289118230000895  seconds with numba
9.02532176999739  seconds without numba
0.20479743699979736  seconds with numba
8.9248614370008  seconds without numba
0.21945642799983034  seconds with numba
    

对于 100 万行和 100 个维度:

100
1000000
81.72404483399805  seconds without numba
8.821669873999781  seconds with numba
86.97093956900062  seconds without numba
2.0208445850003045  seconds with numba
84.32691005200104  seconds without numba
2.0129999929995392  seconds with numba

它的性能只有 40 倍,因为它只对所有线路进行一次距离检查。每条线增加 4-5 个距离来检查会更有效。您还可以进行“平铺”优化,以使用 GPU 的 L1 缓存以获得更高的性能。

于 2022-02-19T14:28:01.857 回答
1

更多 Effienet 算法

您可以使用树方法,而不是使用蛮力

树方法的优点(例如 BallTree、KDTree 等)

对于 Target and Compare 中的 N 个点

  • 蛮力复杂度为 N*N
  • 树方法复杂度为 N*log(N)
  • 这使得 Tree 方法明显快于大 N(例如 2M)的蛮力

代码

import pandas as pd
import numpy as np

from sklearn.neighbors import BallTree

# Data
df_target = df.tail(5)
df_compare= df.drop(df.tail(5).index)

# Create Distance Tree with Euclidean distance metric  (skipping Name column)
tree = BallTree(df_compare[df_compare.columns[1:]].values, metric = "euclidean")

# Find nearest neighbor for each node in target (skipping Name column)
distances, indices = tree.query(df_target[df_target.columns[1:]].values, k = 1)

# Show closest name in compare for each name in target
for target, compare in zip(df_target['Name'], df_compare.iloc[indices.flatten()]['Name']):
    print(f"{target}  {compare}")

输出

Au  Ly
Co  Az
Ck  Sx
Mj  Lr
wa  Cn
于 2022-02-19T14:04:49.660 回答
1

您的数据框将列存储为numpy数组。但是对于您的计算,将昏暗的n 作为 numpy 数组可能会更有效,因为距离计算可以利用numpy数组操作。

于 2022-02-19T12:41:19.587 回答