My data looks like this:
id1,id2,similarity
CHEMBL1,CHEMBL1,1
CHEMBL2,CHEMBL1,0.18
CHEMBL3,CHEMBL1,0.56
CHEMBL4,CHEMBL1,0.64
CHEMBL5,CHEMBL1,0.12
CHEMBL1,CHEMBL2,0.18
CHEMBL2,CHEMBL2,1
CHEMBL3,CHEMBL2,0.26
CHEMBL4,CHEMBL2,0.78
CHEMBL5,CHEMBL2,0.33
CHEMBL1,CHEMBL3,0.56
CHEMBL2,CHEMBL3,0.26
CHEMBL3,CHEMBL3,1
CHEMBL4,CHEMBL3,0.04
CHEMBL5,CHEMBL3,0.85
CHEMBL1,CHEMBL4,0.64
CHEMBL2,CHEMBL4,0.78
CHEMBL3,CHEMBL4,0.04
CHEMBL4,CHEMBL4,1
CHEMBL5,CHEMBL4,0.49
CHEMBL1,CHEMBL5,12
CHEMBL2,CHEMBL5,0.33
CHEMBL3,CHEMBL5,0.85
CHEMBL4,CHEMBL5,0.49
CHEMBL5,CHEMBL5,1
The full file is about 197 million rows (10 GB). My goal is to compare, for each compound in column 1, the distribution of the values in column 3. After a lot of refactoring I managed to end up with this code:
import pandas as pd
from scipy.stats import ks_2samp

with open('example.csv', 'r') as f, open('Metrics.tsv', 'a') as f_out:
    f_out.write('compound_1\tcompound_2\tSimilarity\tKS Distance\n')
    # the sample file has a header row, so header=0 rather than header=None
    df = pd.read_csv(f, delimiter=',', lineterminator='\n', header=0)
    # d maps each compound in column 1 to the list of its similarity values
    d = {}
    l_id1 = []
    l_id2 = []
    l_sim = []
    uniq_comps = df.iloc[:, 0].unique().tolist()
    for i in uniq_comps:
        d[i] = []
    for j in range(df.shape[0]):
        d[df.iloc[j, 0]].append(df.iloc[j, 2])
        l_id1.append(df.iloc[j, 0])
        l_id2.append(df.iloc[j, 1])
        l_sim.append(df.iloc[j, 2])
    # one KS test per row, comparing the two compounds' distributions
    for k in range(len(l_id1)):
        sim = round(l_sim[k], 2)
        # read the statistic off the result object instead of regexing its repr
        ks = ks_2samp(d[l_id1[k]], d[l_id2[k]]).statistic
        f_out.write(l_id1[k] + '\t' + l_id2[k] + '\t' + str(sim) + '\t' + str(ks) + '\n')
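To be explicit about what comparing "the distribution of column 3 for each compound in column 1" means: the dict d that the loop above fills row by row is exactly this per-compound grouping, which in plain pandas (a minimal sketch, assuming the column names from the sample header) would be

import pandas as pd

df = pd.read_csv('example.csv')
# one list of similarity values per compound in column 1,
# i.e. the dict d built row by row in the code above
d = df.groupby('id1')['similarity'].apply(list).to_dict()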
It runs, but as expected it is very slow. Does anyone know how to make it faster? The output I want looks like this:
compound_1,compound_2,Similarity,KS Distance
CHEMBL1,CHEMBL1,1.0,0.0
CHEMBL2,CHEMBL1,0.18,0.4
CHEMBL3,CHEMBL1,0.56,0.2
CHEMBL4,CHEMBL1,0.64,0.2
CHEMBL5,CHEMBL1,0.12,0.4
CHEMBL1,CHEMBL2,0.18,0.4
CHEMBL2,CHEMBL2,1.0,0.0
CHEMBL3,CHEMBL2,0.26,0.2
CHEMBL4,CHEMBL2,0.78,0.4
CHEMBL5,CHEMBL2,0.33,0.2
CHEMBL1,CHEMBL3,0.56,0.2
CHEMBL2,CHEMBL3,0.26,0.2
CHEMBL3,CHEMBL3,1.0,0.0
CHEMBL4,CHEMBL3,0.04,0.2
CHEMBL5,CHEMBL3,0.85,0.2
CHEMBL1,CHEMBL4,0.64,0.2
CHEMBL2,CHEMBL4,0.78,0.4
CHEMBL3,CHEMBL4,0.04,0.2
CHEMBL4,CHEMBL4,1.0,0.0
CHEMBL5,CHEMBL4,0.49,0.2
CHEMBL1,CHEMBL5,12.0,0.4
CHEMBL2,CHEMBL5,0.33,0.2
CHEMBL3,CHEMBL5,0.85,0.2
CHEMBL4,CHEMBL5,0.49,0.2
CHEMBL5,CHEMBL5,1.0,0.0
Given the size of the data, would it be more sensible to run this in PySpark? If so, how would I get a similar result there?
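In case it helps, this is roughly the shape I imagine a PySpark version taking (an untested sketch: it assumes a local SparkSession, the column names from the sample header, and that collect_list can hold each compound's full value list in a single row):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from scipy.stats import ks_2samp

spark = SparkSession.builder.appName('ks-distance').getOrCreate()
df = spark.read.csv('example.csv', header=True, inferSchema=True)

# one row per compound, holding all of that compound's column-3 values
dists = df.groupBy('id1').agg(F.collect_list('similarity').alias('values'))

# a plain Python UDF wrapping scipy's two-sample KS test
ks_udf = F.udf(lambda a, b: float(ks_2samp(a, b).statistic), DoubleType())

left = dists.select(F.col('id1').alias('compound_1'), F.col('values').alias('v1'))
right = dists.select(F.col('id1').alias('compound_2'), F.col('values').alias('v2'))

result = (df.select(F.col('id1').alias('compound_1'),
                    F.col('id2').alias('compound_2'),
                    F.round('similarity', 2).alias('Similarity'))
            .join(left, 'compound_1')
            .join(right, 'compound_2')
            .withColumn('KS Distance', ks_udf('v1', 'v2'))
            .drop('v1', 'v2'))

result.write.csv('Metrics', header=True)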