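I am using Numba to speed up the loop below. Without Numba it takes 135 seconds to run; with Numba it takes 0.30 seconds :) which is very fast.
In the loop below, I compare each entry of the array against a threshold of 0.85. If the condition is true, I insert the data into a list that the function returns.
Each row inserted into the list should look like this:
['Source ID', 'Source TEXT', 'Similar ID', 'Similar TEXT', 'Score']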
import numba
import numpy as np
from numba.typed import List
from sklearn.metrics.pairwise import cosine_similarity  # assumed: scikit-learn's cosine_similarity

idd = df['ID'].to_numpy()
txt = df['TEXT'].to_numpy()
Column = 'TEXT'
df = preprocessing(dataresult, Column)  # removes special characters from the 'TEXT' column
message_embeddings = model_url(np.array(df['DescriptionNew']))  # pass the text to the Universal Sentence Encoder model to create sentence embeddings
cos_sim = cosine_similarity(message_embeddings)  # len(cos_sim) > 8000

# The function below finds duplicates among rows.
@numba.jit(nopython=True)
def similarity(nid, txxt, cos_sim, threshold):
    numba_list = List()
    for i in range(cos_sim.shape[0]):
        for index in range(i, cos_sim.shape[1]):
            if cos_sim[i][index] > threshold and i != index:
                numba_list.append([nid[i], nid[index], cos_sim[i][index]])  # this works
                # numba_list.append([txxt[i], txxt[index]])  # or this works
                # numba_list.append([nid[i], txxt[i], nid[index], txxt[index], cos_sim[i][index]])  # I want this to work.
    return numba_list

print(similarity(idd, txt, cos_sim, 0.85))
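In the code above, when appending to the list, I can append either the numeric columns or the text columns, but not both together. I want all columns, numeric and text, to be inserted into numba_list.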
I get the error below:
1 frames
/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py in error_rewrite(e, issue_type)
359 raise e
360 else:
--> 361 raise e.with_traceback(None)
362
363 argtypes = []
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Poison type used in arguments; got Poison<LiteralList((int64, [unichr x 12], int64, [unichr x 12], float32))>
During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'append') for ListType[undefined])
During: typing of call at <ipython-input-179-6ee851edb6b1> (14)
File "<ipython-input-179-6ee851edb6b1>", line 14:
def zero(nid, txxt, cos_sim, threshold):
<source elided>
# print(i+1)
numba_list.append([nid[i], txxt[i], nid[index], txxt[index], cos_sim[i][index]])
^
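
The `Poison<LiteralList(...)>` type in the message appears to come from the list literal mixing int64, string, and float32 elements, while a Numba typed list needs a single element type. One possible workaround, sketched here under assumptions rather than as a definitive fix, is to let the compiled function return only homogeneous data (row indices, column indices, and scores) and to build the mixed rows in ordinary Python afterwards. The names `similar_pairs` and `build_rows` are hypothetical, the ID column is assumed to hold integers, and the fallback decorator exists only so the sketch also runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs (slowly) without Numba.
    def njit(func):
        return func


@njit
def similar_pairs(cos_sim, threshold):
    """Return (rows, cols, scores) for entries above the diagonal that exceed threshold."""
    n_rows = cos_sim.shape[0]
    n_cols = cos_sim.shape[1]

    # First pass: count matches so the output arrays can be preallocated.
    count = 0
    for i in range(n_rows):
        for j in range(i + 1, n_cols):
            if cos_sim[i, j] > threshold:
                count += 1

    rows = np.empty(count, dtype=np.int64)
    cols = np.empty(count, dtype=np.int64)
    scores = np.empty(count, dtype=np.float64)

    # Second pass: fill the arrays with the matching index pairs and scores.
    k = 0
    for i in range(n_rows):
        for j in range(i + 1, n_cols):
            if cos_sim[i, j] > threshold:
                rows[k] = i
                cols[k] = j
                scores[k] = cos_sim[i, j]
                k += 1
    return rows, cols, scores


def build_rows(ids, texts, cos_sim, threshold):
    """Assemble mixed-type rows in plain Python, outside the compiled function."""
    rows, cols, scores = similar_pairs(cos_sim, threshold)
    # Assumption: IDs are integers and texts are strings.
    return [
        [int(ids[i]), str(texts[i]), int(ids[j]), str(texts[j]), float(s)]
        for i, j, s in zip(rows, cols, scores)
    ]
```

With the arrays from the question, this would be called as `build_rows(idd, txt, cos_sim, 0.85)`. Starting the inner loop at `i + 1` skips the diagonal, so the separate `i != index` check is not needed in this version.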