python - Get the top-K related documents from a similarity numpy.ndarray

Question

I am using the document similarity as defined here.

My question is how to get the most related documents from the numpy.ndarray. Is there a way to sort the numpy array and get the top-K related documents that are similar?

Here is the sample code.

from sklearn.feature_extraction.text import TfidfVectorizer

poem = ["All the world's a stage",
"And all the men and women merely players",
"They have their exits and their entrances",
"And one man in his time plays many parts",
"His acts being seven ages. At first, the infant",
"Mewling and puking in the nurse's arms",
"And then the whining school-boy, with his satchel",
"And shining morning face, creeping like snail",
"Unwillingly to school. And then the lover",
"Sighing like furnace, with a woeful ballad",
"Made to his mistress' eyebrow. Then a soldier",
"Full of strange oaths and bearded like the pard",
"Jealous in honour, sudden and quick in quarrel",
"Seeking the bubble reputation",
"Even in the cannon's mouth. And then the justice",
"In fair round belly with good capon lined",
"With eyes severe and beard of formal cut",
"Full of wise saws and modern instances",
"And so he plays his part. The sixth age shifts",
"Into the lean and slipper'd pantaloon",
"With spectacles on nose and pouch on side",
"His youthful hose, well saved, a world too wide",
"For his shrunk shank; and his big manly voice",
"Turning again toward childish treble, pipes",
"And whistles in his sound. Last scene of all",
"That ends this strange eventful history",
"Is second childishness and mere oblivion",
"Sans teeth, sans eyes, sans taste, sans everything"]


vect = TfidfVectorizer(min_df=1)
tfidf = vect.fit_transform(poem) 

# Rows are L2-normalized by default, so this product is the cosine similarity matrix.
result = (tfidf @ tfidf.T).toarray()

print(type(result))

print(result)

score 1 · Accepted Answer

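Set the diagonal elements to zero, then use argsort() on the flattened array to find the indices of the top K values, and use unravel_index() to convert the 1D indices into 2D indices: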

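import numpy as np

# Zero the diagonal so a document is never matched with itself.
result[np.diag_indices_from(result)] = 0.0
# Flattened indices of the 10 largest similarities, in ascending order.
idx = np.argsort(result, axis=None)[-10:]
# Convert the flat indices back to (row, column) pairs.
midx = np.unravel_index(idx, result.shape)
print(midx)
print(result[midx])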

Result:

(array([ 8, 14, 1, 0, 11, 17, 8, 10, 6, 8]), array([14, 8, 0, 1, 17, 11, 10, 8, 8, 6]))
[ 0.2329741 0.2329741 0.2379527 0.2379527 0.25723394 0.25723394 0.26570327 0.26570327 0.34954834 0.34954834]
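Because the similarity matrix is symmetric, every pair shows up twice in the result above, once as (i, j) and once as (j, i). If that is not wanted, one possible variation is to look only at the upper triangle. The sketch below also shows a per-document ranking, which may be closer to "the top-K documents related to document i". The helper names `top_k_pairs` and `top_k_per_document` and the small `demo` matrix are made up for illustration and are not part of the original answer.

```python
import numpy as np


def top_k_pairs(sim, k):
    """Return the k most similar unordered document pairs (i < j), best first."""
    sim = np.asarray(sim, dtype=float)
    # Strict upper triangle: each pair appears once and the diagonal is skipped.
    rows, cols = np.triu_indices_from(sim, k=1)
    scores = sim[rows, cols]
    k = min(k, scores.size)
    # Sort by score, reverse to descending order, keep the first k.
    order = np.argsort(scores)[::-1][:k]
    return [(int(rows[i]), int(cols[i]), float(scores[i])) for i in order]


def top_k_per_document(sim, k):
    """For each row, return the indices of its k most similar other documents."""
    sim = np.array(sim, dtype=float)  # copy, so the caller's matrix is untouched
    np.fill_diagonal(sim, -np.inf)    # a document should not match itself
    k = min(k, sim.shape[0] - 1)
    # Negate to sort each row in descending order, then keep the first k columns.
    return np.argsort(-sim, axis=1)[:, :k]


# Tiny hand-made symmetric similarity matrix, used only for illustration.
demo = np.array([[1.0, 0.2, 0.9],
                 [0.2, 1.0, 0.5],
                 [0.9, 0.5, 1.0]])

print(top_k_pairs(demo, 2))         # [(0, 2, 0.9), (1, 2, 0.5)]
print(top_k_per_document(demo, 1))  # row i holds the nearest neighbour of document i
```

With the poem example, `top_k_pairs(result, 5)` should give the five distinct line pairs instead of ten mirrored entries, and `top_k_per_document(result, 3)[i]` the three lines most similar to line `i`. For very large matrices, `np.argpartition` could replace the full sort, at the cost of having to order the selected entries afterwards.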
