
I have a dataframe similar to the following, but larger:

import pandas as pd

data = {'First':  ['First value', 'Third value', 'Second value', 'First value', 'Third value', 'Second value'],
        'Second': ['the old man is here', 'the young girl is there', 'the old woman is here', 'the young boy is there', 'the young girl is here', 'the old girl is here']}

df = pd.DataFrame(data, columns=['First', 'Second'])

I have calculated the similarity between each possible pair of groups based on the first column, as shown below (I got help with this part from other answers on Stack Overflow):

import nltk
from itertools import combinations

# function to calculate the similarity between each pair of documents
def similarity_measure(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)

    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)

    return float(len(intersection)) / len(union) * 100

# getting the tokenized text alongside the intents
data_similarity = df.groupby('First')['Second'].apply(lambda x: nltk.tokenize.word_tokenize(' '.join(x)))
data_similarity = data_similarity.reset_index()

# printing the similarity measure for each pair in the dataset
for val in list(combinations(range(len(data_similarity)), 2)):
    print(f"similarity between {data_similarity.iloc[val[0],0]} and {data_similarity.iloc[val[1],0]} intents is: {similarity_measure(data_similarity.iloc[val[0],1], data_similarity.iloc[val[1],1])}")
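
For reference (with toy tokens, not taken from my real data), the measure above is just the Jaccard overlap of the two token sets expressed as a percentage:

# 4 shared tokens out of 6 distinct tokens overall -> 4 / 6 * 100 ≈ 66.7
print(similarity_measure(['the', 'old', 'man', 'is', 'here'],
                         ['the', 'old', 'woman', 'is', 'here']))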

What I want as output is the average over all pairs. For example, if the code above produced the following output:

similarity between first value and second value is 60
similarity between first value and third value is 50 
similarity between second value and third value is 55
similarity between second value and first value is 60
similarity between third value and first value is 50
similarity between third value and second value is 55

I would like the average score of the first value across all of its combinations, and likewise for the second and third values (e.g. the first value's average is (60 + 50) / 2 = 55), like this:

first value average across all possible values is 55
second value average across all possible values is 57.5
third value average across all possible values is 52.5

1 Answer


Edit: based on your comment, you can do the following.

  1. First compute the data_similarity table, which combines the tokens of the different sentences within each group.
  2. Compute the pairwise similarity tuples between the groups.
  3. Put them into a dataframe, then group by group and take the mean.

import nltk
from itertools import product

# function to calculate the similarity between each pair of documents
def similarity_measure(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)

    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)

    return float(len(intersection)) / len(union) * 100

# getting the tokenized text alongside the intents
data_similarity = df.groupby('First')['Second'].apply(lambda x: nltk.tokenize.word_tokenize(' '.join(x)))
data_similarity = data_similarity.reset_index()

# (group A, group B, score) for every ordered pair of distinct groups
all_pairs = [(i, l, similarity_measure(j, m)) for (i, j), (l, m) in
             product(zip(data_similarity['First'], data_similarity['Second']), repeat=2) if i != l]

# put the tuples into a dataframe, then group by the first member of each pair and average
pair_similarity = pd.DataFrame(all_pairs, columns=['A', 'B', 'Similarity'])
group_similarity = pair_similarity.groupby(['A'])['Similarity'].mean().reset_index()
print(group_similarity)
              A  Similarity
0   First value   47.777778
1  Second value   45.000000
2   Third value   52.777778
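
Since the measure is symmetric, a possible variant (a sketch along the same lines, not something I have benchmarked) scores each unordered pair only once with combinations and then mirrors the frame before averaging; the per-group means come out the same with half the similarity calls:

from itertools import combinations

# score each unordered pair of groups exactly once
pairs_once = [(a, b, similarity_measure(x, y)) for (a, x), (b, y) in
              combinations(zip(data_similarity['First'], data_similarity['Second']), 2)]

once_df = pd.DataFrame(pairs_once, columns=['A', 'B', 'Similarity'])
# mirror the pairs so every group appears in column 'A', then average as before
mirrored = pd.concat([once_df, once_df.rename(columns={'A': 'B', 'B': 'A'})])
print(mirrored.groupby('A')['Similarity'].mean().reset_index())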
answered 2021-01-21T11:55:07.480