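I am looking to use cosine similarity to calculate the similarity between columns of a pandas DataFrame. I have 6 text columns split into 2 sections: the first 3 columns form section 1 [textA, textB, textC] and the remaining 3 form section 2 [text1, text2, text3]. I need to compare each column in sec1 against all columns in sec2 and, by creating separate columns, return the matched value, the similarity score, and True or False depending on whether a match was found.
I tried to achieve this with the code below, but I could not work out how to vectorize the columns and compute the similarity. Can someone guide me on this?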
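import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# `data` is the DataFrame holding the six text columns (sample rows below).
list_1 = ['Text A', 'Text B', 'Text C']
list_2 = ['Text 1', 'Text 2', 'Text 3']

# Collect the actual column names that belong to each section.
first_list = []
next_list = []
for col in data.columns:
    for col1 in list_1:
        if col1 in col:
            first_list.append(col)
    for col2 in list_2:
        if col2 in col:
            next_list.append(col)
print(first_list, next_list)

# TF-IDF attempt: the six columns are joined into one string per row,
# because I am not sure how to vectorize the columns separately.
count_vectorizer = TfidfVectorizer(stop_words='english')
row_text = data[first_list + next_list].fillna('').astype(str).agg(' '.join, axis=1)
sparse_matrix = count_vectorizer.fit_transform(row_text)
doc_term_matrix = sparse_matrix.toarray()
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names_out(),
                  index=data.index)

def lets_match(x):
    # True if any section-2 value is a substring of any section-1 value.
    for text1 in next_list:
        for text2 in first_list:
            try:
                if x[text1] in x[text2]:
                    return True
            except TypeError:  # e.g. a NaN cell
                continue
    return False

data['output'] = data.apply(lets_match, axis=1)
print(data)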
The expected output is the last 3 columns of the data below.
Here is the data in CSV format:
Text A, Text B, Text C, Text 1, Text 2, Text 3, Match, Similarity Score, Result
SIDDIS JEWELS INDIA LLP, SANJAY SHRESTHA, [FINANCIAL DEPARTMENT,HOTEL TAJ TASHI, BHUTAN], MEGA INTERNATIONAL COMMERCIAL BANK, SANJAY SHRESTHA, [LION LIMITED,FLAT/RMA5,9/F SILVERCORP INTERNATIONAL TOWER], SANJAY SHRESTHA, 0.53, TRUE
T BANK LIMITED, PUNJAB NATIONAL BANK, [FINANCIAL DEPARTMENT,HOTEL TAJ TASHI, BHUTAN], KINGXIN INTERNATIONAL TRADE CO, PUNJAB BANK, [SILVERCORP INTERNATIONAL TOWER, HONG KONG], , 0.67, FALSE
MEGA INTERNATIONAL COMMERCIAL BANK, SANJAY SHRESTHA, [LION LIMITED,FLAT/RMA5,9/F CORP INTERNATIONAL TOWER], SIDDIS JEWELS INDIA LLP, France, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], , 0.53, FALSE
SIDDIS JEWELS INDIA LLP, Italy, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], SIDDIS JEWELS INDIA LLP, Anil Kumar, [CORP INTERNATIONAL TOWER, HONG KONG], SIDDIS JEWELS INDIA LLP, 0.34, TRUE
BABA DAWOO COMMERCIAL VEHICLES, Syrian Arab Republic, [CORP NATIONAL TOWER, HONG KONG], T BANK LIMITED, Syria, [CORP INTERNATIONAL TOWER, HONG KONG], Syria, 0.95, TRUE
T BANK LIMITED, UAE, [FINANCIAL DEPARTMENT,HOTEL TAJ TASHI, BHUTAN], KINGXIN INTERNATIONAL TRADE CO, Neerav Modi, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], , 0.83, FALSE
ANDANI GLOBAL PTE LTD, North Korea, [LION LIMITED,FLAT/RMA5,9/F CORP INTERNATIONAL TOWER], NTS (ASIA PACIFIC) PTE LTD, North Korea, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], North Korea, 0.53, TRUE
KINGXIN INTERNATIONAL TRADE CO, Neerav Modi, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], ADANI GLOBAL FZE, Syria, [CORP INTERNATIONAL TOWER, HONG KONG], , 0.67, FALSE
AMIAN DIAMONDS NV, Vijay Malya, [CORP INTERNATIONAL TOWER, HONG KONG], AMIAN DIAMONDS NV, Vijay Malya, [FINANCIAL DEPARTMENT,HOTEL TAJ TASHI, BHUTAN], AMIAN DIAMONDS NV, Vijay Malya , 0.53, TRUE
AMIAN DIAMONDS NV, Mohammad Ali, [LION LIMITED,FLAT/RMA5,9/F CORP NATIONAL TOWER], ANDANI GLOBAL FZE, Ali Mohammad, [LION LIMITED,FLAT/RMA5,9/F CORP INTERNATIONAL TOWER], Ali Mohammad, 0.95, TRUE
NET ELECTRONICS L L C, Iran, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], AMIAN DIAMONDS NV, Iran, [FINANCIAL DEPARTMENT,HOTEL TAJ TASHI, BHUTAN], Iran, 0.83, TRUE
GEGA INTERNATIONAL COMMERCIAL BANK, Rajendra Nagar, [CORP INTERNATIONAL TOWER, HONG KONG], SIDDIS JEWELS INDIA LLP, Rajendra Nagar, [CORP INTERNATIONAL ,HONG KONG], Rajendra Nagar, CORP INTERNATIONAL ,HONG KONG, 0.83, TRUE
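The comparison described at the top of the question could be sketched roughly as follows. This is only one possible approach, not a definitive answer: it fits a single word-level TF-IDF vocabulary over all six columns, computes the row-wise cosine similarity for every (section 1, section 2) column pair, and keeps the best-scoring pair per row. The function name `add_match_columns`, the `threshold` value of 0.8, and the choice to report the section 2 text as the match are all assumptions made for illustration; the sample data does not say how `Result` is derived from the score, so the numbers it produces need not agree with the expected columns above.

```python
import numpy as pd_np_placeholder  # replaced below; kept out of the namespace
del pd_np_placeholder

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def add_match_columns(data, sec1, sec2, threshold=0.8):
    """Return a copy of `data` with Match, Similarity Score and Result columns.

    Hypothetical helper: `threshold` is an assumed cut-off, tune it for your data.
    """
    cols = list(sec1) + list(sec2)
    texts = data[cols].fillna("").astype(str)

    # Fit one vocabulary over every cell so all vectors share the same space.
    vectorizer = TfidfVectorizer()
    vectorizer.fit(texts.to_numpy().ravel())
    vectors = {col: vectorizer.transform(texts[col]) for col in cols}

    best_score = np.zeros(len(data))
    best_match = np.full(len(data), "", dtype=object)
    for a in sec1:
        for b in sec2:
            # TF-IDF rows are L2-normalised by default, so the row-wise
            # dot product of the two matrices is the cosine similarity.
            product = vectors[a].multiply(vectors[b])
            score = np.asarray(product.sum(axis=1)).ravel()
            better = score > best_score
            best_score = np.where(better, score, best_score)
            best_match = np.where(better, texts[b].to_numpy(), best_match)

    is_match = best_score >= threshold
    out = data.copy()
    out["Match"] = np.where(is_match, best_match, "")
    out["Similarity Score"] = best_score.round(2)
    out["Result"] = is_match
    return out


# Tiny demo frame loosely based on the sample rows (values shortened).
sample = pd.DataFrame({
    "Text A": ["SIDDIS JEWELS INDIA LLP", "T BANK LIMITED"],
    "Text B": ["SANJAY SHRESTHA", "PUNJAB NATIONAL BANK"],
    "Text C": ["HOTEL TAJ TASHI", "HOTEL TAJ TASHI"],
    "Text 1": ["MEGA INTERNATIONAL COMMERCIAL BANK", "KINGXIN INTERNATIONAL TRADE CO"],
    "Text 2": ["SANJAY SHRESTHA", "FRANCE"],
    "Text 3": ["SILVERCORP TOWER", "HONG KONG"],
})
result = add_match_columns(
    sample,
    sec1=["Text A", "Text B", "Text C"],
    sec2=["Text 1", "Text 2", "Text 3"],
)
print(result[["Match", "Similarity Score", "Result"]])
```

Word-level TF-IDF only rewards shared whole tokens, so pairs such as "Syrian Arab Republic" and "Syria" would score 0 here. If near-matches like that should count, passing `analyzer="char_wb"` with an `ngram_range` to `TfidfVectorizer` is one option worth trying.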