I'm trying to do semantic search, but the pretrained model is not accurate on Italian grocery data.
For example:
Query: latte al cioccolato #chocolate milk
Top 3 most similar sentences in the corpus:
Milka cioccolato al latte 100 g (Score: 0.7714) #Milka milk chocolate 100 g
Alpro, Cioccolato bevanda a base di soia 1 ltr (Score: 0.5586) #Alpro, Chocolate soy drink 1 ltr(soya milk)
Danone, HiPRO 25g Proteine gusto cioccolato 330 ml (Score: 0.4569) #Danone, HiPRO 25g Protein chocolate flavor 330 ml(protein chocolate milk)
In the example above, the problem is that the pretrained BERT model does not return contextually similar results. The results should be ranked in the following order.
Expected results:
Query: latte al cioccolato #chocolate milk
Top 3 most similar sentences in the corpus:
Alpro, Cioccolato bevanda a base di soia 1 ltr (Score: 0.99) #Alpro, Chocolate soy drink 1 ltr(soya milk)
Danone, HiPRO 25g Proteine gusto cioccolato 330 ml (Score: 0.95) #Danone, HiPRO 25g Protein chocolate flavor 330 ml(protein chocolate milk)
Milka cioccolato al latte 100 g (Score: 0.40) #Milka milk chocolate 100 g
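For context, the scores above come from the standard sentence-transformers semantic-search recipe: encode the corpus once, encode the query, and rank by cosine similarity. The exact search script isn't shown here, so this is a minimal reconstruction; the corpus list and variable names are illustrative.

import scipy.spatial
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distiluse-base-multilingual-cased')

corpus = [
    "Milka cioccolato al latte 100 g",
    "Alpro, Cioccolato bevanda a base di soia 1 ltr",
    "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml",
]
corpus_embeddings = model.encode(corpus)

query = "latte al cioccolato"
query_embedding = model.encode(query)

# Cosine distance to every corpus sentence; similarity = 1 - distance
distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

print("Query:", query)
print("Top 3 most similar sentences in the corpus:")
for sentence, distance in sorted(zip(corpus, distances), key=lambda x: x[1])[:3]:
    print(f"{sentence} (Score: {1 - distance:.4f})")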
Fine-tuning attempt:
!pip install sentence-transformers
import scipy
import numpy as np
from sentence_transformers import models, SentenceTransformer
model = SentenceTransformer('distiluse-base-multilingual-cased')  # works with Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish
#Fine-Tuning
import pandas as pd
df = pd.DataFrame({
    "message": [
        "latte al cioccolato",
        "Alpro, Cioccolato bevanda a base di soia 1 ltr",       # Alpro, Chocolate soy drink 1 ltr
        "Milka cioccolato al latte 100 g",                      # Milka milk chocolate 100 g
        "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml",   # Danone, HiPRO 25g Protein chocolate flavor 330 ml
    ],
    "lbl": ["liquid", "liquid", "chocolate", "liquid"],
})
df
X = list(df['message'])
y = list(df['lbl'])
# One-hot encode the labels and keep the 'liquid' column as a binary target: [1, 1, 0, 1]
y = list(pd.get_dummies(y, drop_first=True)['liquid'])
from transformers import AutoTokenizer

# Tokenize the product names with the checkpoint's own tokenizer
tokenizer = AutoTokenizer.from_pretrained("kiri-ai/distiluse-base-multilingual-cased-et")
encodings = tokenizer(X, truncation=True, padding=True)
import tensorflow as tf

# Pack (features, label) pairs into a tf.data.Dataset for TFTrainer
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(encodings),
    y,
))
from transformers import TFAutoModelForSequenceClassification, TFTrainer, TFTrainingArguments
training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=2,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)
with training_args.strategy.scope():
    # TFTrainer needs a TensorFlow model with a classification head; the plain
    # (PyTorch) AutoModel used before fails here. from_pt=True converts the
    # checkpoint's PyTorch weights in case no TF weights are published.
    model = TFAutoModelForSequenceClassification.from_pretrained(
        "kiri-ai/distiluse-base-multilingual-cased-et",
        num_labels=2,     # binary target: liquid vs. not liquid
        from_pt=True,
    )

trainer = TFTrainer(
    model=model,                  # the instantiated Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
)
trainer.train()
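One thing I'm unsure about: the classification head trained above is never loaded back into the SentenceTransformer used for search, so the similarity scores may not change at all. For comparison, sentence-transformers ships its own fine-tuning loop built on (sentence pair, target similarity) examples; below is a minimal sketch of that route. The pairs and target scores here are invented purely for illustration and are not real labeled data.

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer('distiluse-base-multilingual-cased')

# Hypothetical (query, product) pairs with target cosine similarities;
# real pairs would have to be labeled from the grocery catalog.
train_examples = [
    InputExample(texts=["latte al cioccolato",
                        "Alpro, Cioccolato bevanda a base di soia 1 ltr"], label=0.9),
    InputExample(texts=["latte al cioccolato",
                        "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml"], label=0.8),
    InputExample(texts=["latte al cioccolato",
                        "Milka cioccolato al latte 100 g"], label=0.3),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# Fine-tune the embedding model directly against the target similarities
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=2, warmup_steps=10)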