
I am trying to do semantic search, but the pretrained model is not accurate on Italian grocery data.

For example:

Query: latte al cioccolato  #chocolate milk

Top 3 most similar sentences in the corpus:
Milka  cioccolato al latte 100 g (Score: 0.7714)   #Milka milk chocolate 100 g
Alpro, Cioccolato bevanda a base di soia 1 ltr (Score: 0.5586)  #Alpro, Chocolate soy drink 1 ltr(soya milk)
Danone, HiPRO 25g Proteine gusto cioccolato 330 ml (Score: 0.4569) #Danone, HiPRO 25g Protein chocolate flavor 330 ml(protein chocolate milk) 
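For reference, a `Score` like the ones above is just the cosine similarity between the query embedding and each product embedding, with the corpus sorted by that value. A minimal NumPy sketch of the ranking step (the 3-d vectors are made up and only stand in for real model embeddings):

```python
import numpy as np

def rank_by_cosine(query_vec, corpus_vecs, names):
    """Rank corpus entries by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)
    return [(names[i], float(scores[i])) for i in order]

# Toy 3-d vectors standing in for real sentence embeddings (illustrative only)
names = ["Milka cioccolato al latte", "Alpro bevanda a base di soia", "Danone HiPRO"]
corpus = np.array([[0.9, 0.1, 0.0],
                   [0.5, 0.8, 0.1],
                   [0.4, 0.6, 0.2]])
query = np.array([0.8, 0.3, 0.0])

for name, score in rank_by_cosine(query, corpus, names):
    print(f"{name} (Score: {score:.4f})")
```

Changing the embeddings (which is what fine-tuning does) changes the ranking; the cosine step itself stays the same.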

In the example above, the problem is that the pretrained BERT model does not return contextual similarity. The results should be ordered as follows.

Expected results:

Query: latte al cioccolato  #chocolate milk

Top 3 most similar sentences in the corpus:
Alpro, Cioccolato bevanda a base di soia 1 ltr (Score: 0.99)  #Alpro, Chocolate soy drink 1 ltr(soya milk)
Danone, HiPRO 25g Proteine gusto cioccolato 330 ml (Score: 0.95) #Danone, HiPRO 25g Protein chocolate flavor 330 ml(protein chocolate milk)
Milka  cioccolato al latte 100 g (Score: 0.40)   #Milka milk chocolate 100 g
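One reason the pretrained model may rank the Milka bar first is surface-level token overlap: `latte al cioccolato` and `cioccolato al latte` contain exactly the same words. A quick lexical check in plain Python (the `jaccard` helper is my own illustration, not part of the pipeline):

```python
def jaccard(a, b):
    """Bag-of-words Jaccard overlap between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

query = "latte al cioccolato"
print(jaccard(query, "Milka  cioccolato al latte 100 g"))               # 0.5: every query token appears
print(jaccard(query, "Alpro, Cioccolato bevanda a base di soia 1 ltr")) # ~0.09: only 'cioccolato' matches
```

A model leaning on lexical cues will therefore score the chocolate bar highest even though it is the wrong product category; the fine-tuning below is meant to push the embeddings toward product category instead.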

Fine-tuning attempt:

!pip install sentence-transformers
import scipy
import numpy as np
from sentence_transformers import models, SentenceTransformer
model = SentenceTransformer('distiluse-base-multilingual-cased') # works with Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish


#Fine-Tuning 
import pandas as pd
df = pd.DataFrame({
    "message":[
          "latte al cioccolato"  ,
          "Alpro, Cioccolato bevanda a base di soia 1 ltr ", #Alpro, Chocolate soy drink 1 ltr
          "Milka  cioccolato al latte 100 g", #Milka milk chocolate 100 g
          "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml", #Danone, HiPRO 25g Protein chocolate flavor 330 ml
         ],
    "lbl":["liquid","liquid","chocolate","liquid"]
})
df


X=list(df['message'])
y=list(df['lbl'])


y = list(pd.get_dummies(y, drop_first=True)['liquid'])  # binary labels: 1 = liquid, 0 = chocolate -> [1, 1, 0, 1]


from transformers import AutoTokenizer, AutoModel
  
tokenizer = AutoTokenizer.from_pretrained("kiri-ai/distiluse-base-multilingual-cased-et")
encodings = tokenizer(X, truncation=True, padding=True)


import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(encodings),
    y
))



from transformers import TFAutoModelForSequenceClassification, TFTrainer, TFTrainingArguments

training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=2,              # total number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)



with training_args.strategy.scope():
    # TFTrainer needs a TensorFlow model with a classification head (plain
    # AutoModel returns a headless PyTorch model); from_pt=True converts the
    # checkpoint if it only ships PyTorch weights
    model = TFAutoModelForSequenceClassification.from_pretrained(
        "kiri-ai/distiluse-base-multilingual-cased-et", from_pt=True, num_labels=2)

trainer = TFTrainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset         # training dataset
)

trainer.train()
