python - Building a Siamese network with Hugging Face: tokenizing two sentences separately with Hugging Face datasets, transformers, and TensorFlow

Question

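I am currently building a Siamese network on top of a pretrained Bert model from transformers, which takes 'input_ids', 'token_type_ids', and 'attention_mask' as inputs. My dataset has the structure question1, question2, label, so I have to tokenize the two questions separately.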

def tokenize(ds):
    q1=datasets.Sequence(tokenizer(ds['question1'], padding='max_length', truncation=True, max_length=128))
    q2=datasets.Sequence(tokenizer(ds['question2'], padding='max_length', truncation=True, max_length=128))
    return {"q1":q1,"q2":q2}
dataset_tokenized = dataset.map(tokenize)

The process runs about halfway, until it tries to convert the results to pyarrow and raises an error:

ArrowInvalid: Could not convert Sequence(feature={'input_ids':[too_long_to_show], 'token_type_ids':[too_long_to_show],'attention_mask':[too_long_to_show]}, length=-1, id=None) with type Sequence: did not recognize Python value type when inferring an Arrow data type

According to the "Flatten" section of the official documentation for datasets.Dataset, it seems that a dataset can have Sequence as one of its features.
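One possible reading of the error is that datasets.Sequence describes a feature type in a schema rather than wrapping actual values, so pyarrow cannot infer a type for it. A minimal sketch of an alternative, assuming flat, prefixed columns are acceptable: the function returns the tokenizer output as plain lists. The names `tokenize_pair` and `toy_tokenizer` are hypothetical, and the toy tokenizer only stands in for the real one so the sketch runs without downloading a model.

```python
# Sketch only: `tokenize_pair` and `toy_tokenizer` are hypothetical names.
def tokenize_pair(example, tokenizer, max_length=128):
    """Tokenize both questions separately and return flat, prefixed columns."""
    flat = {}
    for prefix, column in (("q1", "question1"), ("q2", "question2")):
        encoded = tokenizer(
            example[column],
            padding="max_length",
            truncation=True,
            max_length=max_length,
        )
        # Copy the plain lists into the result; no datasets.Sequence wrapper.
        for key, value in encoded.items():
            flat[f"{prefix}_{key}"] = value
    return flat


def toy_tokenizer(text, padding, truncation, max_length):
    """Stand-in for a real tokenizer (assumption: same three output keys)."""
    ids = [ord(ch) % 100 + 1 for ch in text][:max_length]
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [0] * pad,
        "token_type_ids": [0] * max_length,
        "attention_mask": [1] * len(ids) + [0] * pad,
    }


row = {"question1": "abc", "question2": "hello", "label": 1}
result = tokenize_pair(row, toy_tokenizer, max_length=8)
print(sorted(result))
```

With the real tokenizer and dataset from the question, this would presumably be applied as `dataset.map(lambda ex: tokenize_pair(ex, tokenizer))`, giving six columns such as q1_input_ids and q2_attention_mask next to the original ones.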

I want to build a network like this:

class Siamese(Model, ABC):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.transformer = TFBertModel.from_pretrained("hfl/chinese-bert-wwm")

    def call(self, inputs, training=None, mask=None):
        y1 = self.transformer(inputs[some_indices])
        y2 = self.transformer(inputs[some_indices])
        y1 = y1.get('last_hidden_state')
        y2 = y2.get('last_hidden_state')
        dist = tf.keras.losses.cosine_similarity(y1,y2)
        return dist

Question:

How should I arrange the dataset and the model so that the data fits the model?
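One way to replace the `inputs[some_indices]` placeholder is to key the model inputs by name instead of by position. The sketch below assumes the batch arrives as a dict with flat columns named q1_input_ids, q2_input_ids, and so on (an assumed naming scheme, not something the libraries require). `split_pair_inputs` is a hypothetical helper; the TensorFlow part is shown only as comments and has not been run here.

```python
# Sketch only: `split_pair_inputs` and the q1_/q2_ column names are assumptions.
def split_pair_inputs(batch, prefixes=("q1", "q2")):
    """Split flat, prefixed columns into one encoder input dict per sentence."""
    fields = ("input_ids", "token_type_ids", "attention_mask")
    return tuple(
        {field: batch[f"{prefix}_{field}"] for field in fields}
        for prefix in prefixes
    )


# Tiny fake batch; real values would be tensors of shape (batch, max_length).
batch = {
    "q1_input_ids": [[101, 7, 102]],
    "q1_token_type_ids": [[0, 0, 0]],
    "q1_attention_mask": [[1, 1, 1]],
    "q2_input_ids": [[101, 9, 102]],
    "q2_token_type_ids": [[0, 0, 0]],
    "q2_attention_mask": [[1, 1, 1]],
}
first, second = split_pair_inputs(batch)
print(first["input_ids"], second["input_ids"])

# Inside Siamese.call this could look like (not executed here):
#     first, second = split_pair_inputs(inputs)
#     y1 = self.transformer(first).last_hidden_state[:, 0]   # [CLS] vector
#     y2 = self.transformer(second).last_hidden_state[:, 0]
#     return tf.keras.losses.cosine_similarity(y1, y2)
```

Taking the first token's vector is one pooling choice among several; without some pooling, last_hidden_state has shape (batch, sequence, hidden) and the similarity would be computed per token rather than per sentence. Note also that tf.keras.losses.cosine_similarity returns the negative of the cosine similarity.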
