huggingface-transformers - attention_mask missing from the dictionary returned by tokenizer.encode_plus

Question

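I have a codebase that was working fine, but when I tried to run it today I noticed that tokenizer.encode_plus has stopped returning attention_mask. Was it removed in the latest release? Or do I need to do something else?

The following code used to work for me.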

encoded_dict = tokenizer.encode_plus(
    truncated_query,
    span_doc_tokens,
    max_length=max_seq_length,
    return_overflowing_tokens=True,
    pad_to_max_length=True,
    stride=max_seq_length - doc_stride - len(truncated_query) - sequence_pair_added_tokens,
    truncation_strategy="only_second",
    return_token_type_ids=True,
    return_attention_mask=True
)

But now I only get dict_keys(['input_ids', 'token_type_ids']) from encode_plus. I also noticed that the returned input_ids are not padded to max_length.
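Both symptoms (the missing key and the missing padding) can be checked programmatically right after the call. The following is only a minimal sketch: check_encoding is a hypothetical helper, not part of transformers, and it assumes the returned object behaves like a plain dict whose input_ids is a flat list.

```python
# Hypothetical helper (not part of transformers): reports which expected
# keys are absent and whether input_ids has the padded length.
def check_encoding(encoded, max_length,
                   required=("input_ids", "token_type_ids", "attention_mask")):
    problems = []
    for key in required:
        if key not in encoded:
            problems.append(f"missing key: {key}")
    # Only check the length when input_ids is actually present.
    if "input_ids" in encoded and len(encoded["input_ids"]) != max_length:
        problems.append(
            f"input_ids has length {len(encoded['input_ids'])}, expected {max_length}"
        )
    return problems


if __name__ == "__main__":
    # Stand-in for the broken output described above (values are made up).
    broken = {"input_ids": [101, 2054, 102], "token_type_ids": [0, 0, 0]}
    for problem in check_encoding(broken, max_length=8):
        print(problem)
```

In the real code this would be called as check_encoding(encoded_dict, max_seq_length); an empty list means the output has the expected shape.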

score 0 · Accepted Answer

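I figured out the problem. I had updated the tokenizers package to 0.7.0, which is the latest version. However, the latest version of transformers works with tokenizers 0.5.2. After rolling back to 0.5.2, the problem went away. With pip show, I see the following.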

Name: transformers
Version: 2.8.0
Summary: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch
Home-page: https://github.com/huggingface/transformers

Name: tokenizers
Version: 0.5.2
Summary: Fast and Customizable Tokenizers
Home-page: https://github.com/huggingface/tokenizers
