
I am trying to use BERT from TensorFlow Hub and build a tokenizer. This is what I am doing:

>>> import tensorflow_hub as hub
>>> from bert.tokenization import FullTokenizer

>>> BERT_URL = 'https://tfhub.dev/tensorflow/bert_zh_L-12_H-768_A-12/1'
>>> bert_layer = hub.KerasLayer(BERT_URL, trainable=False)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.

But now when I check the vocab file on the resolved object, I get an empty tensor:

>>> bert_layer.resolved_object.vocab_file.asset_path.shape
TensorShape([])

What is the correct way to get this vocab file?


1 Answer


Try this:

import tensorflow_hub as hub
import bert  # bert-for-tf2 package, provides bert.bert_tokenization

FullTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1", trainable=False)

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()  # the vocab file of BERT, used to build the tokenizer
tokenizer = FullTokenizer(vocab_file)
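
Note that asset_path is a scalar string tensor, which is presumably why the shape check in the question prints TensorShape([]); calling .numpy() on it still gives the actual path of the downloaded vocab file. A quick sanity check, as a sketch:

print(vocab_file)  # local path to vocab.txt inside the downloaded TF Hub module's assets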

Then you can use the tokenizer to tokenize text.

tokenizer.tokenize('Where are you going?') 

['w', '##hee', '##re', 'are', 'you', 'going', '?']
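
If you also need numeric input IDs for the model rather than token strings, the same FullTokenizer exposes convert_tokens_to_ids; a minimal sketch using the tokenizer built above:

tokens = tokenizer.tokenize('Where are you going?')
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # map each wordpiece token to its index in vocab.txt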

You can also pass other options to your tokenizer. For example:

do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = FullTokenizer(vocab_file, do_lower_case) 
tokenizer.tokenize('Where are you going?')

['where', 'are', 'you', 'going', '?']
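
The same pattern should carry over to the Chinese model from the question; a hedged sketch, assuming the bert_zh module exposes resolved_object.vocab_file and do_lower_case the same way as the English one:

import tensorflow_hub as hub
import bert  # bert-for-tf2 package

BERT_URL = 'https://tfhub.dev/tensorflow/bert_zh_L-12_H-768_A-12/1'
bert_layer = hub.KerasLayer(BERT_URL, trainable=False)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = bert.bert_tokenization.FullTokenizer(vocab_file, do_lower_case)
tokens = tokenizer.tokenize('你要去哪里?')  # CJK characters are typically split one character per token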

Answered 2020-04-06T14:27:14.703