python - How to use a tensorflow-hub module with the tensorflow-dataset API

Question

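I want to use the TensorFlow Dataset API together with TensorFlow Hub to build my dataset. Specifically, I want to use the dataset.map function to convert my text data into embeddings. My TensorFlow version is 1.14.

Since I am using the ELMo v2 module, which converts an array of sentences into their word embeddings, I used the following code: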

import tensorflow as tf
import tensorflow_hub as hub
...
sentences_array = load_sentences()
# sentences_array = ["I love Python", "python is a good PL"]

def parse(sentences):
    elmo = hub.Module("./ELMO")
    embeddings = elmo([sentences], signature="default", as_dict=True)["word_emb"]
    return embeddings

dataset = tf.data.TextLineDataset(sentences_array)
dataset = dataset.apply(
    tf.data.experimental.map_and_batch(map_func=parse, batch_size=batch_size))

I want the text embedded as an array of shape [batch_size, max_words_in_batch, embedding_size], but I get this error:

"NotImplementedError: Using TF-Hub module within a TensorFlow defined function is currently not supported."

How can I get the expected result?

score 2 · Accepted Answer

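Unfortunately, TensorFlow 1.x does not support this.

TensorFlow 2.0 does support it, however. If you can upgrade to TensorFlow 2 and pick one of the text embedding modules available for TF 2 (the current list is on tfhub.dev), you can use it inside the dataset pipeline. Something like this: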

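import tensorflow as tf
import tensorflow_hub as hub

# Load the TF 2 text embedding module once, outside the map function.
embedder = hub.load("https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1")

def parse(sentences):
    # Each dataset element is a scalar string; wrap it to form a batch of one.
    embeddings = embedder([sentences])
    return embeddings

dataset = tf.data.TextLineDataset("text.txt")
dataset = dataset.map(parse)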

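If you are tied to 1.x, or tied to ELMo (which I don't think is available in the new format yet), then the only option I can see for embedding at the preprocessing stage is to first run your dataset through a simple embedding model and save the results, then use the saved embedding vectors separately for the downstream task. (I appreciate this is less than ideal.)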
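
The "embed first, save, reuse later" workaround could be sketched roughly as below. This is only an illustration under assumptions, not a tested TF 1.14 recipe: the helper names (make_elmo_embed_fn, precompute_embeddings, load_batch, fake_embed_fn), the one-pickle-file-per-batch layout, and the file naming are all hypothetical. The hub.Module call runs in an ordinary session, outside any dataset.map function, which is the point of the workaround. A stand-in embedder is used in the demo so the sketch runs without TensorFlow or the ELMo module.

```python
import os
import pickle
import tempfile


def make_elmo_embed_fn(module_path):
    """Wrap a TF 1.x hub.Module in a plain Python function (hypothetical helper).

    TensorFlow is imported lazily so the rest of the sketch runs without it.
    """
    import tensorflow as tf  # assumed to be TF 1.x, as in the question
    import tensorflow_hub as hub

    graph = tf.Graph()
    with graph.as_default():
        sentences_ph = tf.placeholder(tf.string, shape=[None])
        elmo = hub.Module(module_path)
        word_emb = elmo(sentences_ph, signature="default", as_dict=True)["word_emb"]
        init_ops = [tf.global_variables_initializer(), tf.tables_initializer()]
    session = tf.Session(graph=graph)
    session.run(init_ops)
    # The session stays open for the lifetime of the returned function.
    return lambda batch: session.run(word_emb, feed_dict={sentences_ph: batch})


def precompute_embeddings(sentences, embed_fn, batch_size, out_dir):
    """Embed sentences batch by batch and save one pickle file per batch.

    Batches are stored separately because max_words_in_batch can differ
    from one batch to the next.
    """
    paths = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        embeddings = embed_fn(batch)
        path = os.path.join(out_dir, "batch_{:05d}.pkl".format(start // batch_size))
        with open(path, "wb") as f:
            pickle.dump(embeddings, f)
        paths.append(path)
    return paths


def load_batch(path):
    """Load one saved batch of embeddings for the downstream task."""
    with open(path, "rb") as f:
        return pickle.load(f)


def fake_embed_fn(batch):
    """Stand-in embedder: zeros shaped [batch, max_words_in_batch, 4]."""
    max_words = max(len(s.split()) for s in batch)
    return [[[0.0] * 4 for _ in range(max_words)] for _ in batch]


# Demo with the stand-in; with TF 1.x installed one would instead pass
# embed_fn=make_elmo_embed_fn("./ELMO").
sentences = ["I love Python", "python is a good PL", "hi"]
out_dir = tempfile.mkdtemp()
paths = precompute_embeddings(sentences, fake_embed_fn, batch_size=2, out_dir=out_dir)
```

The downstream model can then read the saved batches with load_batch (for instance from a Python generator) instead of calling the hub module inside the input pipeline.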
