
With TensorFlow Transform we can preprocess data using Apache Beam. One of the requirements when setting up such a pipeline is to define a DatasetMetadata object, which holds the schema with the information needed to parse the data from its on-disk or in-memory format into tensors.

The official documentation gives an example of the form:

raw_data_metadata = dataset_metadata.DatasetMetadata(
    dataset_schema.from_feature_spec({
        's': tf.FixedLenFeature([], tf.string),
        'y': tf.FixedLenFeature([], tf.float32),
        'x': tf.FixedLenFeature([], tf.float32),
    }))

This is all fine if your raw data is a dictionary of the form:

{
    's': 'example string',
    'y': 32.0,
    'x': 35.0
}
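
For context, feeding such dictionaries together with the metadata into a TFT Beam pipeline works fine, roughly along the lines of the official simple example (the preprocessing_fn below and the use of scale_to_0_1 are just illustrative assumptions):

import tempfile

import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam

# in-memory raw data matching the feature spec above (illustrative)
raw_data = [{'s': 'example string', 'y': 32.0, 'x': 35.0}]

def preprocessing_fn(inputs):
    # illustrative only: scale 'x' to [0, 1] and pass the other features through
    return {
        'x_scaled': tft.scale_to_0_1(inputs['x']),
        'y': inputs['y'],
        's': inputs['s'],
    }

with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (
        (raw_data, raw_data_metadata)  # raw_data_metadata as defined above
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
transformed_data, transformed_metadata = transformed_dataset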

However, I am somewhat lost when it comes to defining a schema for a SequenceExample. More specifically, consider data in the following format:

{
    # context features
    'length': 5,
    # sequence features
    'tokens': [
        {
            'raw': 'The',
            'ner-tag': 'O'
        },
        {
            'raw': 'European',
            'ner-tag': 'B-org'
        },
        {
            'raw': 'Union',
            'ner-tag': 'I-org'
        },
        {
            'raw': 'is',
            'ner-tag': 'O'
        },
        {
            'raw': 'nice',
            'ner-tag': 'O'
        }
        ...
    ]
}

Above I have a single sentence with 2 sequences:

  • the ner-tag sequence, which will be used as the model's labels
  • the raw sequence, which will be used as the model's features

How do I create a TFT data schema for examples like this?
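
For reference, with plain TensorFlow (outside of TFT) I could describe such a record with a context/sequence feature spec along these lines (just a sketch; serialized_sequence_example is a placeholder for a serialized tf.train.SequenceExample, and the tags are kept as strings):

import tensorflow as tf

context_features = {
    # sentence-level feature
    'length': tf.io.FixedLenFeature([], tf.int64),
}
sequence_features = {
    # one value per token
    'raw': tf.io.FixedLenSequenceFeature([], tf.string),
    'ner-tag': tf.io.FixedLenSequenceFeature([], tf.string),
}

context, sequences = tf.io.parse_single_sequence_example(
    serialized_sequence_example,  # placeholder for a serialized SequenceExample
    context_features=context_features,
    sequence_features=sequence_features)

But I don't see how to express the equivalent of context_features and sequence_features for DatasetMetadata.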

The documentation is a bit lacking here. Any help is greatly appreciated!


1 Answer


Well, after some more research, the answer is: you can't.

TensorFlow Transform does not yet support SequenceExamples. Check this.

For now, the only way around this seems to be to have the Beam pipeline create the SequenceExamples, serialize them, and write them to TFRecord files.

Given the sentence object structure above, you first need to create a Beam DoFn that converts each sentence into a serialized SequenceExample:

import logging

import apache_beam as beam
import tensorflow as tf


class ConvertJSONSentenceToSerializedSequenceExample(beam.DoFn):

    def make_example(self, sentence):
        # the context features
        sentence_level_details = tf.train.Features(feature={
            'length': tf.train.Feature(int64_list=tf.train.Int64List(value=[sentence['length']]))
        })

        # create sequence data
        word_features = []
        ner_tags_features = []
        for token in sentence['tokens']:
            # create each of the features, then add them to the corresponding feature list
            word_feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[token['raw'].encode('utf-8')]))
            word_features.append(word_feature)

            # the ner tags are stored as strings here (e.g. 'B-org'); map them to ids later if needed
            ner_tag_feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[token['ner-tag'].encode('utf-8')]))
            ner_tags_features.append(ner_tag_feature)

        words = tf.train.FeatureList(feature=word_features)
        ner_tags = tf.train.FeatureList(feature=ner_tags_features)

        sentence_sequences = tf.train.FeatureLists(feature_list={
            'words': words,
            'ner-tags': ner_tags
        })

        ex = tf.train.SequenceExample(
            context = sentence_level_details,
            feature_lists = sentence_sequences
        )

        return ex

    def process(self, sentence, **kwargs):
        try:
            ex = self.make_example(sentence)
            yield ex.SerializeToString()
        except Exception as e:
            logging.warning("JSON sentence could not be converted into SequenceExample: " + str(e))
            return None
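
As a quick sanity check (outside any pipeline, with a hypothetical two-token sentence), you can call the DoFn directly and decode the result back into a SequenceExample proto:

fn = ConvertJSONSentenceToSerializedSequenceExample()
sentence = {
    'length': 2,
    'tokens': [
        {'raw': 'The', 'ner-tag': 'O'},
        {'raw': 'EU', 'ner-tag': 'B-org'},
    ],
}
serialized = next(fn.process(sentence))
print(tf.train.SequenceExample.FromString(serialized))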

Once that's done, you can use the beam.io.tfrecordio module to write these serialized SequenceExamples out as TFRecord files:

from apache_beam.io import tfrecordio

with beam.Pipeline(RUNNER, options=opts) as p:
    (p
     ...
     | 'Convert sentences to serialized TensorFlow SequenceExamples' >> beam.ParDo(ConvertJSONSentenceToSerializedSequenceExample())
     | 'Write to TFRecord files' >> tfrecordio.WriteToTFRecord(
         os.path.join(OUTPUT_DIR, 'train'),
         file_name_suffix='.gz'
         # the default coder is the BytesCoder, which works here since the training data is already serialized
     ))
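
For completeness, here is a rough sketch (not part of the pipeline above; the train*.gz file pattern and the words / ner-tags feature names are taken from the code above, everything else is assumed) of how the resulting gzipped TFRecords could be read back with tf.data:

import os

import tensorflow as tf

# read the gzipped TFRecord files written above
files = tf.data.Dataset.list_files(os.path.join(OUTPUT_DIR, 'train*.gz'))
dataset = tf.data.TFRecordDataset(files, compression_type='GZIP')

def parse(serialized):
    # parse each serialized SequenceExample back into context and sequence tensors
    context, sequences = tf.io.parse_single_sequence_example(
        serialized,
        context_features={'length': tf.io.FixedLenFeature([], tf.int64)},
        sequence_features={
            'words': tf.io.FixedLenSequenceFeature([], tf.string),
            'ner-tags': tf.io.FixedLenSequenceFeature([], tf.string),
        })
    return context, sequences

dataset = dataset.map(parse)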
Answered 2020-01-12 13:06