Questions tagged "tensorflow-transform"

0 votes

1 answer

1051 views

tensorflow - How to make tf.Transform (Apache Beam preprocessing for TensorFlow) work?

I'm trying to use the tf.Transform library to do data preprocessing for TensorFlow with Apache Beam (Google Dataflow). https://github.com/tensorflow/transform

Here is my setup:

    conda create -n tftransform python=2.7
    source activate tftransform
    pip install tensorflow
    pip install tensorflow-transform
    pip install dill==0.2.6
    git clone https://github.com/tensorflow/transform.git
    cd transform/
    python setup.py install # for good measure
    ...

Then I try to run simple_example (https://github.com/tensorflow/transform/blob/master/examples/simple_example.py):

    python examples/simple_example.py

I get the following error: AttributeError: 'DType' object has no attribute 'dtype'

(There is also a warning on import: No handlers could be found for logger "oauth2client.contrib.multistore_file")

Here is the stack trace:

Traceback (most recent call last):
  File "examples/simple_example.py", line 64, in <module>
    preprocessing_fn, tempfile.mkdtemp()))
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 439, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/apache_beam/pipeline.py", line 249, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 162, in apply
    return m(transform, input)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 168, in apply_PTransform
    return transform.expand(input)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/tensorflow_transform/beam/impl.py", line 597, in expand
    self._output_dir)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 439, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/apache_beam/pipeline.py", line 249, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 162, in apply
    return m(transform, input)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 168, in apply_PTransform
    return transform.expand(input)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/tensorflow_transform/beam/impl.py", line 328, in expand
    self._preprocessing_fn, input_schema)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/tensorflow_transform/impl_helper.py", line 416, in run_preprocessing_fn
    inputs = _make_input_columns(schema)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/tensorflow_transform/impl_helper.py", line 218, in _make_input_columns
    placeholders = schema.as_batched_placeholders()
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/tensorflow_transform/tf_metadata/dataset_schema.py", line 87, in as_batched_placeholders
    for key, column_schema in self.column_schemas.items()}
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/tensorflow_transform/tf_metadata/dataset_schema.py", line 87, in <dictcomp>
    for key, column_schema in self.column_schemas.items()}
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/tensorflow_transform/tf_metadata/dataset_schema.py", line 133, in as_batched_placeholder
    return self.representation.as_batched_placeholder(self)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/tensorflow_transform/tf_metadata/dataset_schema.py", line 330, in as_batched_placeholder
    return tf.placeholder(column.domain.dtype,
AttributeError: 'DType' object has no attribute 'dtype'

Is this library production-ready? How can I make this work?

2017-03-20T10:31:40.763

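0 votes

1 answer

660 views

tensorflow - Nested pipelines in Apache Beam

I'm looking to do the following with Apache Beam, specifically as preprocessing for a TensorFlow neural network.

For each file in a folder:
- For each line in the file
  - Process the line into a 1D list of floats

I need each result to be a 2D list of floats per file.

I think I could achieve this by creating nested pipelines: I could create and run a pipeline inside another pipeline's ParDo.

This seems inefficient, but my problem seems like a very standard use case.

Is there a tool in Apache Beam that does this better?
Is there a way to restructure my problem so that it works better in Apache Beam?
Are nested pipelines not as bad as I think?

Thanks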
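One way to restructure the problem without nested pipelines is to treat each file as a single element and do the per-line work in an ordinary function. The sketch below shows only that function in plain Python. It assumes each line holds comma-separated numbers and that the files are readable with the built-in open(); both are assumptions, and the names parse_line and file_to_matrix are hypothetical.

```python
import os
import tempfile


def parse_line(line):
    """Turn one text line into a 1D list of floats (assumes comma-separated values)."""
    return [float(field) for field in line.strip().split(",")]


def file_to_matrix(path):
    """Read one file and return (path, 2D list of floats), skipping blank lines."""
    with open(path) as handle:
        rows = [parse_line(line) for line in handle if line.strip()]
    return path, rows


# Small demonstration on a temporary file.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
tmp.write("1.0,2.0\n3.5,4.0\n")
tmp.close()
path, matrix = file_to_matrix(tmp.name)
os.remove(tmp.name)
print(matrix)
```

In a Beam pipeline, a function like file_to_matrix could plausibly be applied with beam.Map to a PCollection of file paths, so that each output element is one file's 2D list. That wiring is not shown here and would need a Beam-aware file reader for remote storage.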

tensorflow apache-beam tensorflow-transform

2017-04-21T20:51:27.870

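0 votes

1 answer

9391 views

python - Data normalization with TensorFlow tf-transform

I'm using TensorFlow to do neural network prediction on my own dataset. The first model I built used a small dataset on my computer. After that, I changed the code slightly so that training and prediction run in Google Cloud ML Engine with a larger dataset.

I am normalizing the features in a pandas DataFrame, but this introduces skew, and I get poor prediction results.

What I would really like is to use the tf-transform library to normalize my data in the graph. To do this, I would like to create a preprocessing_fn function and use tft.scale_to_0_1. https://github.com/tensorflow/transform/blob/master/getting_started.md

The main problem I found is when I try to make predictions. I've searched the internet, but I can't find any example of an exported model where the data was normalized in training. In all the examples I found, the data is not normalized anywhere.

What I would like to know is: if I normalize the data in training and then send a new instance with new data for prediction, how is that data normalized?

Maybe in the TensorFlow data pipeline? Are the variables used for the normalization saved somewhere?

In summary: I'm looking for a way to normalize the inputs of my model so that new instances are normalized too.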
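The underlying idea can be illustrated without any TensorFlow code: the minimum and maximum are computed once from the training data, stored, and then reused unchanged for every new instance. The sketch below is a plain-Python illustration of that idea only, not the tf.Transform API. The function names and the JSON storage are hypothetical choices made for the example.

```python
import json


def fit_min_max(values):
    """Compute the statistics once, over the training data only."""
    return {"min": min(values), "max": max(values)}


def scale_with_training_stats(value, stats):
    """Scale a value to [0, 1] using statistics saved from training."""
    span = stats["max"] - stats["min"]
    if span == 0:
        return 0.0  # assumption: a constant feature maps to 0.0
    return (value - stats["min"]) / span


training_feature = [10.0, 20.0, 30.0]
saved = json.dumps(fit_min_max(training_feature))  # stored next to the model

# At prediction time the saved statistics are loaded, not recomputed.
stats = json.loads(saved)
scaled_new_instance = scale_with_training_stats(25.0, stats)
print(scaled_new_instance)
```

The design point is that the new instance never influences the statistics, so training and prediction see the same transformation.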

python tensorflow google-cloud-platform google-cloud-ml tensorflow-transform

2017-09-28T17:02:59.157

0 votes

1 answer

877 views

python - TensorFlow GradientBoostedDecisionTreeClassifier error: "Dense float feature must be a matrix"

python machine-learning tensorflow gcp tensorflow-transform

2017-11-16T23:03:14.147

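0 votes

1 answer

638 views

tensorflow - Best data streaming and processing solution for enormous datasets in tf.data.Dataset

Context:

My text input pipeline currently consists of two main parts:

I. Complex text preprocessing and export of tf.SequenceExamples to tfrecords (custom tokenization, vocabulary creation, statistics calculation, normalization and more, over the full dataset as well as per individual example). This is done once for each data configuration.

II. A tf.Dataset (TFRecords) pipeline that also does quite a bit of processing during training (string_split into characters, table lookups, bucketing, conditional filtering, etc.).

The original dataset lives in multiple locations (BigQuery, GCS, RDS, ...).

Problem:

The problem is that, as the production dataset grows rapidly (several terabytes), it is not feasible to recreate a tfrecords file for every possible data configuration (part I has a lot of hyperparameters), because each one would have an enormous size of hundreds of terabytes. Not to mention that the tf.Dataset reading speed slows down surprisingly as the size of the tf.SequenceExamples or tfrecords grows.

There are quite a few possible solutions:

Apache Beam + Cloud DataFlow + feed_dict;
tf.Transform;
Apache Beam + Cloud DataFlow + tf.Dataset.from_generator;
tensorflow/ecosystem + Hadoop or Spark
tf.contrib.cloud.BigQueryReader

but none of them seems to fully meet my requirements, which are:

Streaming and processing data on the fly from BigQuery, GCS, RDS, ... as in part I.
Sending the data (protos?) directly, one way or another, to tf.Dataset to be used in part II.
Fast and reliable training and inference.
(optional) Being able to pre-calculate some full-pass statistics over a selected part of the data.
EDIT: Python 3 support would be great.

What is the most suitable choice for the tf.data.Dataset pipeline? What are the best practices in this case?

Thanks in advance!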
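For the generator-based option in the list, the part I work would be applied lazily per record instead of being materialized once per configuration. The sketch below shows only such a generator in plain Python, under the assumption that records arrive as (text, label) pairs. The function name, the parameters lowercase and max_tokens, and the whitespace tokenization are hypothetical stand-ins for the real preprocessing.

```python
def stream_examples(records, lowercase=True, max_tokens=4):
    """Yield (tokens, label) pairs one at a time, applying the configuration lazily."""
    for text, label in records:
        normalized = text.lower() if lowercase else text
        tokens = normalized.split()  # placeholder for the custom tokenization
        yield tokens[:max_tokens], label


# Any iterable source works; a list stands in for BigQuery/GCS/RDS rows here.
source = [("Hello TensorFlow Transform", 1), ("Apache Beam", 0)]
examples = list(stream_examples(source, lowercase=True, max_tokens=2))
print(examples)
```

A generator of this shape is the kind of callable that tf.data.Dataset.from_generator consumes, but that wiring, and whether its throughput is sufficient at this data size, is not shown or tested here.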

tensorflow google-bigquery google-cloud-dataflow tensorflow-datasets tensorflow-transform

2017-12-27T04:41:48.013

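0 votes

1 answer

434 views

python - Problem importing tf Transform on Cloud ML

Whenever I try to import tensorflow-transform in an ML Engine job, I run into the following problem:

TensorFlow Transform works fine on Dataflow, but I get the error above when I try to train the model. TensorFlow in general seems to work fine on ML Engine, but I run into the problem if I try to import just boosted_trees.python.ops. I'm using tf 1.4 and tft 0.4.0. The code I'm running is a slightly modified version of the cloudml-samples reddit_tft example.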

python tensorflow google-cloud-ml tensorflow-transform

2018-02-20T18:22:32.387

0 votes

1 answer

751 views

python - How to use tf.contrib.estimator.forward_features

I'm trying to use forward_features to get instance keys for cloudml, but I always get errors that I'm not sure how to fix. The preprocessing section that uses tf.Transform is a modification of https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/reddit_tft where the instance key is a string and everything else is a bunch of floats.

If I pass in the keys column alongside the training features, I get the error Tensors in list passed to 'values' of 'ConcatV2' Op have types [float32, float32, string, float32, float32, float32, float32, float32, float32, float32, float32, float32, float32, float32, float32, float32, float32, float32, float32, float32, float32, float32, float32, float32] that don't all match. However, if I don't pass in the instance keys during training, I get a ValueError saying that the key doesn't exist in the features. Also, if I change the key column name in the forward_features section from 'example_id' to some random name that isn't a column, I still get the former error instead of the latter. Can anyone help me make sense of this?

python tensorflow google-cloud-ml tensorflow-transform

2018-03-07T21:16:59.093

0 votes

1 answer

318 views

tensorflow-transform - Data type errors from trainable_variables

I'm running tensorflow-transform and getting errors on trainable_variables. Is it OK to get these messages?

(cmle-env) debasish:transform debasish.das$ python examples/simple_example.py
2018-03-20 14:36:30.468584: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA
ERROR:tensorflow:Cannot identify data type for collection trainable_variables. Skipping.
ERROR:tensorflow:Cannot identify data type for collection trainable_variables. Skipping.
ERROR:tensorflow:Cannot identify data type for collection trainable_variables. Skipping.
ERROR:tensorflow:Cannot identify data type for collection trainable_variables. Skipping.
WARNING:tensorflow:Expected binary or unicode string, got type_url: "type.googleapis.com/tensorflow.AssetFileDef" value: "\n\t\n\007Const:0\022\033vocab_string_to_int_uniques"

WARNING:tensorflow:Expected binary or unicode string, got type_url: "type.googleapis.com/tensorflow.AssetFileDef" value: "\n\t\n\007Const:0\022\033vocab_string_to_int_uniques"

ERROR:tensorflow:Cannot identify data type for collection trainable_variables. Skipping.
ERROR:tensorflow:Cannot identify data type for collection trainable_variables. Skipping.
ERROR:tensorflow:Cannot identify data type for collection trainable_variables. Skipping.
ERROR:tensorflow:Cannot identify data type for collection trainable_variables. Skipping.
[{u's_integerized': 0, u'x_centered': -1.0, u'x_centered_times_y_normalized': -0.0, u'y_normalized': 0.0}, {u's_integerized': 1, u'x_centered': 0.0, u'x_centered_times_y_normalized': 0.0, u'y_normalized': 0.5}, {u's_integerized': 0, u'x_centered': 1.0, u'x_centered_times_y_normalized': 1.0, u'y_normalized': 1.0}]

tensorflow-transform

2018-03-20T21:39:53.353

0 votes

1 answer

244 views

google-cloud-platform - Pipeline fails on GCP when writing tensorflow transform metadata

I hope someone here can help. I've been searching frantically for this error but haven't found anything.

I have a pipeline that runs perfectly when executed locally, but it fails when executed on GCP. Below is the error message I receive.

Workflow failed. Causes: S03:Write transform fn/WriteMetadata/ResolveBeamFutures/CreateSingleton/Read+Write transform fn/WriteMetadata/ResolveBeamFutures/ResolveFutures/Do+Write transform fn/WriteMetadata/WriteMetadata failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:

Traceback (most recent call last):
  File "preprocess.py", line 491, in main()
  File "preprocess.py", line 487, in main
    transform_data(args,pipeline_options,runner)
  File "preprocess.py", line 451, in transform_data
    eval_data |= 'Identity eval' >> beam.ParDo(Identity())
  File "/Library/Python/2.7/site-packages/apache_beam/pipeline.py", line 335, in __exit__
    self.run().wait_until_finish()
  File "/Library/Python/2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 897, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 166, in execute
    op.start()
  File "apache_beam/runners/worker/operations.py", line 294, in apache_beam.runners.worker.operations.DoOperation.start (apache_beam/runners/worker/operations.c:__new__(cls, *args)
TypeError: __new__() takes exactly 4 arguments (1 given)

Any ideas?

Thanks,

Pedro

google-cloud-platform google-cloud-dataflow apache-beam dataflow tensorflow-transform

2018-03-29T00:16:19.937

0 votes

0 answers

214 views

python - Preprocessing data for TensorFlow: InvalidArgumentError

When I run my TensorFlow model, I get this error:

InvalidArgumentError: Field 4 in record 0 is not a valid float: latency [[Node: DecodeCSV = DecodeCSV[OUT_TYPE=[DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_STRING], field_delim=",", na_value="", use_quote_delim=true](arg0, DecodeCSV/record_defaults_0, DecodeCSV/record_defaults_1, DecodeCSV/record_defaults_2, DecodeCSV/record_defaults_3, DecodeCSV/record_defaults_4, DecodeCSV/record_defaults_5, DecodeCSV/record_defaults_6, DecodeCSV/record_defaults_7, DecodeCSV/record_defaults_8, DecodeCSV/record_defaults_9, DecodeCSV/record_defaults_10, DecodeCSV/record_defaults_11, DecodeCSV/record_defaults_12, DecodeCSV/record_defaults_13)]] [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?], [?], [?], [?], [?], [?], [?], [?], [?], [?], [?], [?], [?], [?]], output_types=[DT_STRING, DT_STRING, DT_FLOAT, DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_FLOAT, DT_FLOAT, DT_STRING, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]

I believe the problem is in the preprocessing step that creates the CSV file my model reads from, because the model expects to receive the schema of the first Node but instead gets the schema of the second Node.

Here is my preprocessing code (I intentionally cast the dtypes coming from the query to make sure they are read correctly):

Is there a way to print out the dtypes before they are sent to the Beam pipeline, so that I can inspect the dtypes array and make sure it is valid? Is there anything in the code that would cause the dtypes to differ, or to be in a different order than specified in the CSV_COLUMNS variable?
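The message names field 4 of record 0 and the value "latency", which looks like a column name rather than a number, so one possibility worth checking is a header row in the CSV. The sketch below is a plain-Python check under that assumption: it reports every field that is expected to be a float but cannot be parsed as one. The column names and expected types are hypothetical, since the real CSV_COLUMNS is not shown.

```python
import csv
import io

# Hypothetical columns and types; the real CSV_COLUMNS is not shown in the question.
CSV_COLUMNS = ["host", "region", "service", "method", "latency"]
EXPECTED_TYPES = [str, str, str, str, float]


def find_bad_fields(csv_text, expected_types):
    """Return (record_index, field_index, value) for fields that fail float parsing."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    for record_index, row in enumerate(reader):
        for field_index, (value, expected) in enumerate(zip(row, expected_types)):
            if expected is float:
                try:
                    float(value)
                except ValueError:
                    problems.append((record_index, field_index, value))
    return problems


sample = "host,region,service,method,latency\na,b,c,d,0.5\n"
problems = find_bad_fields(sample, EXPECTED_TYPES)
print(problems)
```

Running a check like this on the first few lines of the generated file would show whether the failure comes from a header line or from columns written in a different order than the reader expects.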

python apache-beam tensorflow-transform

2018-04-19T01:57:21.450
