“direct-runner”的相关标签问题

0 投票

1 回答

2389 浏览

eclipse - 如何使用直接运行器在 Eclipse 中调试 Dataflow/Apache Beam 管道 DoFn 函数

我想在 Eclipse 中使用直接运行程序运行我的管道，并在我的 DoFn 函数和调试执行中放置一个断点。我尝试通过以下步骤设置直接跑步者：

添加直接runner maven包
在 pom.xml 中为直接运行器设置 maven 配置文件。我的 pom.xml 有这个配置文件

<profiles> <profile> <id>direct-runner</id> <activation> <activeByDefault>true</activeByDefault> </activation> <dependencies> <dependency> <groupId>org.apache.beam</groupId> <artifactId>beam-runners-direct-java</artifactId> <version>0.2.0-incubating</version> </dependency> </dependencies> </profile> </profiles>

我在 pom.xml 的插件管理下有这个maven 插件

<pluginManagement> <plugins> <plugin> <groupId>org.codehaus.mojo</groupId> <artifactId>exec-maven-plugin</artifactId> <version>1.4.0</version> <executions> <execution> <goals> <goal>java</goal> </goals> </execution> </executions> <configuration> <cleanupDaemonThreads>false</cleanupDaemonThreads> <mainClass>com.MyMainClass</mainClass> </configuration> </plugin> </plugins> </pluginManagement>

下面是我的 Eclipse 调试配置的屏幕截图当我使用上述调试配置作业在 GCP 数据流而不是本地 JVM 线程中运行时，我的断点永远不会被命中。

2017-06-14T04:21:50.537

0 投票

2 回答

3215 浏览

apache-beam - 在 Apache Beam 中写入文件

我正在通过 DirectRunner 使用 Apache Beam 在 Windows 中运行 WordCount 程序。我可以看到在临时文件夹（在 src/main/resources/ 下）中创建了输出文件。但是对输出文件的写入却失败了。下面是代码片段：

请让我知道它期望的输出目录/文件的格式提前谢谢

以下是错误：添加异常：由：java.lang.IllegalStateException：无法在 org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:447) at org.apache.beam 找到 i 的注册商.sdk.io.FileSystems.match(FileSystems.java:111) 在 org.apache.beam.sdk.io.FileSystems.matchResources(FileSystems.java:174) 在 org.apache.beam.sdk.io.FileSystems.delete (FileSystems.java:321) 在 org.apache.beam.sdk.io.FileBasedSink$Writer.cleanup(FileBasedSink.java:905) 在 org.apache.beam.sdk.io.WriteFiles$WriteShardedBundles.processElement(WriteFiles.java :376)

apache-beam direct-runner

2017-09-15T15:03:29.277

0 投票

1 回答

184 浏览

google-cloud-platform - 使用 Dataflow DirectRunner 将文件暂存到 GCS

因此，在使用 DataflowRunner 时，我们使用 filesToStage 方法将文件暂存到 GCS，但在 DirectRunner 中不会发生这种情况。有没有办法让 DirectRunner 将文件暂存到 GCS 并使用类似于 DataflowRunner 的文件，也许可以使用 ClassLoader 或其他方法？

google-cloud-platform google-cloud-storage google-cloud-dataflow direct-runner

2018-05-03T14:29:23.700

0 投票

1 回答

575 浏览

python - 带有 DirectRunner 的 Apache Beam (SUBPROCESS_SDK) 仅使用一个工作人员，我如何强制它使用所有可用的工作人员？

以下代码：

仅使用 4 个可用工作程序中的一个，并且仅生成一个大输出文件（即使有许多输入文件）。

如何强制 Beam 管道并行工作，即如何强制每个输入文件由不同的工作人员单独处理？

python apache-beam apache-beam-io direct-runner

2019-11-04T19:01:18.217

0 投票

1 回答

304 浏览

java - JAVA - Apache BEAM- GCP：GroupByKey 与 Direct Runner 一起工作正常，但与 Dataflow runner 一起失败

我使用 Dataflow 运行程序测试了我的代码，但是它返回了一个错误：

请注意，我在 Direct Runner 中使用了相同的代码，它工作得很好。有没有人遇到过这个问题？如果是这样，你能告诉我如何解决吗？或者我应该用另一个函数替换 GroupByKey ...？

这是代码：

Apache Beam 版本：2.17.0

java google-cloud-dataflow apache-beam dataflow direct-runner

2020-03-11T15:52:55.987

0 投票

0 回答

35 浏览

apache-spark - DirectRunner 火花模式下的内存分析

我正在使用 YourKit 进行内存分析，为了简化 Spark 应用程序的问题，我在 DirectRunner 模式下运行该应用程序。我正在测试的机器有 32 个内核。捕获的快照如下所示：

“direct-runner-worker”有 32 个线程，我似乎错误地假设直接运行器只占用一个线程。我的问题是 - 并行化线程的数量不应该有限制吗？在快照中，线程占用 250 到 350 MB，这将不可避免地崩溃。

另一个问题是我不确定我是否应该遵循http://spark.apache.org/developer-tools.html#profiling对于我的情况，文档似乎是针对使用 SparkCluster 运行的应用程序，但因为我使用的是 DirectRunner （出于调试目的）那么也许我所做的一切都足够好 - 有没有人有这方面的经验？

任何指针表示赞赏！:)

PS：我对 2.15 亿个对象的创建感到震惊，但这应该会随着线程数的增加而下降。但是，每个线程约 600 万个对象似乎很多。

apache-spark heap-memory memory-profiling yourkit direct-runner

2020-05-22T02:13:01.767

0 投票

1 回答

140 浏览

java - DirectOptions 类中缺少选项

文档提到了以下选项：以及direct_num_workers设置direct_running_mode选项streaming。

DirectOptions 类中缺少所有这些

此外，当尝试从args以下异常中设置它们时：

有人设法使用这些吗？如何？

java apache-beam direct-runner

2020-10-03T19:22:50.027

0 投票

0 回答

141 浏览

python - GCP Dataflow 管道在 DirectRunner 中的运行速度比 DataflowRunner 快

我对使用 Dataflow (GCP) 很陌生。我建了一个在 DirectRunner 模式下运行比 DataflowRunner 模式更快的管道，我不知道如何改进它。该管道从 Bigquery 中的多个表中读取数据并返回一个 csv 文件，它接收日期作为执行参数来过滤查询：

python google-cloud-platform pipeline dataflow direct-runner

2021-01-05T14:43:59.790

0 投票

1 回答

345 浏览

python - 带有 Cloud Pub/Sub 的 Apache Beam DirectRunner

我正在尝试将数据从 Cloud Pub/Sub 传递到 Google Cloud Storage。当我使用 runnerDataflowRunner时，管道会发布到 Google Cloud Dataflow 并按预期工作。但是，对于某些测试，我希望管道在本地运行（但仍从 Cloud Pub/Sub 读取并写入 Cloud Storage）。当我使用运行程序 DirectRunner 时，该进程会写出INFO:apache_beam.runners.direct.direct_runner:Running pipeline with DirectRunner.，但是当新消息发布到 Pub/Sub 时什么也不做。

我正在使用以下命令执行管道：

完整的 dev_radim_dataflow_gcs_direct.py 文件在这里：https ://pastebin.com/W7VphH5A

任何想法为什么消息不能从 Pub/Sub 到 GCS？

python google-cloud-platform apache-beam direct-runner

2021-02-23T06:51:55.487

0 投票

1 回答

223 浏览

python - 错误：找不到满足 grpcio<2,>=1.29.0 要求的版本（来自 apache-beam[gcp]）

我在 Dataflow 中执行 Apache Beam 管道时遇到问题（使用 DirectRunner）。我有一个requirements.txt文件，其中包含 apache-beam[gcp] 以及其他库。

以下是错误 TraceBack：

最后，我收到以下错误：

我使用 pip 21 和 python 3.8

有没有人已经遇到过类似的问题，并找到了解决方法？

谢谢

python google-cloud-dataflow apache-beam direct-runner grpcio

2021-08-04T12:14:14.543

问题标签 [direct-runner]

Reference