
We use the Python SDK for Apache Beam in a Google Dataflow environment. The tooling is great, but we are concerned about the privacy of these jobs, because it looks like public IPs are used to run the workers. Our questions are:

  • Even if we specify a network and a subnetwork, do we still need to worry about public IPs being used?
  • What exactly is the difference in performance and security when public IPs are restricted?
  • How can we set up Dataflow so that all of its workers are created on private IPs? In theory, according to the documentation, the template below should already be configured to disallow that behavior (and yet it still happens)!

Our job template looks like this:

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, StandardOptions, WorkerOptions)

options = PipelineOptions(flags=['--requirements_file', './requirements.txt'])

# Google Cloud options
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT
google_cloud_options.job_name = job_name
google_cloud_options.staging_location = 'gs://{}/staging'.format(BUCKET)
google_cloud_options.temp_location = 'gs://{}/temp'.format(BUCKET)
google_cloud_options.region = REGION

# Worker options
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK
worker_options.max_num_workers = 25

options.view_as(StandardOptions).runner = RUNNER




### Note that we set worker_options.subnetwork to our own subnetwork. However, once we run the job, it still appears to create workers with public IPs.


### The pipeline is then run like this:

p = beam.Pipeline(options=options)

...

run = p.run()
run.wait_until_finish()

Thanks!


1 Answer


You also need to pass the --no_use_public_ips option; see https://cloud.google.com/dataflow/docs/guides/specifying-networks#python
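For completeness, here is a minimal sketch of how that flag could be wired into the options from the question. It reuses the NETWORK placeholder from the question; the commented-out use_public_ips attribute is an assumption about how the flag is exposed on WorkerOptions in recent Beam releases, so prefer the documented flag if in doubt. Also note that once public IPs are disabled, the subnetwork generally needs Private Google Access enabled so the workers can still reach Google APIs and services.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions

# Build the options with the documented flag so Dataflow creates the
# workers without external (public) IP addresses.
options = PipelineOptions(flags=[
    '--requirements_file', './requirements.txt',
    '--no_use_public_ips',
])

# Keep the subnetwork setting from the question.
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK  # same placeholder as in the question
worker_options.max_num_workers = 25

# Alternative (assumption): the flag should correspond to setting
# worker_options.use_public_ips = False directly on WorkerOptions.

Everything else in the pipeline (the GoogleCloudOptions block, the runner, p.run() and wait_until_finish()) can stay exactly as shown in the question.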

Answered 2021-05-05T00:16:19.900