We use the Python SDK for Apache Beam on Google Dataflow. It's a great tool, but we are concerned about the privacy of these jobs, because it looks like workers are run with public IPs. Our questions are:
- Do we still have to worry about public IPs even if we specify a network and subnetwork?
- What exactly are the performance and security differences of restricting public IPs?
- How do we set up Dataflow so that all workers are created with private IPs? In theory, the template below should already configure the pipeline to disallow that behavior according to the documentation (yet it still happens)!
Our job template looks like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import (
    PipelineOptions,
    GoogleCloudOptions,
    StandardOptions,
    WorkerOptions,
)

options = PipelineOptions(flags=['--requirements_file', './requirements.txt'])

# Google Cloud options
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT
google_cloud_options.job_name = job_name
google_cloud_options.staging_location = 'gs://{}/staging'.format(BUCKET)
google_cloud_options.temp_location = 'gs://{}/temp'.format(BUCKET)
google_cloud_options.region = REGION

# Worker options
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK
worker_options.max_num_workers = 25

options.view_as(StandardOptions).runner = RUNNER
### Note that we set worker_options.subnetwork to our own subnetwork. However, once we run the job, it still looks like it creates workers on public IPs.
### The code runs like this in the end:
p = beam.Pipeline(options=options)
...
...
...
run = p.run()
run.wait_until_finish()
Thanks!