
I want to run an ARIMA model over thousands of CSV files, in various combinations, using PyFlux. Here is some Python code:

import pandas as pd
import pyflux as pf

# filenames holds the paths of thousands of CSV files
for index, csvfile in enumerate(filenames):
    data = pd.read_csv(csvfile)
    model = pf.ARIMA(data=data, ar=4, ma=4, integ=0, target='sunspot.year')
    x = model.fit("MLE")
    list_of_results[index] = list_of_tuples[index] + (x.summary(),)

I can load these CSVs into BigQuery. Since fitting the ARIMA model on each file (or each BigQuery result) is independent of the others, I want to parallelize the step that sends the data through the model, so that I can save significant time on this operation.

Is there a way to achieve this in Google Dataflow?
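Independently of Dataflow, the per-file fan-out described above can be sketched with the standard library's `concurrent.futures`, since each file is an independent task. This is only a sketch: `fit_one` and `fit_all` are hypothetical names, and the PyFlux fit is replaced by a stub standing in for the `pd.read_csv` / `pf.ARIMA(...).fit("MLE")` calls.

```python
from concurrent.futures import ProcessPoolExecutor

def fit_one(csvfile):
    # Stand-in for: pd.read_csv(csvfile) followed by
    # pf.ARIMA(...).fit("MLE"). Returning (file, summary)
    # lets results be matched back to their source file.
    return (csvfile, "summary for %s" % csvfile)

def fit_all(filenames, workers=4):
    # Each file is independent, so fits can run in separate
    # processes; map() preserves the input order of filenames.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fit_one, filenames))
```

This exploits the same independence a Dataflow pipeline would, but on a single machine, so it does not scale past one node.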


1 Answer


If all the CSV files are on GCS, you should be able to create a simple pipeline that reads them from GCS and runs your model on each element in parallel.

See the ParDo documentation on processing all elements in parallel: https://cloud.google.com/dataflow/model/par-do

Answered 2017-03-08T02:37:43.917