I have a product for defining and configuring business workflows. One part of this product is a form builder that enables users to set up different forms.
All of this form data is backed by MongoDB in the following structure:
- form_schemas
{
  "_id" : "",
  "name" : "",
  "account_id" : "",
  "fields" : [
    {
      "label" : "Enter Email",
      "name" : "email",
      "type" : "text",
      "required" : "true",
      "hidden" : "false",
      "additional_config" : { }
    },
    {
      "label" : "Select DOB",
      "name" : "dob",
      "type" : "date",
      "required" : "true",
      "hidden" : "false",
      "additional_config" : { }
    }
    ...
  ]
}
- form_datas
{
  "workflow_template_id" : "",
  "account_id" : "",
  "data" : {
    "email" : "xyx@gmail.com",
    "dob" : "2001-04-05"
  },
  "created_at" : "",
  "updated_at" : ""
}
As seen above, the forms can belong to various different businesses. I am now looking at a data pipeline to transport this data to Google BigQuery at periodic intervals for analysis.
On the BigQuery side, I maintain a separate table for each workflow.
I have a current working solution written entirely on Google Cloud Functions. A Cloud Scheduler job runs at periodic intervals and invokes the different Cloud Functions. At a high level, each Cloud Function does the following (a rough sketch follows the list):
- Iterate over each form schema
- Read the data from MongoDB for that schema since the last run (as a cursor)
- For each row of data, run the custom transformation logic (this includes transforming various nested data types like grids/lookups etc.)
- Stream each row of transformed data as NDJSON to Google Cloud Storage
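A minimal sketch of that flow, assuming pymongo and the google-cloud-storage client; the database/collection/bucket names and the transform() body are placeholders, and the NDJSON payload is buffered in memory here for brevity rather than streamed:

```python
import json
from datetime import datetime, timezone

from google.cloud import storage
from pymongo import MongoClient


def transform(doc):
    # Placeholder for the custom transformation logic
    # (grids/lookups and other nested types).
    return {"email": doc["data"].get("email"), "dob": doc["data"].get("dob")}


def export_since(mongo_uri, last_run, bucket_name):
    # Read form_datas updated since the last run, transform row by row,
    # and write a single NDJSON object to GCS.
    coll = MongoClient(mongo_uri)["forms_db"]["form_datas"]  # assumed db name
    cursor = coll.find({"updated_at": {"$gte": last_run}})

    payload = "\n".join(json.dumps(transform(doc)) for doc in cursor)

    blob = storage.Client().bucket(bucket_name).blob(
        f"exports/form_datas_{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.ndjson")
    blob.upload_from_string(payload, content_type="application/x-ndjson")
```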
The above solution provides me with:
- Complete control over the transformation
- Simple deployment
However, since it all runs on Cloud Functions, I am bound by the 9-minute limit per invocation. This essentially forces a lot of pagination, especially if there is a need to regenerate the complete data from the beginning of time.
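To illustrate that pagination, a sketch of the checkpoint pattern each invocation ends up following, assuming the last processed _id is persisted somewhere between runs (e.g. Firestore or a GCS object); all names are placeholders:

```python
from bson import ObjectId
from pymongo import MongoClient


def export_batch(mongo_uri, last_id=None, batch_size=5000):
    # One invocation handles a single batch and returns the _id checkpoint
    # for the next scheduled run, keeping each run well under 9 minutes.
    coll = MongoClient(mongo_uri)["forms_db"]["form_datas"]  # assumed db name

    query = {"_id": {"$gt": ObjectId(last_id)}} if last_id else {}
    batch = list(coll.find(query).sort("_id", 1).limit(batch_size))
    if not batch:
        return None  # backfill finished

    for doc in batch:
        pass  # transform + write NDJSON, as in the sketch above

    return str(batch[-1]["_id"])  # persist this for the next invocation
```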
While the above solution works fine for now, I was looking at other serverless options like Google Dataflow. Since I am just getting started with Dataflow/Apache Beam, I was wondering:
If I were to write a pipeline in Beam, should I go with the same approach of
- Extract (row by row) -> Transform -> Load (GCS) -> Load (BQ)
or
- Extract (entire data as JSON) -> Load to GCS -> Transform (Beam) -> Load to GCS -> Load to BQ
Let me know if there is a better option for the overall data processing.
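For reference, a rough sketch of what the first option might look like with the Beam Python SDK, using the built-in MongoDB source; connection strings, bucket/table names and the transform() body are placeholders, and the second (GCS-first) option would swap the MongoDB read for a ReadFromText over the exported JSON:

```python
import json

import apache_beam as beam
from apache_beam.io.mongodbio import ReadFromMongoDB
from apache_beam.options.pipeline_options import PipelineOptions


def transform(doc):
    # Placeholder for the existing per-row transformation logic.
    return {"email": doc["data"].get("email"), "dob": doc["data"].get("dob")}


def run():
    # Options are placeholders; use DirectRunner locally or DataflowRunner
    # with project/region/temp_location set for Dataflow.
    with beam.Pipeline(options=PipelineOptions()) as p:
        rows = (
            p
            | "ReadMongo" >> ReadFromMongoDB(
                uri="mongodb://host:27017",  # placeholder connection string
                db="forms_db",               # placeholder database
                coll="form_datas")
            | "Transform" >> beam.Map(transform)
        )

        # Branch 1: NDJSON files on GCS (same output as the Cloud Function)
        (rows
         | "ToNDJSON" >> beam.Map(json.dumps)
         | "WriteGCS" >> beam.io.WriteToText(
             "gs://my-bucket/exports/form_datas",  # placeholder bucket/prefix
             file_name_suffix=".ndjson"))

        # Branch 2: load straight into the per-workflow BigQuery table
        (rows
         | "WriteBQ" >> beam.io.WriteToBigQuery(
             "my-project:forms.workflow_x",        # placeholder table
             schema="email:STRING,dob:DATE",
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))


if __name__ == "__main__":
    run()
```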