For a script that I am running, I have a bunch of chained views that look at a specific set of data in SQL (I am using Apache Spark SQL):
%sql
create view view_1 as
select column_1, column_2 from original_data_table
This logic culminates in view_n.
However, I then need to perform logic that is difficult (or impossible) to implement in SQL, specifically the explode command:
%python
from pyspark.sql.functions import explode, split

df_1 = sqlContext.sql("SELECT * FROM view_n")
df1_exploded = df_1.withColumn("exploded_column", explode(split(df_1.col_to_explode, ',')))
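(For intuition, the split-then-explode step just turns each comma-separated string into one output row per element. A minimal plain-Python sketch of that row-wise behaviour, using hypothetical sample rows and no Spark:)

```python
# Hypothetical sample of what view_n might return: (id, col_to_explode) pairs.
rows = [(1, "a,b"), (2, "c")]

# split(col, ',') turns "a,b" into ["a", "b"];
# explode then emits one output row per list element.
exploded = [
    (row_id, element)
    for row_id, value in rows
    for element in value.split(",")
]
# exploded == [(1, "a"), (1, "b"), (2, "c")]
```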
My Questions:
Is there a speed cost associated with switching from SQL tables to PySpark DataFrames and back? Or, since PySpark DataFrames are lazily evaluated, is it very similar to a view?
Is there a better way of switching from an SQL table to a PySpark DataFrame?