I'm trying to perform a simple task with a Spark DataFrame (Python): creating a new DataFrame by selecting a specific column and some nested columns from another DataFrame. For example:
df.printSchema()
root
|-- time_stamp: long (nullable = true)
|-- country: struct (nullable = true)
| |-- code: string (nullable = true)
| |-- id: long (nullable = true)
| |-- time_zone: string (nullable = true)
|-- event_name: string (nullable = true)
|-- order: struct (nullable = true)
| |-- created_at: string (nullable = true)
| |-- creation_type: struct (nullable = true)
| | |-- id: long (nullable = true)
| | |-- name: string (nullable = true)
| |-- destination: struct (nullable = true)
| | |-- state: string (nullable = true)
| |-- ordering_user: struct (nullable = true)
| | |-- cancellation_score: long (nullable = true)
| | |-- id: long (nullable = true)
| | |-- is_test: boolean (nullable = true)
df2=df.sqlContext.sql("""select a.country_code as country_code,
a.order_destination_state as order_destination_state,
a.order_ordering_user_id as order_ordering_user_id,
a.order_ordering_user_is_test as order_ordering_user_is_test,
a.time_stamp as time_stamp
from
(select
flat_order_creation.country.code as country_code,
flat_order_creation.order.destination.state as order_destination_state,
flat_order_creation.order.ordering_user.id as order_ordering_user_id,
flat_order_creation.order.ordering_user.is_test as order_ordering_user_is_test,
flat_order_creation.time_stamp as time_stamp
from flat_order_creation) a""")
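To make it clearer which columns I'm after, here is the same selection sketched with the DataFrame API (just a rough sketch based on the schema above, not code from my script; the SQL version above is what I actually need to run):

from pyspark.sql.functions import col

# intended result: one flat column per nested field, addressed with dotted paths
df2 = df.select(
    col("country.code").alias("country_code"),
    col("order.destination.state").alias("order_destination_state"),
    col("order.ordering_user.id").alias("order_ordering_user_id"),
    col("order.ordering_user.is_test").alias("order_ordering_user_is_test"),
    col("time_stamp"))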
When I run the SQL query, I get the following error:
Traceback (most recent call last):
File "/home/hadoop/scripts/orders_all.py", line 180, in <module>
df2=sqlContext.sql(q)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 552, in sql
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, in deco
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o60.sql.
: java.lang.RuntimeException: [6.21] failure: ``*'' expected but `order' found
flat_order_creation.order.destination.state as order_destination_state,
I'm running this code with spark-submit, with the master in local mode. It's important to mention that when I connect to the pyspark shell and run the code line by line, it works, but when I submit it with spark-submit (even in local mode) it fails. Another important detail is that selecting a non-nested field works fine in both cases. I'm using Spark 1.5.2 on EMR (release 4.2.0).
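For comparison, a query that only touches top-level (non-nested) columns, along the lines of the one below, runs without a problem when submitted the same way (this is just an illustrative example, not the exact query from my script):

# top-level columns only, no nested struct access; sqlContext is the context from the script
df2 = sqlContext.sql("select time_stamp, event_name from flat_order_creation")
df2.show()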