The situation is as follows: working on an enterprise cluster with Spark 2.3, I want to run a pandas_udf, which requires pyarrow, which in turn requires numpy >= 1.14 (AFAIK). I have been able to distribute pyarrow (I think; I have no way of verifying this 100%):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas_udf_poc")\
    .config("spark.executor.instances", "2")\
    .config("spark.executor.memory", "8g")\
    .config("spark.driver.memory", "8g")\
    .config("spark.driver.maxResultSize", "8g")\
    .config("spark.submit.pyFiles", "pyarrow_depnd.zip")\
    .getOrCreate()
spark.sparkContext.addPyFile("pyarrow_depnd.zip")
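To at least spot-check what the executors pick up, something like the following should work (the function name and the throwaway two-partition RDD are my own illustrative choices, not anything Spark-specific):

def report_versions(_):
    # Import inside the function so it runs on the executors,
    # then report which numpy/pyarrow each worker actually loads.
    import numpy, pyarrow
    yield (numpy.__version__, numpy.__file__,
           pyarrow.__version__, pyarrow.__file__)

print(spark.sparkContext.parallelize(range(2), 2)
      .mapPartitions(report_versions).collect())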
The zip is the result of pip install to dir and zipping it.
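A rough Python equivalent of the shell steps I used (the target directory name "pyarrow_depnd" is just what I happened to pick):

import shutil, subprocess

# pip install into a local directory, then zip its contents
# so the packages sit at the root of the archive.
subprocess.check_call(["pip", "install", "pyarrow", "--target", "pyarrow_depnd"])
shutil.make_archive("pyarrow_depnd", "zip", root_dir="pyarrow_depnd")  # -> pyarrow_depnd.zip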
But pyarrow does not play along with the nodes' numpy 1.13. I guess I could try to distribute a full environment to all nodes, but my question is: is there a way to avoid this and make the nodes use a different numpy (one that is already distributed in the pyarrow zip)?
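For reference, the kind of call that triggers the failure is roughly this (the column names and the trivial +1 logic are made up for illustration; this is the standard Spark 2.3 scalar pandas_udf pattern):

from pyspark.sql.functions import col, pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    # v arrives as a pandas.Series; pyarrow handles the JVM <-> pandas
    # transfer, which is where the numpy mismatch surfaces on the executors.
    return v + 1

df = spark.range(0, 10).withColumn("x", col("id").cast("double"))
df.select(plus_one("x")).show()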
Thanks