Currently I am connecting to Databricks from local VS Code via databricks-connect. However, every job I submit fails with a module-not-found error, meaning the code in my other Python files is not being found. I have tried:

- moving the code into the folder that contains main.py
- importing the file inside the function that uses it
- adding the file via sparkContext.addPyFile

Does anyone have experience with this, or know a better way to work with Databricks from a Python project?

It seems that the pure-Python part of my code is executed in my local Python environment and only the Spark-related code runs on the cluster, but the cluster does not load all of my Python files, which then raises the error.
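To illustrate the second attempt above (importing inside the function), this is the pattern I used — a minimal sketch, where `make_foo` is a hypothetical helper name; the idea is that the import is resolved on the worker at call time, after `addPyFile` has shipped `lib222.py`, rather than at pickle time on the driver:

```python
def make_foo(i):
    # Deferred import: resolved when the function actually runs on the
    # executor, not when the closure is pickled on the driver.
    from lib222 import Foo
    return Foo(i)

# Usage with Spark (assumes a SparkContext `sc` and that
# sc.addPyFile("lib222.py") has already been called):
# sc.parallelize(range(10)).map(make_foo).collect()
```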
I have a folder:

```
main.py
lib222.py
__init__.py
```

with class Foo defined in lib222.py.

The main code is:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
# sc.setLogLevel("INFO")

print("Testing addPyFile isolation")
sc.addPyFile("lib222.py")

from lib222 import Foo
print(sc.parallelize(range(10)).map(lambda i: Foo(2)).collect())
```
But I get a ModuleNotFoundError for lib222.

Also, when I print some sys info such as the Python version, the Python code appears to execute on my local machine rather than on the remote driver. My Databricks Runtime version is 6.6. Detailed error:
```
Exception has occurred: Py4JJavaError
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 10.139.64.8, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'lib222'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 462, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/databricks/spark/python/pyspark/worker.py", line 71, in read_command
    command = serializer._read_with_length(file)
  File "/databricks/spark/python/pyspark/serializers.py", line 185, in _read_with_length
    raise SerializationError("Caused by " + traceback.format_exc())
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'lib222'
```
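For reference, one thing I am considering next is bundling all of the project's .py files into a zip and shipping that single archive via sc.addPyFile, so every module is on the executors' sys.path at once. A minimal sketch, where `build_pyfiles_zip` is a hypothetical helper name:

```python
import os
import zipfile

def build_pyfiles_zip(src_dir, zip_path):
    """Bundle every .py file under src_dir into a zip archive that
    sc.addPyFile (or spark-submit --py-files) can ship to executors."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.join(root, name)
                    # Store paths relative to src_dir so `import lib222`
                    # resolves the same way on the executors.
                    zf.write(full, os.path.relpath(full, src_dir))
    return zip_path

# Usage with Spark (assumes a SparkContext `sc`):
# sc.addPyFile(build_pyfiles_zip("my_project", "deps.zip"))
```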