amazon-web-services - How do I save a machine learning model (KMeans) to S3 from a Glue ETL job written in PySpark?

Question

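I tried model.save(sc, path), which gave me the error: TypeError: save() takes 2 positional arguments but 3 were given. Here sc is the SparkContext [sc = SparkContext()].

I then tried calling it without sc, but got this error: An error occurred while calling o159.save. java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.mapred.DirectOutputCommitter not found

I also tried several approaches using boto3, pickle, and joblib, but could not find a working solution.

I am building a KMeans clustering model. I need one Glue job to fit the model and save it to S3, and then a second Glue job to load the saved model and make predictions. This is my first time doing this, so any help would be appreciated.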

score 0 · Accepted Answer

Adding the following line right after creating the SparkContext solved my problem.

sc = SparkContext()

sc._jsc.hadoopConfiguration().set("mapred.output.committer.class", "org.apache.hadoop.mapred.DirectFileOutputCommitter")
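
For the two-job workflow described in the question, a minimal sketch might look like the following. It assumes the model comes from pyspark.ml.clustering, whose save() takes only a path (the older pyspark.mllib API is the one that takes sc as well, which would explain the TypeError). The helper names, the bucket, and the prefix are hypothetical, and the input DataFrame is assumed to already have a "features" vector column.

```python
# Sketch only: helper names, bucket, and prefix are hypothetical.

def model_uri(bucket, prefix):
    """Build the S3 URI under which the model directory is written."""
    return "s3://{}/{}".format(bucket, prefix.strip("/"))


def set_direct_committer(sc):
    """Apply the Hadoop setting from the answer to an existing SparkContext."""
    sc._jsc.hadoopConfiguration().set(
        "mapred.output.committer.class",
        "org.apache.hadoop.mapred.DirectFileOutputCommitter",
    )


def fit_and_save(features_df, path, k=3):
    """Job 1: fit KMeans on a DataFrame with a 'features' column, then save."""
    # Imported lazily so the helpers above can be used without Spark installed.
    from pyspark.ml.clustering import KMeans

    model = KMeans(k=k, featuresCol="features").fit(features_df)
    # pyspark.ml writers take only a path; overwrite() replaces an earlier save.
    model.write().overwrite().save(path)
    return model


def load_and_predict(features_df, path):
    """Job 2: load the saved model and append a 'prediction' column."""
    from pyspark.ml.clustering import KMeansModel

    model = KMeansModel.load(path)
    return model.transform(features_df)
```

In the first job this would be called as set_direct_committer(sc) followed by fit_and_save(df, model_uri("my-bucket", "models/kmeans")); the second job would call load_and_predict(df, ...) with the same URI. Whether the committer setting is still needed depends on the Glue and Spark versions in use.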
