16

I'm using PySpark with MLlib on Spark 1.3.0, and I need to save and load my models. I use code like this (taken from the official documentation):

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
predictions.collect() # shows me some predictions
model.save(sc, "model0")

# Trying to load saved model and work with it
model0 = MatrixFactorizationModel.load(sc, "model0")
predictions0 = model0.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))

After I try to use model0, I get a long traceback that ends with this:

Py4JError: An error occurred while calling o70.predict. Trace:
py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
    at py4j.Gateway.invoke(Gateway.java:252)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

So my question is: am I doing something wrong? As far as I can tell from debugging, my models are stored (both locally and on HDFS) and contain many files with some data in them. My feeling is that the models are saved correctly but probably not loaded correctly. I've also searched around but found nothing relevant.

It looks like this save/load functionality was only recently added in Spark 1.3.0, so I have another question: what was the recommended way to save/load models before the 1.3.0 release? I haven't found any good way to do this, at least for Python. I also tried Pickle, but ran into the same problems described here: Save Apache Spark mllib model in python
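One workaround that is sometimes suggested for ALS models specifically is to persist only the learned factor matrices (the model's `userFeatures()` and `productFeatures()` RDDs are plain data and can be collected and pickled), then rebuild predictions as dot products on reload. A minimal sketch, using small hypothetical factor dictionaries in place of the collected RDDs:

```python
import pickle

# Hypothetical factor vectors, standing in for what collecting
# model.userFeatures() / model.productFeatures() would yield
user_features = {1: [0.5, 1.2], 2: [0.3, 0.8]}
product_features = {10: [1.0, 0.4], 20: [0.7, 0.9]}

# Persist only the factor matrices, which are plain picklable data
with open("als_factors.pkl", "wb") as f:
    pickle.dump((user_features, product_features), f)

# Reload the factors later, possibly in a different process
with open("als_factors.pkl", "rb") as f:
    users, products = pickle.load(f)

def predict(user, product):
    # An ALS rating estimate is the dot product of the two factor vectors
    return sum(u * p for u, p in zip(users[user], products[product]))

print(round(predict(1, 10), 2))  # 0.5*1.0 + 1.2*0.4 = 0.98
```

This sidesteps the unpicklable JVM-backed model wrapper entirely, at the cost of reimplementing `predict` yourself.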

4

4 Answers

7

One way to save a model (in Scala; but probably similar in Python):

// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("linReg.model")

The saved model can then be loaded with:

val linRegModel = sc.objectFile[LinearRegressionModel]("linReg.model").first()

See also the related question.

For more details, see the reference.
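The same object-file idea translates to Python for models that are plain Python objects: serialize the whole model, then deserialize and use it. (Note that the MLlib Python wrappers in 1.3 delegate to JVM objects and are not directly picklable, which is why this only works for pure-Python models.) A minimal sketch with a hypothetical model class:

```python
import pickle

class LinearModel:
    """Hypothetical stand-in for a model that is a plain Python object."""
    def __init__(self, weights, intercept):
        self.weights = weights
        self.intercept = intercept

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features)) + self.intercept

model = LinearModel([0.5, -0.25], 1.0)

# Persist the whole model object, analogous to saveAsObjectFile in Scala
with open("linReg.model.pkl", "wb") as f:
    pickle.dump(model, f)

# Load it back, analogous to sc.objectFile(...).first()
with open("linReg.model.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict([2.0, 4.0]))  # 0.5*2.0 - 0.25*4.0 + 1.0 = 1.0
```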

Answered 2015-09-19T04:17:41.370
5

As of this pull request merged on Mar 28, 2015 (a day after your question was last edited) this issue has been resolved.

You just need to clone/fetch the latest version from GitHub (git clone git://github.com/apache/spark.git -b branch-1.3) then build it (following the instructions in spark/README.md) with $ mvn -DskipTests clean package.

Note: I ran into trouble building Spark because Maven was being wonky. I resolved that issue by using $ update-alternatives --config mvn and selecting the 'path' that had Priority: 150, whatever that means. Explanation here.

Answered 2015-03-31T13:26:11.707
2

I ran into this too; it looks like a bug. I reported it to the Spark JIRA.

Answered 2015-03-27T13:52:00.327
2

Train the model with a Pipeline in Spark ML, then use MLWriter and MLReader to save the model and read it back.

from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel

# pipeTrain is the fitted PipelineModel returned by Pipeline.fit()
pipeTrain.write().overwrite().save(outpath)
model_in = PipelineModel.load(outpath)
Answered 2017-10-13T17:01:11.363