python - Serializing an object in __main__ with pickle or dill

Question

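I have a pickling problem. I want to serialize a function in my main script, then load it and run it in another script. To demonstrate this, I made 2 scripts:

Attempt 1: The naive way: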

dill_pickle_script_1.py

import pickle
import time

def my_func(a, b):
    time.sleep(0.1)  # The purpose of this will become evident at the end
    return a+b

if __name__ == '__main__':
    with open('testfile.pkl', 'wb') as f:
        pickle.dump(my_func, f)

dill_pickle_script_2.py

import pickle

if __name__ == '__main__':
    with open('testfile.pkl', 'rb') as f:  # pickle data must be read in binary mode
        func = pickle.load(f)
        assert func(1, 2)==3

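The problem: when I run script 2, I get AttributeError: 'module' object has no attribute 'my_func'. I understand why: when my_func is serialized in script 1, it is recorded as belonging to the __main__ module. dill_pickle_script_2 has no way of knowing that __main__ there referred to the namespace of dill_pickle_script_1, so it cannot find the reference.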
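The failure follows from how pickle stores functions: by reference, as a module name plus an attribute name, not by value. The sketch below (Python 3, standard library only) lists those references for any object. `global_refs` is a hypothetical helper name introduced for illustration; it is not part of pickle.

```python
import pickle
import pickletools


def my_func(a, b):
    return a + b


def global_refs(obj):
    """Return the 'module name' references stored in a protocol-0 pickle of obj."""
    payload = pickle.dumps(obj, protocol=0)
    return [
        arg
        for opcode, arg, _pos in pickletools.genops(payload)
        if opcode.name == "GLOBAL"
    ]


if __name__ == "__main__":
    # When this file is run as a script, my_func is recorded under the
    # module name __main__, which is the name the loading script then
    # tries (and fails) to resolve in its own __main__.
    print(global_refs(my_func))
```

Run as a script, the printed reference should name `__main__` rather than the file's own module name, which is exactly what script 2 trips over.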

Attempt 2: Inserting an absolute import

I worked around this with a small hack: before pickling, I added an absolute import of my_func in dill_pickle_script_1.

dill_pickle_script_1.py

import pickle
import time

def my_func(a, b):
    time.sleep(0.1)
    return a+b

if __name__ == '__main__':
    from dill_pickle_script_1 import my_func  # Added absolute import
    with open('testfile.pkl', 'wb') as f:
        pickle.dump(my_func, f)

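Now it works! However, I would like to avoid having to do this every time. (Also, I want the pickling to be done in some other module that does not know which module my_func came from.)

Attempt 3: dill

I thought the dill package lets you serialize things defined in main and load them elsewhere. So I tried: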

dill_pickle_script_1.py

import dill
import time

def my_func(a, b):
    time.sleep(0.1)
    return a+b

if __name__ == '__main__':
    with open('testfile.pkl', 'wb') as f:
        dill.dump(my_func, f)

dill_pickle_script_2.py

import dill

if __name__ == '__main__':
    with open('testfile.pkl', 'rb') as f:  # pickle data must be read in binary mode
        func = dill.load(f)
        assert func(1, 2)==3

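However, now I have another problem: when running dill_pickle_script_2.py, I get NameError: global name 'time' is not defined. It seems that dill does not realize that my_func references the time module and that it has to be imported on load.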
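The function body only holds the name `time`; the module object itself lives in the defining script's global namespace and is looked up when the function is called. As a rough way to see which imported modules a function depends on, the sketch below inspects its code object. `referenced_modules` is a hypothetical helper written for this illustration (it is not a dill or pickle API), and it only catches modules referenced by a plain global name.

```python
import time
import types


def my_func(a, b):
    time.sleep(0.1)
    return a + b


def referenced_modules(func):
    """List global names used by func that are bound to modules in its globals."""
    return sorted(
        name
        for name in func.__code__.co_names
        if isinstance(func.__globals__.get(name), types.ModuleType)
    )


if __name__ == "__main__":
    # Any module listed here must be available in the function's globals
    # again after loading, or the call raises NameError.
    print(referenced_modules(my_func))
```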

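My question

How can I serialize an object in main and then load it again in another script, so that all the imports the object uses are loaded too, without the nasty little hack from Attempt 2?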

score 3 · Accepted Answer

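OK, I found a solution. It is a horrible but tidy kludge, and it is not guaranteed to work in all cases. Any suggestions for improvement are welcome. The solution uses the following helper functions to replace the main references in the pickle string with absolute module references: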

import os
import pickle
import sys


def pickle_dumps_without_main_refs(obj):
    """
    Yeah this is horrible, but it allows you to pickle an object in the main module so that it can be reloaded in another
    module.
    :param obj: The object to pickle.
    :return: The pickle bytes, with references to __main__ replaced by the script's import path.
    """
    currently_run_file = sys.argv[0]
    module_path = file_path_to_absolute_module(currently_run_file)
    pickle_str = pickle.dumps(obj, protocol=0)  # Protocol 0 stores module names as plain text
    # Hack! Replace on bytes so this also works on Python 3, where dumps() returns bytes.
    pickle_str = pickle_str.replace(b'__main__', module_path.encode())
    return pickle_str


def pickle_dump_without_main_refs(obj, file_obj):
    string = pickle_dumps_without_main_refs(obj)
    file_obj.write(string)


def file_path_to_absolute_module(file_path):
    """
    Given a file path, return an import path.
    :param file_path: A file path.
    :return:
    """
    assert os.path.exists(file_path)
    file_loc, ext = os.path.splitext(file_path)
    assert ext in ('.py', '.pyc')
    directory, module = os.path.split(file_loc)
    module_path = [module]
    while True:
        if os.path.exists(os.path.join(directory, '__init__.py')):
            directory, package = os.path.split(directory)
            module_path.append(package)
        else:
            break
    path = '.'.join(module_path[::-1])
    return path

Now I can simply change dill_pickle_script_1.py to

import time
from artemis.remote.child_processes import pickle_dump_without_main_refs


def my_func(a, b):
    time.sleep(0.1)
    return a+b

if __name__ == '__main__':
    with open('testfile.pkl', 'wb') as f:
        pickle_dump_without_main_refs(my_func, f)

And then dill_pickle_script_2.py works!
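An alternative that avoids editing the pickle bytes is to remap the module name on the loading side. This is a different technique from the helper above, sketched under these assumptions: Python 3, and the loading script knows the import path of the script that wrote the pickle. `MainRemappingUnpickler` and `load_remapping_main` are hypothetical names; `pickle.Unpickler.find_class` is the standard hook they rely on.

```python
import io
import pickle


class MainRemappingUnpickler(pickle.Unpickler):
    """Unpickler that resolves references to __main__ in another module."""

    def __init__(self, file, main_module_name):
        super().__init__(file)
        self._main_module_name = main_module_name

    def find_class(self, module, name):
        # Redirect lookups recorded under __main__ to the module that
        # actually defines the object; other lookups are left unchanged.
        if module == "__main__":
            module = self._main_module_name
        return super().find_class(module, name)


def load_remapping_main(file_obj, main_module_name):
    """Load one pickled object from a binary file, remapping __main__."""
    return MainRemappingUnpickler(file_obj, main_module_name).load()
```

In dill_pickle_script_2.py this would be used roughly as `func = load_remapping_main(f, 'dill_pickle_script_1')` with the file opened in `'rb'` mode. Because the defining module is imported on load, its own imports (such as `time`) are available again, but the writing script must be importable from the loading script.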

score 1 · Answer

You can call dill.dump with recurse=True, or set dill.settings["recurse"] = True. It will capture the closure:

In file A:

import time
import dill

def my_func(a, b):
  time.sleep(0.1)
  return a + b

with open("tmp.pkl", "wb") as f:
  dill.dump(my_func, f, recurse=True)

In file B:

import dill

with open("tmp.pkl", "rb") as f:
  my_func = dill.load(f)
