正如我在这里概述的那样,由于硬件原因,例如使用 ppc64le IBM 架构,我被困在使用旧版本的 pytorch 和 torchvision。
出于这个原因,我在不同计算机、集群和我的个人 Mac 之间发送和接收检查点时遇到问题。我想知道是否有任何方法可以避免这个问题来加载模型?例如,在使用 1.6.x 时,可能会以新旧格式保存模型。当然,对于 1.3.1 到 1.6.x 来说是不可能的,但至少我希望有些东西能奏效。
有什么建议吗?当然,我理想的解决方案是我不必担心它,我总是可以加载和保存我的检查点以及我通常在所有硬件上统一腌制的所有内容。
我得到的第一个错误是 zip jit 错误:
RuntimeError: /home/miranda9/data/f.pt is a zip archive (did you mean to use torch.jit.load()?)
所以我使用了它(和其他泡菜库):
# %%
import torch
from pathlib import Path
def load(path):
import torch
import pickle
import dill
path = str(path)
try:
db = torch.load(path)
f = db['f']
except Exception as e:
db = torch.jit.load(path)
f = db['f']
#with open():
# db = pickle.load(open(path, "r+"))
# db = dill.load(open(path, "r+"))
#raise ValueError(f'FAILED: {e}')
return db, f
p = "~/data/f.pt"
path = Path(p).expanduser()
db, f = load(path)
Din, nb_examples = 1, 5
x = torch.distributions.Normal(loc=0.0, scale=1.0).sample(sample_shape=(nb_examples, Din))
y = f(x)
print(y)
print('Success!\a')
但我抱怨我被迫使用不同的 pytorch 版本:
Traceback (most recent call last):
File "hal_pg.py", line 27, in <module>
db, f = load(path)
File "hal_pg.py", line 16, in load
db = torch.jit.load(path)
File "/home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/jit/__init__.py", line 239, in load
cpp_module = torch._C.import_ir_module(cu, f, map_location, _extra_files)
RuntimeError: version_number <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /opt/anaconda/conda-bld/pytorch-base_1581395437985/work/caffe2/serialize/inline_container.cc:131, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 1. Your PyTorch installation may be too old. (init at /opt/anaconda/conda-bld/pytorch-base_1581395437985/work/caffe2/serialize/inline_container.cc:131)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbc (0x7fff7b527b9c in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1d98 (0x7fff1d293c78 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x88 (0x7fff1d2950d8 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: torch::jit::import_ir_module(std::shared_ptr<torch::jit::script::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x64 (0x7fff1e624664 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #4: <unknown function> + 0x70e210 (0x7fff7c0ae210 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x28efc4 (0x7fff7bc2efc4 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #26: <unknown function> + 0x25280 (0x7fff84b35280 in /lib64/libc.so.6)
frame #27: __libc_start_main + 0xc4 (0x7fff84b35474 in /lib64/libc.so.6)
任何想法如何使集群中的所有内容保持一致?我什至无法打开泡菜文件。
也许这对于我被迫使用的当前 pytorch 版本来说是不可能的:(
RuntimeError: version_number <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /opt/anaconda/conda-bld/pytorch-base_1581395437985/work/caffe2/serialize/inline_container.cc:131, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 1. Your PyTorch installation may be too old. (init at /opt/anaconda/conda-bld/pytorch-base_1581395437985/work/caffe2/serialize/inline_container.cc:131)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbc (0x7fff83ba7b9c in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1d98 (0x7fff25993c78 in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x88 (0x7fff259950d8 in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: torch::jit::import_ir_module(std::shared_ptr<torch::jit::script::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x64 (0x7fff26d24664 in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #4: <unknown function> + 0x70e210 (0x7fff8472e210 in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x28efc4 (0x7fff842aefc4 in /home/miranda9/.conda/envs/automl-meta-learning_wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: <unknown function> + 0x25280 (0x7fff8d335280 in /lib64/libc.so.6)
frame #24: __libc_start_main + 0xc4 (0x7fff8d335474 in /lib64/libc.so.6)
使用代码:
from pathlib import Path
import torch
path = '/home/miranda9/data/dataset/'
path = Path(path).expanduser() / 'fi_db.pt'
path = str(path)
# db = torch.load(path)
# torch.jit.load(path)
db = torch.jit.load(str(path))
print(db)
相关链接:
- 如何使用 ppc64le 和 x86 跨不同版本的 pytorch(1.3.1 和 1.6.x)加载检查点?
- https://discuss.pytorch.org/t/how-to-load-checkpoints-across-different-versions-of-pytorch-1-3-1-and-1-6-x-using-ppc64le-and- x86/97829
- 相关的gitissue:https ://github.com/pytorch/pytorch/issues/43766
- reddit:https ://www.reddit.com/r/pytorch/comments/jvza7v/how_to_load_checkpoints_across_different_versions/