I'm hitting a subtle memory leak whose source I haven't been able to pin down with tracemalloc. I'm running the code below in Google Colab; it tunes the hyperparameters of a custom PPO agent. The leak also shows up at varying speeds: sometimes within 10-20 minutes / 5-10 iterations of starting the runtime, other times only after ~50 iterations / several hours. Here is a Colab notebook with the complete code.

import os
import tracemalloc
from time import perf_counter

import numpy as np
import optuna
import pandas as pd
import xagents
from tensorflow.keras.optimizers import Adam
from xagents import PPO
from xagents.utils.common import ModelReader, create_envs, write_from_dict


def get_hparams(trial):
    # Hyperparameter search space, sampled once per Optuna trial.
    return {
        'n_steps': int(
            trial.suggest_categorical('n_steps', [2 ** i for i in range(2, 11)])
        ),
        'gamma': trial.suggest_loguniform('gamma', 0.9, 0.9999),
        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-5, 1e-2),
        'epsilon': trial.suggest_loguniform('epsilon', 1e-7, 1e-1),
        'beta_1': trial.suggest_loguniform('beta_1', 0.01, 0.999),
        'beta_2': trial.suggest_loguniform('beta_2', 0.01, 0.999),
        'entropy_coef': trial.suggest_loguniform('entropy_coef', 1e-8, 2e-1),
        'n_envs': int(
            trial.suggest_categorical('n_envs', [2 ** i for i in range(4, 7)])
        ),
        'grad_norm': trial.suggest_uniform('grad_norm', 0.1, 10.0),
        'lam': trial.suggest_loguniform('lam', 0.65, 0.99),
        'advantage_epsilon': trial.suggest_loguniform('advantage_epsilon', 1e-8, 1e-1),
        'clip_norm': trial.suggest_loguniform('clip_norm', 0.01, 10),
    }


def optimize_agent(trial, seed=55):
    hparams = get_hparams(trial)
    # A fresh set of vectorized Atari envs, optimizer, model and agent is built for every trial.
    envs = create_envs('BreakoutNoFrameskip-v4', hparams['n_envs'])
    model_cfg = xagents.agents['ppo']['model']['cnn'][0]
    optimizer = Adam(
        hparams['learning_rate'],
        epsilon=hparams['epsilon'],
        beta_1=hparams['beta_1'],
        beta_2=hparams['beta_2'],
    )
    model = ModelReader(
        model_cfg,
        [envs[0].action_space.n, 1],
        envs[0].observation_space.shape,
        optimizer,
        seed=seed,
    ).build_model()
    agent = PPO(
        envs,
        model,
        entropy_coef=hparams['entropy_coef'],
        grad_norm=hparams['grad_norm'],
        gamma=hparams['gamma'],
        n_steps=hparams['n_steps'],
        seed=seed,
        lam=hparams['lam'],
        advantage_epsilon=hparams['advantage_epsilon'],
        clip_norm=hparams['clip_norm'],
    )
    steps = 150000
    agent.fit(max_steps=steps)
    current_rewards = np.around(np.mean(agent.total_rewards), 2)
    if not np.isfinite(current_rewards):
        current_rewards = 0
    hparams[f'mean_after_{steps}_steps'] = current_rewards
    write_from_dict(
        {key: [val] for (key, val) in hparams.items()}, 'ppo-optuna.parquet'
    )
    trial.report(current_rewards, 1000)
    # Dump a tracemalloc snapshot to disk after every trial.
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')
    frame = pd.DataFrame(
        [
            {'traceback': item.traceback, 'size': item.size, 'count': item.count}
            for item in top_stats
        ]
    )
    frame.to_csv(f'memory-snapshots/snapshot-{perf_counter()}.csv', index=False)
    return current_rewards
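
For completeness, this is the kind of explicit per-trial cleanup that could be appended to optimize_agent in case Keras keeps global references to the per-trial models alive. It is only a sketch using the standard tf.keras.backend.clear_session() and gc APIs; it is not part of the notebook above.

import gc

import tensorflow as tf


def release_trial_resources(envs):
    """Sketch: call right before `return current_rewards` in optimize_agent."""
    for env in envs:
        env.close()                   # release the Atari emulator handles
    tf.keras.backend.clear_session()  # drop Keras' global graph/layer state
    gc.collect()                      # collect whatever reference cycles remain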

This is what I run:

os.mkdir('memory-snapshots')
tracemalloc.start()  # tracing has to be enabled before take_snapshot() is called in optimize_agent
study = optuna.create_study(
    study_name='ppo-trials150',
    load_if_exists=True,
    storage="sqlite:///example-ppo.db",
    direction='maximize',
)
study.optimize(
    optimize_agent, n_trials=1000, show_progress_bar=True, gc_after_trial=True
)
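
For what it's worth, per-trial process memory can also be tracked with an Optuna trial callback. A minimal sketch (psutil ships with Colab, and study.optimize accepts a callbacks argument):

import psutil


def log_rss(study, trial):
    # Called by Optuna after each finished trial.
    rss_gib = psutil.Process().memory_info().rss / 2 ** 30
    print(f'trial {trial.number}: process RSS = {rss_gib:.2f} GiB')


study.optimize(
    optimize_agent,
    n_trials=1000,
    show_progress_bar=True,
    gc_after_trial=True,
    callbacks=[log_rss],
)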

This makes memory balloon quickly until the session crashes.

(screenshot: memory usage right before the crash)

If I stop the cell before it crashes, the memory problem persists until the runtime is restarted.

(screenshot: the inflated memory usage persists after stopping the cell)
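
The inflated usage is easy to confirm from a fresh cell after interrupting the study, e.g. with a quick check along these lines (psutil is preinstalled on Colab):

import psutil

vm = psutil.virtual_memory()
# Matches the Colab RAM indicator: `used` stays high until the runtime is restarted.
print(f'system RAM: {vm.used / 2 ** 30:.2f} GiB used of {vm.total / 2 ** 30:.2f} GiB')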

Below are the memory snapshots from the last 15 consecutive iterations before the crash; they don't show any obvious red flags. Moreover, summing the size column of the most recent snapshot gives a total of 967249710 bytes ≈ 1 GB, which is strange given that the available RAM on Colab is ≈ 12 GB. Here are the top 23 tracebacks:

(screenshot: top tracebacks from the latest tracemalloc snapshot)
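
For reference, the ≈ 1 GB total quoted above is just the sum of the size column of the most recent snapshot CSV, computed roughly like this (the glob pattern matches the files written by optimize_agent):

import glob
import os

import pandas as pd

# Most recently written snapshot in the memory-snapshots directory.
latest = max(glob.glob('memory-snapshots/snapshot-*.csv'), key=os.path.getmtime)
frame = pd.read_csv(latest)
print(frame['size'].sum())  # ~967249710 bytes for the snapshot taken right before the crash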

Here is one of the crash logs:

Timestamp                   Level   Message
Jul 13, 2021, 9:58:11 AM    WARNING WARNING:root:kernel 834cf1f9-a397-4369-b54a-0cf6da2e980f restarted
Jul 13, 2021, 9:58:11 AM    INFO    KernelRestarter: restarting kernel (1/5), keep random ports
Jul 13, 2021, 9:22:48 AM    WARNING tcmalloc: large alloc 1849688064 bytes == 0x55831c4f8000 @ 0x7fdc571471e7 0x7fdc4e5bd46e 0x7fdc4e60dc7b 0x7fdc4e610e83 0x7fdc4e61107b 0x7fdc4e6b2761 0x558257855d54 0x558257855a50 0x5582578ca105 0x5582578c44ae 0x5582578573ea 0x5582578c97f0 0x5582578c44ae 0x5582578573ea 0x5582578c53b5 0x5582578c47ad 0x558257857a81 0x558257857ea1 0x5582578c6bb5 0x5582578c47ad 0x558257796eb1 0x5582578c6bb5 0x5582578c47ad 0x558257857a81 0x55825789afd9 0x558257857ea1 0x7fdbefb83954 0x7fdbefb873ba 0x7fdbeaf6eeb4 0x7fdbeaf631fe 0x7fdbf042b441
Jul 13, 2021, 9:21:30 AM    WARNING tcmalloc: large alloc 1849688064 bytes == 0x5583a976a000 @ 0x7fdc571471e7 0x7fdc4e5bd46e 0x7fdc4e60dc7b 0x7fdc4e60dd97 0x7fdc4e6074a5 0x7fdc4e6d829c 0x7fdc4e6a5dd1 0x558257797338 0x5582578cb1ba 0x5582578c44ae 0x5582578573ea 0x5582578c97f0 0x5582578c47ad 0x558257857a81 0x558257857ea1 0x5582578c6bb5 0x5582578c47ad 0x558257796eb1 0x5582578c6bb5 0x5582578c47ad 0x558257857a81 0x55825789afd9 0x558257857ea1 0x7fdbefb83954 0x7fdbefb873ba 0x7fdbeaf6eeb4 0x7fdbeaf631fe 0x7fdbf042b441 0x7fdbf0428133 0x7fdbeb6aad75 0x7fdc56efc6db
Jul 13, 2021, 9:21:27 AM    WARNING tcmalloc: large alloc 1849688064 bytes == 0x55831c4f8000 @ 0x7fdc571471e7 0x7fdc4e5bd46e 0x7fdc4e60dc7b 0x7fdc4e610e83 0x7fdc4e61107b 0x7fdc4e6b2761 0x558257855d54 0x558257855a50 0x5582578ca105 0x5582578c44ae 0x5582578573ea 0x5582578c97f0 0x5582578c44ae 0x5582578573ea 0x5582578c53b5 0x5582578c47ad 0x558257857a81 0x558257857ea1 0x5582578c6bb5 0x5582578c47ad 0x558257796eb1 0x5582578c6bb5 0x5582578c47ad 0x558257857a81 0x55825789afd9 0x558257857ea1 0x7fdbefb83954 0x7fdbefb873ba 0x7fdbeaf6eeb4 0x7fdbeaf631fe 0x7fdbf042b441
Jul 13, 2021, 9:20:08 AM    WARNING 2021-07-13 07:20:08.307941: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:08 AM    WARNING 2021-07-13 07:20:08.307851: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:08 AM    WARNING 2021-07-13 07:20:08.221051: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:08 AM    WARNING 2021-07-13 07:20:08.220943: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:06 AM    WARNING 2021-07-13 07:20:06.646798: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:06 AM    WARNING 2021-07-13 07:20:06.646709: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:06 AM    WARNING 2021-07-13 07:20:06.122847: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:06 AM    WARNING 2021-07-13 07:20:06.122724: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:03 AM    WARNING tcmalloc: large alloc 1849688064 bytes == 0x5583a976a000 @ 0x7fdc571471e7 0x7fdc4e5bd46e 0x7fdc4e60dc7b 0x7fdc4e60dd97 0x7fdc4e6074a5 0x7fdc4e6d829c 0x7fdc4e6a5dd1 0x558257797338 0x5582578cb1ba 0x5582578c44ae 0x5582578573ea 0x5582578c97f0 0x5582578c47ad 0x558257857a81 0x558257857ea1 0x5582578c6bb5 0x5582578c47ad 0x558257796eb1 0x5582578c6bb5 0x5582578c47ad 0x558257857a81 0x55825789afd9 0x558257857ea1 0x7fdbefb83954 0x7fdbefb873ba 0x7fdbeaf6eeb4 0x7fdbeaf631fe 0x7fdbf042b441 0x7fdbf0428133 0x7fdbeb6aad75 0x7fdc56efc6db
Jul 13, 2021, 9:20:01 AM    WARNING tcmalloc: large alloc 1849688064 bytes == 0x55833a870000 @ 0x7fdc571471e7 0x7fdc4e5bd46e 0x7fdc4e60dc7b 0x7fdc4e610e83 0x7fdc4e61107b 0x7fdc4e6b2761 0x558257855d54 0x558257855a50 0x5582578ca105 0x5582578c44ae 0x5582578573ea 0x5582578c97f0 0x5582578c44ae 0x5582578573ea 0x5582578c53b5 0x5582578c47ad 0x558257857a81 0x558257857ea1 0x5582578c6bb5 0x5582578c47ad 0x558257796eb1 0x5582578c6bb5 0x5582578c47ad 0x558257857a81 0x55825789afd9 0x558257857ea1 0x7fdbefb83954 0x7fdbefb873ba 0x7fdbeaf6eeb4 0x7fdbeaf631fe 0x7fdbf042b441
Jul 13, 2021, 9:18:49 AM    WARNING 2021-07-13 07:18:49.736991: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
Jul 13, 2021, 9:18:47 AM    WARNING 2021-07-13 07:18:47.153482: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
Jul 13, 2021, 9:18:23 AM    WARNING 2021-07-13 07:18:23.372605: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8004
Jul 13, 2021, 9:18:21 AM    WARNING 2021-07-13 07:18:21.045353: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
Jul 13, 2021, 9:18:18 AM    WARNING 2021-07-13 07:18:18.540742: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2199995000 Hz
Jul 13, 2021, 9:18:18 AM    WARNING 2021-07-13 07:18:18.470990: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
Jul 13, 2021, 9:17:53 AM    WARNING 2021-07-13 07:17:53.026567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13837 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
Jul 13, 2021, 9:17:53 AM    WARNING 2021-07-13 07:17:53.026490: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
Jul 13, 2021, 9:17:53 AM    WARNING 2021-07-13 07:17:53.025754: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:53 AM    WARNING 2021-07-13 07:17:53.024324: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:53 AM    WARNING 2021-07-13 07:17:53.022506: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:53 AM    WARNING 2021-07-13 07:17:53.021712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
Jul 13, 2021, 9:17:53 AM    WARNING 2021-07-13 07:17:53.021041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
Jul 13, 2021, 9:17:53 AM    WARNING 2021-07-13 07:17:53.019732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.878647: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.875397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.874569: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.873445: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM    WARNING coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
Jul 13, 2021, 9:17:47 AM    WARNING pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.873293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.872460: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.871015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.867034: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.865768: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.865559: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.860081: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.843825: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.552444: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.509098: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.374421: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.374270: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.232497: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Jul 13, 2021, 9:17:47 AM    WARNING coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
Jul 13, 2021, 9:17:47 AM    WARNING pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.232421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.231418: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM    WARNING 2021-07-13 07:17:47.168074: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
Jul 13, 2021, 9:17:31 AM    WARNING 2021-07-13 07:17:31.798739: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Jul 13, 2021, 9:13:58 AM    INFO    Uploading file to /content/xagents.tar.gz
Jul 13, 2021, 9:13:33 AM    INFO    Adapting to protocol v5.1 for kernel 834cf1f9-a397-4369-b54a-0cf6da2e980f
Jul 13, 2021, 9:13:32 AM    INFO    Kernel started: 834cf1f9-a397-4369-b54a-0cf6da2e980f
Jul 13, 2021, 9:13:24 AM    INFO    Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Jul 13, 2021, 9:13:24 AM    INFO    http://172.28.0.12:9000/
Jul 13, 2021, 9:13:24 AM    INFO    The Jupyter Notebook is running at:
Jul 13, 2021, 9:13:24 AM    INFO    0 active kernels
Jul 13, 2021, 9:13:24 AM    INFO    Serving notebooks from local directory: /
Jul 13, 2021, 9:13:24 AM    INFO    google.colab serverextension initialized.
Jul 13, 2021, 9:13:24 AM    INFO    Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Jul 13, 2021, 9:13:24 AM    INFO    http://172.28.0.2:9000/
Jul 13, 2021, 9:13:24 AM    INFO    The Jupyter Notebook is running at:
Jul 13, 2021, 9:13:24 AM    INFO    0 active kernels
Jul 13, 2021, 9:13:24 AM    INFO    Serving notebooks from local directory: /
Jul 13, 2021, 9:13:24 AM    INFO    google.colab serverextension initialized.
Jul 13, 2021, 9:13:24 AM    INFO    Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
Jul 13, 2021, 9:13:24 AM    INFO    Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret