tensorflow - StyleGAN2-ada: the runtime keeps running, but the epoch stays at zero

Question

I'm running this in a Colab runtime, but even after waiting more than 5 hours the epoch never increases. What should I check?

!nvidia-smi
> NVIDIA-SMI 455.32.00
Driver Version: 418.67
CUDA Version: 10.1
Tesla V100-SXM2...
24W / 300W |0MiB / **16130MiB** |      0%      Default |

%cat /proc/meminfo | grep MemTotal
> MemTotal:       **26751732 kB**

%cat /proc/sys/vm/overcommit_memory
> 1

%tensorflow_version 1.x
> **TensorFlow 1.x selected.**

from google.colab import drive
drive.mount('/content/drive')
> Mounted at /content/drive

%cd '/content/drive/My Drive/colab-sg2-ada/stylegan2-ada'
> /content/drive/My Drive/colab-sg2-ada/stylegan2-ada

zip_path = '/content/drive/My\ Drive/glitchv2.zip'
!unzip {zip_path} -d /content/
> unzip~~~~~


The dataset is 1200 images, each **256x256** pixels.


dataset_path = '/content/dog'
dataset_name = 'glitchv2'

!python dataset_tool.py create_from_images ./datasets/{dataset_name} {dataset_path}
> Loading images from "/content/dog"
  Creating dataset "./datasets/glitchv2"
  Added 1200 images.

snapshot_count = 4
augs = 'bg'

!python train.py --outdir ./results --cfg=11gb-gpu --snap={snapshot_count} --data=./datasets/{dataset_name} 
>
tcmalloc: large alloc 4294967296 bytes == 0x885c000 @ ~~~
tcmalloc: large alloc 4294967296 bytes == 0x885c000 @ ~~~
tcmalloc: large alloc 4294967296 bytes == 0x885c000 @ ~~~

Training options:
{
  "G_args": {
    "func_name": "training.networks.G_main",
    "fmap_base": 16384,
    "fmap_max": 512,
    "mapping_layers": 8,
    "num_fp16_res": 4,
    "conv_clamp": 256
  },
  "D_args": {
    "func_name": "training.networks.D_main",
    "mbstd_group_size": 4,
    "fmap_base": 16384,
    "fmap_max": 512,
    "num_fp16_res": 4,
    "conv_clamp": 256
  },
  "G_opt_args": {
    "beta1": 0.0,
    "beta2": 0.99,
    "learning_rate": 0.002
  },
  "D_opt_args": {
    "beta1": 0.0,
    "beta2": 0.99,
    "learning_rate": 0.002
  },
  "loss_args": {
    "func_name": "training.loss.stylegan2",
    "r1_gamma": 10
  },
  "augment_args": {
    "class_name": "training.augment.AdaptiveAugment",
    "tune_heuristic": "rt",
    "tune_target": 0.6,
    "apply_func": "training.augment.augment_pipeline",
    "apply_args": {
      "xflip": 1,
      "rotate90": 1,
      "xint": 1,
      "scale": 1,
      "rotate": 1,
      "aniso": 1,
      "xfrac": 1,
      "brightness": 1,
      "contrast": 1,
      "lumaflip": 1,
      "hue": 1,
      "saturation": 1
    }
  },
  "num_gpus": 1,
  "image_snapshot_ticks": 4,
  "network_snapshot_ticks": 4,
  "train_dataset_args": {
    "path": "./datasets/glitchv2",
    "max_label_size": 0,
    "use_raw": false,
    "resolution": 256,
    "mirror_augment": false,
    "mirror_augment_v": false
  },
  "metric_arg_list": [
    {
      "name": "fid50k_full",
      "class_name": "metrics.frechet_inception_distance.FID",
      "max_reals": null,
      "num_fakes": 50000,
      "minibatch_per_gpu": 8,
      "force_dataset_args": {
        "shuffle": false,
        "max_images": null,
        "repeat": false,
        "mirror_augment": false
      }
    }
  ],
  "metric_dataset_args": {
    "path": "./datasets/glitchv2",
    "max_label_size": 0,
    "use_raw": false,
    "resolution": 256,
    "mirror_augment": false,
    "mirror_augment_v": false
  },
  "total_kimg": 25000,
  "minibatch_size": 4,
  "minibatch_gpu": 4,
  "G_smoothing_kimg": 10,
  "G_smoothing_rampup": null,
  "run_dir": "./results/00001-glitchv2-11gb-gpu"
}

Output directory:  ./results/00001-glitchv2-11gb-gpu
Training data:     ./datasets/glitchv2
Training length:   25000 kimg
Resolution:        256
Number of GPUs:    1

Creating output directory...
Loading training set...

tcmalloc: large alloc 4294967296 bytes == 0x7f97addd0000 @  0x7f9b908a6001 ~~~
tcmalloc: large alloc 4294967296 bytes == 0x7f96ad5d0000 @  0x7f9b908a41e7 ~~~
tcmalloc: large alloc 4294967296 bytes == 0x7f96ad5d0000 @  0x7f9b908a41e7 ~~~
Image shape: [3, 256, 256]
Label shape: [0]

Constructing networks...
Setting up TensorFlow plugin "fused_bias_act.cu": Compiling... Loading... Done.
Setting up TensorFlow plugin "upfirdn_2d.cu": Compiling... Loading... Done.

G                             Params    OutputShape         WeightShape
---                           ---       ---                 ---
latents_in                    -         (?, 512)            -
labels_in                     -         (?, 0)              -
G_mapping/Normalize           -         (?, 512)            -
G_mapping/Dense0              262656    (?, 512)            (512, 512)
G_mapping/Dense1              262656    (?, 512)            (512, 512)
G_mapping/Dense2              262656    (?, 512)            (512, 512)
G_mapping/Dense3              262656    (?, 512)            (512, 512)
G_mapping/Dense4              262656    (?, 512)            (512, 512)
G_mapping/Dense5              262656    (?, 512)            (512, 512)
G_mapping/Dense6              262656    (?, 512)            (512, 512)
G_mapping/Dense7              262656    (?, 512)            (512, 512)
G_mapping/Broadcast           -         (?, 14, 512)        -
dlatent_avg                   -         (512,)              -
Truncation/Lerp               -         (?, 14, 512)        -
G_synthesis/4x4/Const         8192      (?, 512, 4, 4)      (1, 512, 4, 4)  
G_synthesis/4x4/Conv          2622465   (?, 512, 4, 4)      (3, 3, 512, 512)
G_synthesis/4x4/ToRGB         264195    (?, 3, 4, 4)        (1, 1, 512, 3)  
G_synthesis/8x8/Conv0_up      2622465   (?, 512, 8, 8)      (3, 3, 512, 512)
G_synthesis/8x8/Conv1         2622465   (?, 512, 8, 8)      (3, 3, 512, 512)
G_synthesis/8x8/Upsample      -         (?, 3, 8, 8)        -               
G_synthesis/8x8/ToRGB         264195    (?, 3, 8, 8)        (1, 1, 512, 3)  
G_synthesis/16x16/Conv0_up    2622465   (?, 512, 16, 16)    (3, 3, 512, 512)
G_synthesis/16x16/Conv1       2622465   (?, 512, 16, 16)    (3, 3, 512, 512)
G_synthesis/16x16/Upsample    -         (?, 3, 16, 16)      -               
G_synthesis/16x16/ToRGB       264195    (?, 3, 16, 16)      (1, 1, 512, 3)  
G_synthesis/32x32/Conv0_up    2622465   (?, 512, 32, 32)    (3, 3, 512, 512)
G_synthesis/32x32/Conv1       2622465   (?, 512, 32, 32)    (3, 3, 512, 512)
G_synthesis/32x32/Upsample    -         (?, 3, 32, 32)      -               
G_synthesis/32x32/ToRGB       264195    (?, 3, 32, 32)      (1, 1, 512, 3)  
G_synthesis/64x64/Conv0_up    2622465   (?, 512, 64, 64)    (3, 3, 512, 512)
G_synthesis/64x64/Conv1       2622465   (?, 512, 64, 64)    (3, 3, 512, 512)
G_synthesis/64x64/Upsample    -         (?, 3, 64, 64)      -               
G_synthesis/64x64/ToRGB       264195    (?, 3, 64, 64)      (1, 1, 512, 3)  
G_synthesis/128x128/Conv0_up  1442561   (?, 256, 128, 128)  (3, 3, 512, 256)
G_synthesis/128x128/Conv1     721409    (?, 256, 128, 128)  (3, 3, 256, 256)
G_synthesis/128x128/Upsample  -         (?, 3, 128, 128)    -               
G_synthesis/128x128/ToRGB     132099    (?, 3, 128, 128)    (1, 1, 256, 3)  
G_synthesis/256x256/Conv0_up  426369    (?, 128, 256, 256)  (3, 3, 256, 128)
G_synthesis/256x256/Conv1     213249    (?, 128, 256, 256)  (3, 3, 128, 128)
G_synthesis/256x256/Upsample  -         (?, 3, 256, 256)    -               
G_synthesis/256x256/ToRGB     66051     (?, 3, 256, 256)    (1, 1, 128, 3)  
---                           ---       ---                 ---             
Total                         30034338                                      


D                    Params    OutputShape         WeightShape     
---                  ---       ---                 ---             
images_in            -         (?, 3, 256, 256)    -               
labels_in            -         (?, 0)              -               
256x256/FromRGB      512       (?, 128, 256, 256)  (1, 1, 3, 128)  
256x256/Conv0        147584    (?, 128, 256, 256)  (3, 3, 128, 128)
256x256/Conv1_down   295168    (?, 256, 128, 128)  (3, 3, 128, 256)
256x256/Skip         32768     (?, 256, 128, 128)  (1, 1, 128, 256)
128x128/Conv0        590080    (?, 256, 128, 128)  (3, 3, 256, 256)
128x128/Conv1_down   1180160   (?, 512, 64, 64)    (3, 3, 256, 512)
128x128/Skip         131072    (?, 512, 64, 64)    (1, 1, 256, 512)
64x64/Conv0          2359808   (?, 512, 64, 64)    (3, 3, 512, 512)
64x64/Conv1_down     2359808   (?, 512, 32, 32)    (3, 3, 512, 512)
64x64/Skip           262144    (?, 512, 32, 32)    (1, 1, 512, 512)
32x32/Conv0          2359808   (?, 512, 32, 32)    (3, 3, 512, 512)
32x32/Conv1_down     2359808   (?, 512, 16, 16)    (3, 3, 512, 512)
32x32/Skip           262144    (?, 512, 16, 16)    (1, 1, 512, 512)
16x16/Conv0          2359808   (?, 512, 16, 16)    (3, 3, 512, 512)
16x16/Conv1_down     2359808   (?, 512, 8, 8)      (3, 3, 512, 512)
16x16/Skip           262144    (?, 512, 8, 8)      (1, 1, 512, 512)
8x8/Conv0            2359808   (?, 512, 8, 8)      (3, 3, 512, 512)
8x8/Conv1_down       2359808   (?, 512, 4, 4)      (3, 3, 512, 512)
8x8/Skip             262144    (?, 512, 4, 4)      (1, 1, 512, 512)
4x4/MinibatchStddev  -         (?, 513, 4, 4)      -               
4x4/Conv             2364416   (?, 512, 4, 4)      (3, 3, 513, 512)
4x4/Dense0           4194816   (?, 512)            (8192, 512)     
Output               513       (?, 1)              (512, 1)        
---                  ---       ---                 ---             
Total                28864129                                      

Exporting sample images...
Replicating networks across 1 GPUs...
Initializing augmentations...
Setting up optimizers...
Constructing training graph...
Finalizing training ops...
Initializing metrics...
Training for 25000 kimg...

tick 0     kimg 0.0      time 1m 56s       sec/tick 15.3    sec/kimg 954.12  maintenance 100.9  gpumem 5.7   augment 0.000 
Evaluating metrics...

Downloading https://nvlabs-fi-cdn.nvidia.com/stylegan2ada/pretrained/metrics/inception_v3_features.pkl ... done
Calculating real image statistics for fid50k_full...
tcmalloc: large alloc 4294967296 bytes == 0x7f93471cc000 @  0x7f9b908a6001 ~~~
tcmalloc: large alloc 4294967296 bytes == 0x7f92471cc000 @  0x7f9b908a41e7 ~~~
tcmalloc: large alloc 4294967296 bytes == 0x7f92471cc000 @  0x7f9b908a41e7 ~~~

The results folder contains only the first result (000000), and nothing has happened after 5 hours. The runtime keeps running, though.

score 0 · Accepted Answer

As far as I can see, it is stuck computing metrics.

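The problem is that you are evaluating metrics during training. Keep in mind that metrics often take more time than the training itself. Computing fid50k might be manageable, but you are computing fid50k_full, which can take a very long time.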
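For a rough sense of the workload, the metric settings printed in the question's training options ("num_fakes": 50000, "minibatch_per_gpu": 8) and the 1200-image dataset imply the batch counts below. This is only back-of-the-envelope arithmetic: it assumes one forward pass per batch and that "max_reals": null means every real image is used, and it says nothing about how long each pass takes on a given GPU.

```python
import math

# Values copied from the training options and the dataset log in the question.
num_fakes = 50_000       # "num_fakes" for fid50k_full
num_reals = 1_200        # "Added 1200 images."; assumed all are used (max_reals is null)
minibatch_per_gpu = 8    # "minibatch_per_gpu" for the metric

# Number of batches needed to cover each image set (assumption: one pass per batch).
fake_batches = math.ceil(num_fakes / minibatch_per_gpu)
real_batches = math.ceil(num_reals / minibatch_per_gpu)

print(f"batches for generated images: {fake_batches}")
print(f"batches for real images:      {real_batches}")
```

The generated-image side dominates, and by default that cost is paid again at every snapshot where metrics are evaluated.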

Try disabling metric evaluation entirely with the argument --metrics=none
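As a minimal sketch, the training command from the question could be rebuilt with that flag appended. The variable values are the ones shown in the question; that the train.py in use accepts --metrics=none is taken from this answer rather than verified against the specific fork that provides the 11gb-gpu config.

```python
import shlex

dataset_name = "glitchv2"   # same value as in the question
snapshot_count = 4          # same value as in the question

# Same arguments as the original command, plus --metrics=none.
cmd = [
    "python", "train.py",
    "--outdir", "./results",
    "--cfg=11gb-gpu",
    f"--snap={snapshot_count}",
    f"--data=./datasets/{dataset_name}",
    "--metrics=none",
]

# Print a copy-pasteable line; in a Colab cell, prefix it with "!" to run it.
print(shlex.join(cmd))
```

With metrics disabled, the log should move on past tick 0 instead of stopping at "Evaluating metrics...". FID can still be computed separately afterwards on a saved snapshot if it is needed.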
