
I am trying to train a DQNAgent from keras-rl on Gradius through gym-retro, but it is not going well: the reward does not increase, and the loss keeps growing. I don't understand what is wrong.

Part of the output is shown below.

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 32, 30, 28)        8224      
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 16, 15, 64)        28736     
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 16, 15, 64)        36928     
_________________________________________________________________
flatten_1 (Flatten)          (None, 15360)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               3932416   
_________________________________________________________________
dense_2 (Dense)              (None, 36)                9252      
=================================================================
Total params: 4,015,556
Trainable params: 4,015,556
Non-trainable params: 0
_________________________________________________________________
None
Training for 1500000 steps ...
    2339/1500000: episode: 1, duration: 47.685s, episode steps: 2339, steps per second: 49, episode reward: 2500.000, mean reward: 1.069 [0.000, 100.000], mean action: 19.122 [0.000, 35.000], mean observation: 0.029 [0.000, 0.980], loss: 36.018083, mean_absolute_error: 11.380395, mean_q: 18.252860
    3936/1500000: episode: 2, duration: 51.391s, episode steps: 1597, steps per second: 31, episode reward: 1800.000, mean reward: 1.127 [0.000, 100.000], mean action: 19.312 [0.000, 35.000], mean observation: 0.027 [0.000, 0.980], loss: 64.386497, mean_absolute_error: 54.420486, mean_q: 68.424599
    6253/1500000: episode: 3, duration: 75.020s, episode steps: 2317, steps per second: 31, episode reward: 3500.000, mean reward: 1.511 [0.000, 100.000], mean action: 16.931 [0.000, 35.000], mean observation: 0.029 [0.000, 0.980], loss: 177.966461, mean_absolute_error: 153.478119, mean_q: 177.061630

#(snip)

 1493035/1500000: episode: 525, duration: 95.634s, episode steps: 2823, steps per second: 30, episode reward: 5100.000, mean reward: 1.807 [0.000, 500.000], mean action: 19.664 [0.000, 35.000], mean observation: 0.034 [0.000, 0.980], loss: 26501204410368.000000, mean_absolute_error: 86211024.000000, mean_q: 90254256.000000
 1495350/1500000: episode: 526, duration: 78.401s, episode steps: 2315, steps per second: 30, episode reward: 2500.000, mean reward: 1.080 [0.000, 100.000], mean action: 18.652 [0.000, 34.000], mean observation: 0.029 [0.000, 0.980], loss: 23247718449152.000000, mean_absolute_error: 84441184.000000, mean_q: 88424568.000000
 1497839/1500000: episode: 527, duration: 84.667s, episode steps: 2489, steps per second: 29, episode reward: 3700.000, mean reward: 1.487 [0.000, 500.000], mean action: 21.676 [0.000, 35.000], mean observation: 0.034 [0.000, 0.980], loss: 23432217493504.000000, mean_absolute_error: 80286264.000000, mean_q: 83946064.000000
done, took 49517.509 seconds
end!

The program runs on my university's server, which I access over SSH.

The output of pip freeze is as follows.

absl-py==0.7.1
alembic==1.0.10
asn1crypto==0.24.0
astor==0.8.0
async-generator==1.10
attrs==19.1.0
backcall==0.1.0
bleach==3.1.0
certifi==2019.3.9
certipy==0.1.3
cffi==1.12.3
chardet==3.0.4
cloudpickle==1.2.1
cryptography==2.6.1
cycler==0.10.0
decorator==4.4.0
defusedxml==0.6.0
EasyProcess==0.2.7
entrypoints==0.3
future==0.17.1
gast==0.2.2
google-pasta==0.1.7
grpcio==1.21.1
gym==0.13.0
gym-retro==0.7.0
h5py==2.9.0
idna==2.8
ipykernel==5.1.0
ipython==7.5.0
ipython-genutils==0.2.0
jedi==0.13.3
Jinja2==2.10.1
jsonschema==3.0.1
jupyter-client==5.2.4
jupyter-core==4.4.0
jupyterhub==1.0.0
jupyterhub-ldapauthenticator==1.2.2
jupyterlab==0.35.6
jupyterlab-server==0.2.0
Keras==2.2.4
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
ldap3==2.6
Mako==1.0.10
Markdown==3.1.1
MarkupSafe==1.1.1
matplotlib==3.0.3
mistune==0.8.4
nbconvert==5.5.0
nbformat==4.4.0
notebook==5.7.8
numpy==1.16.4
oauthlib==3.0.1
pamela==1.0.0
pandocfilters==1.4.2
parso==0.4.0
pexpect==4.7.0
pickleshare==0.7.5
pipenv==2018.11.26
prometheus-client==0.6.0
prompt-toolkit==2.0.9
protobuf==3.8.0
ptyprocess==0.6.0
pyasn1==0.4.5
pycparser==2.19
pycurl==7.43.0
pyglet==1.3.2
Pygments==2.4.0
pygobject==3.20.0
pyOpenSSL==19.0.0
pyparsing==2.4.0
pyrsistent==0.15.2
python-apt==1.1.0b1+ubuntu0.16.4.2
python-dateutil==2.8.0
python-editor==1.0.4
PyVirtualDisplay==0.2.4
PyYAML==5.1.1
pyzmq==18.0.1
requests==2.21.0
scipy==1.3.0
Send2Trash==1.5.0
six==1.12.0
SQLAlchemy==1.3.3
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0
tensorflow-gpu==1.14.0
termcolor==1.1.0
terminado==0.8.2
testpath==0.4.2
tornado==6.0.2
traitlets==4.3.2
unattended-upgrades==0.1
urllib3==1.24.3
virtualenv==16.5.0
virtualenv-clone==0.5.3
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.15.4
wrapt==1.11.2
xvfbwrapper==0.2.9

I suspect something is wrong with the first Conv2D layer, possibly related to SequentialMemory's window_length. My guess is that the first Conv2D layer is not taking the input or convolving it correctly, so I transposed the batch in process_state_batch of my CustomProcessor class (a sketch of what I mean is below), but the problem was not solved.
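
For reference, here is a minimal sketch of that transpose (the name TransposingProcessor is only illustrative; it assumes the state batch from SequentialMemory arrives as (batch, window, height, width) and that the Conv2D layers use the default channels_last data_format):

import numpy as np
import rl.core

class TransposingProcessor(rl.core.Processor):
    def process_state_batch(self, batch):
        # Move the frame-stack (window) axis to the end so it acts as the
        # channel axis: (batch, window, H, W) -> (batch, H, W, window).
        return np.asarray(batch).transpose(0, 2, 3, 1)

With a processor like this, the first Conv2D would take input_shape=(120, 112, win_len) and the data_format='channels_first' argument would be dropped.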

Everything I wrote is here.

# import all I need

import retro
import keras as k
import numpy as np
import rl
import rl.memory
import rl.policy
import rl.agents.dqn
import rl.core
import sys
import gym
from PIL import Image

import tensorflow as tf
from keras.backend import tensorflow_backend

config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
session = tf.Session(config=config)
tensorflow_backend.set_session(session)

#set window size
win_size = (112,120)

#set log file
fo = open('log.txt', 'w')
sys.stdout = fo

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

"""
keras add extra dimension for batch.
and add history dimention for SequentialMemory.
but conv2d isn't able to accept 5D input.
so i'd made my processor class.
CustomProcessor convert RGB input into Gray input.

and conv2d layer convolute data which's shape (win_len, win_hei, history).
so transpose batch.

for more information, ref url bellow.
https://github.com/keras-rl/keras-rl/issues/229
"""

class CustomProcessor(rl.core.Processor):

    def process_observation(self, observation):
        # resize, convert to grayscale, and scale pixels to [0, 1]
        img = Image.fromarray(observation)
        img = img.resize(win_size).convert('L')
        return np.array(img) / 255


    # def process_state_batch(self, batch):
    #     batch = batch.transpose(0, 2, 3, 1)
    #     print(batch.shape)
    #     return batch


myprocessor = CustomProcessor()

"""
Gradius have action space which can take 9 action in same moment.
so i gotta discrete action space.
the way i'd taken is wrapping env class.
"""

class Discretizer(gym.ActionWrapper):

    def __init__(self, env):
        super(Discretizer, self).__init__(env)
        self._actions = [[0,0,0,0,0,0,0,0,0],
               [0,0,0,0,0,0,0,0,1],
               [1,0,0,0,0,0,0,0,0],
               [1,0,0,0,0,0,0,0,1],
               [0,0,0,0,1,0,0,0,0],
               [0,0,0,0,0,1,0,0,0],
               [0,0,0,0,0,0,1,0,0],
               [0,0,0,0,0,0,0,1,0],
               [0,0,0,0,1,0,1,0,0],
               [0,0,0,0,1,0,0,1,0],
               [0,0,0,0,0,1,0,1,0],
               [0,0,0,0,0,1,1,0,0],]
        for i in range(8):
            self._actions.append((np.array(self._actions[1]) + np.array(self._actions[i + 4])).tolist())
        for i in range(8):
            self._actions.append((np.array(self._actions[2]) + np.array(self._actions[i + 4])).tolist())
        for i in range(8):
            self._actions.append((np.array(self._actions[3]) + np.array(self._actions[i + 4])).tolist())
        # store the human-readable meaning of each action combo
        self.actions = [env.get_action_meaning(action) for action in self._actions]
        self.action_space = gym.spaces.Discrete(len(self._actions))

    def action(self, a):
        return self._actions[a].copy()

env = retro.make(game="Gradius-Nes", record="./Record")
env = Discretizer(env)

nb_actions = env.action_space.n

normal = k.initializers.glorot_normal()
model = k.Sequential()
win_len = 4
model.add(k.layers.Conv2D(
    32, kernel_size=8, strides=4, padding="same",
    input_shape=(win_len, 120, 112), kernel_initializer=normal,
    activation="relu", data_format='channels_first'))
print("check")
model.add(k.layers.Conv2D(
    64, kernel_size=4, strides=2, padding="same",
    kernel_initializer=normal,
    activation="relu"))
model.add(k.layers.Conv2D(
    64, kernel_size=3, strides=1, padding="same",
    kernel_initializer=normal,
    activation="relu"))
model.add(k.layers.Flatten())
model.add(k.layers.Dense(256, kernel_initializer=normal,
                         activation="relu"))
model.add(k.layers.Dense(nb_actions,
                         kernel_initializer=normal,
                         activation="linear"))

memory = rl.memory.SequentialMemory(limit=50000, window_length=win_len)
policy = rl.policy.EpsGreedyQPolicy()

"""
dqn = rl.agents.DQNAgent(processor=myprocessor, model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               target_model_update=1e-2, policy=policy)
"""
dqn = rl.agents.DQNAgent(processor=myprocessor, model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=1000,
               target_model_update=1e-2, policy=policy)

dqn.compile(k.optimizers.Adam(lr=1e-3), metrics=['mae'])
print(model.summary())
hist = dqn.fit(env, nb_steps=1500000, visualize=False, verbose=2)
print("end!")
dqn.save_weights("test_model.h5f", overwrite=True)

env.close()

PS:

I tried these fixes: 1) adding max-pooling layers and more dense layers; 2) using gradient clipping; 3) turning down Adam's learning rate. It still does not work. The code is below.

# import all I need

import retro
import keras as k
import numpy as np
import rl
import rl.memory
import rl.policy
import rl.agents.dqn
import rl.core
import sys
import gym
from PIL import Image

import tensorflow as tf
from keras.backend import tensorflow_backend

config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
session = tf.Session(config=config)
tensorflow_backend.set_session(session)

#set window size
win_size = (224,240)

#set log file
#fo = open('log.txt', 'w')
#sys.stdout = fo

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

"""
keras add extra dimension for batch.
and add history dimention for SequentialMemory.
but conv2d isn't able to accept 5D input.
so i'd made my processor class.
CustomProcessor convert RGB input into Gray input.

and conv2d layer convolute data which's shape (win_len, win_hei, history).
so transpose batch.

for more information, ref url bellow.
https://github.com/keras-rl/keras-rl/issues/229
"""

class CustomProcessor(rl.core.Processor):

    def process_observation(self, observation):
        # resize, convert to grayscale, and scale pixels to [0, 1]
        img = Image.fromarray(observation)
        img = img.resize(win_size).convert('L')
        return np.array(img) / 255


    # def process_state_batch(self, batch):
    #     batch = batch.transpose(0, 2, 3, 1)
    #     print(batch.shape)
    #     return batch


myprocessor = CustomProcessor()

"""
Gradius have action space which can take 9 action in same moment.
so i gotta discrete action space.
the way i'd taken is wrapping env class.
"""

class Discretizer(gym.ActionWrapper):

    def __init__(self, env):
        super(Discretizer, self).__init__(env)
        self._actions = [[0,0,0,0,0,0,0,0,0],
               [0,0,0,0,0,0,0,0,1],
               [1,0,0,0,0,0,0,0,0],
               [1,0,0,0,0,0,0,0,1],
               [0,0,0,0,1,0,0,0,0],
               [0,0,0,0,0,1,0,0,0],
               [0,0,0,0,0,0,1,0,0],
               [0,0,0,0,0,0,0,1,0],
               [0,0,0,0,1,0,1,0,0],
               [0,0,0,0,1,0,0,1,0],
               [0,0,0,0,0,1,0,1,0],
               [0,0,0,0,0,1,1,0,0],]
        for i in range(8):
            self._actions.append((np.array(self._actions[1]) + np.array(self._actions[i + 4])).tolist())
        for i in range(8):
            self._actions.append((np.array(self._actions[2]) + np.array(self._actions[i + 4])).tolist())
        for i in range(8):
            self._actions.append((np.array(self._actions[3]) + np.array(self._actions[i + 4])).tolist())
        # store the human-readable meaning of each action combo
        self.actions = [env.get_action_meaning(action) for action in self._actions]
        self.action_space = gym.spaces.Discrete(len(self._actions))

    def action(self, a):
        return self._actions[a].copy()

env = retro.make(game="Gradius-Nes", record="./Record")
env = Discretizer(env)

nb_actions = env.action_space.n

normal = k.initializers.glorot_normal()
model = k.Sequential()
win_len = 4
model.add(k.layers.Conv2D(
    32, kernel_size=8, strides=4, padding="same", activation="relu",
    input_shape=(win_len, 240, 224), kernel_initializer=normal, data_format="channels_first"))
model.add(k.layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding='same', data_format="channels_first"))
model.add(k.layers.Conv2D(
    64, kernel_size=4, strides=2, padding="same", activation="relu",
    kernel_initializer=normal, data_format="channels_first"))
model.add(k.layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding='same', data_format="channels_first"))
model.add(k.layers.Conv2D(
    64, kernel_size=3, strides=1, padding="same", activation="relu",
    kernel_initializer=normal, data_format="channels_first"))
model.add(k.layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding='same', data_format="channels_first"))
model.add(k.layers.Flatten())
model.add(k.layers.Dense(1024, kernel_initializer=normal, activation="relu"))
model.add(k.layers.Dense(1024, kernel_initializer=normal, activation="relu"))
model.add(k.layers.Dense(nb_actions,
                         kernel_initializer=normal,
                         activation="linear"))

memory = rl.memory.SequentialMemory(limit=50000, window_length=win_len)
policy = rl.policy.EpsGreedyQPolicy()

"""
dqn = rl.agents.DQNAgent(processor=myprocessor, model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               target_model_update=1e-2, policy=policy)
"""
dqn = rl.agents.DQNAgent(processor=myprocessor, model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=50000,
               target_model_update=1e-6, policy=policy)

dqn.compile(k.optimizers.Adam(lr=1e-7, clipnorm=1.), metrics=['mae'])
print(model.summary())
hist = dqn.fit(env, nb_steps=750000, visualize=False, verbose=2)
print("end!")
dqn.save_weights("test_model.h5f", overwrite=True)

env.close()
