
I am new to reinforcement learning, and I want to use this technique on audio signals. I built a basic step signal that I would like to flatten, in order to get started with OpenAI Gym and reinforcement learning in general.

To do this, I use the GoalEnv provided by OpenAI, since I know what the goal is: the flat signal. Here is a plot of the input signal and the desired signal:

Image taken from https://imgur.com/pgdlTWK

The step function calls _set_action, which performs achieved_signal = convolution(input_signal, low_pass_filter) - offset; low_pass_filter also takes a cutoff frequency as input. The cutoff frequency and the offset are the parameters that act on the observation to produce the output signal. The reward function returns the negative L2-norm of the difference between the achieved signal and the desired signal, so that large norms are penalized.

Here is the environment I created:

import numpy as np
import gym
from gym import spaces
from scipy import signal

def butter_lowpass(cutoff, nyq_freq, order=4):
    # normalize the cutoff by the Nyquist frequency for scipy's Butterworth design
    normal_cutoff = float(cutoff) / nyq_freq
    b, a = signal.butter(order, normal_cutoff, btype='lowpass')
    return b, a

def butter_lowpass_filter(data, cutoff_freq, nyq_freq, order=4):
    b, a = butter_lowpass(cutoff_freq, nyq_freq, order=order)
    y = signal.filtfilt(b, a, data)
    return y

class StepSignal(gym.GoalEnv):

    def __init__(self, input_signal, sample_rate, desired_signal):
        super(StepSignal, self).__init__()

        self.initial_signal = input_signal
        self.signal = self.initial_signal.copy()
        self.sample_rate = sample_rate
        self.desired_signal = desired_signal
        self.distance_threshold = 10e-1

        max_offset = abs(max( max(self.desired_signal) , max(self.signal))
                 - min( min(self.desired_signal) , min(self.signal)) )

        self.action_space = spaces.Box(low=np.array([10e-4, -max_offset]),
                                       high=np.array([self.sample_rate/2 - 0.1, max_offset]),
                                       dtype=np.float16)

        obs = self._get_obs()
        self.observation_space = spaces.Dict(dict(
        desired_goal=spaces.Box(-np.inf, np.inf, shape=obs['achieved_goal'].shape, dtype='float32'),
        achieved_goal=spaces.Box(-np.inf, np.inf, shape=obs['achieved_goal'].shape, dtype='float32'),
        observation=spaces.Box(-np.inf, np.inf, shape=obs['observation'].shape, dtype='float32'),
        ))

    def step(self, action):
        # rescale the action from [-1, 1] to [0, high - low]
        act_range = self.action_space.high - self.action_space.low
        action = act_range / 2 * (action + 1)
        self._set_action(action)
        obs = self._get_obs()
        done = False

        info = {
                'is_success': self._is_success(obs['achieved_goal'], self.desired_signal),
               }
        reward = -self.compute_reward(obs['achieved_goal'],self.desired_signal)
        return obs, reward, done, info

    def reset(self):
        self.signal = self.initial_signal.copy()
        return self._get_obs()


    def _set_action(self, actions):
        actions = np.clip(actions,a_max=self.action_space.high,a_min=self.action_space.low)
        cutoff = actions[0]
        offset = actions[1]
        print(cutoff, offset)
        self.signal = butter_lowpass_filter(self.signal, cutoff, self.sample_rate/2) - offset

    def _get_obs(self):
        obs = self.signal
        achieved_goal = self.signal
        return {
        'observation': obs.copy(),
        'achieved_goal': achieved_goal.copy(),
        'desired_goal': self.desired_signal.copy(),
        }

    def compute_reward(self, goal_achieved, goal_desired):
        d = np.linalg.norm(goal_desired-goal_achieved)
        return d


    def _is_success(self, achieved_goal, desired_goal):
        d = self.compute_reward(achieved_goal, desired_goal)
        return (d < self.distance_threshold).astype(np.float32)
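
Not shown above is the Gym registration that gym.make('stepsignal-v0', ...) relies on further down; here is a minimal sketch of it, where the module path step_signal_env is only a placeholder for wherever StepSignal actually lives:

from gym.envs.registration import register

register(
    id='stepsignal-v0',
    entry_point='step_signal_env:StepSignal',  # placeholder module path
    max_episode_steps=5,
)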

The environment can then be instantiated as a variable and flattened with FlattenDictWrapper, as suggested at the end of https://openai.com/blog/ingredients-for-robotics-research/:

length = 20
sample_rate = 30 # 30 Hz
in_signal_length = 20*sample_rate # 20sec signal
x = np.linspace(0, length, in_signal_length)

# Desired output
y = 3*np.ones(in_signal_length)
# Step signal
in_signal = 0.5*(np.sign(x-5)+9)

env = gym.make('stepsignal-v0', input_signal=in_signal, sample_rate=sample_rate, desired_signal=y)
env = gym.wrappers.FlattenDictWrapper(env, dict_keys=['observation','desired_goal'])
env.reset()
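
As a quick sanity check, the wrapper concatenates the selected keys into a single flat Box, so with 600 samples per signal the observation should be 1200-dimensional (the shapes in the comments below are what I expect, not copied from a run):

print(env.observation_space.shape)  # expected: (1200,) = 600 'observation' + 600 'desired_goal'
print(env.reset().shape)            # expected: (1200,)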

The agent is a DDPG agent from keras-rl, since the actions can take any value in the continuous action_space described in the environment. I am wondering why the actor and critic networks need an input with an extra dimension, in input_shape=(1,) + env.observation_space.shape:

from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Flatten, Input, Concatenate
from keras.optimizers import Adam
from rl.agents import DDPGAgent
from rl.memory import SequentialMemory
from rl.policy import BoltzmannQPolicy
from rl.random import OrnsteinUhlenbeckProcess

nb_actions = env.action_space.shape[0]

# Building Actor agent (Policy-net)
actor = Sequential()
actor.add(Flatten(input_shape=(1,) + env.observation_space.shape, name='flatten'))
actor.add(Dense(128))
actor.add(Activation('relu'))
actor.add(Dense(64))
actor.add(Activation('relu'))
actor.add(Dense(nb_actions))
actor.add(Activation('linear'))
actor.summary()

# Building Critic net (Q-net)
action_input = Input(shape=(nb_actions,), name='action_input')
observation_input = Input(shape=(1,) + env.observation_space.shape, name='observation_input')
flattened_observation = Flatten()(observation_input)
x = Concatenate()([action_input, flattened_observation])
x = Dense(128)(x)
x = Activation('relu')(x)
x = Dense(64)(x)
x = Activation('relu')(x)
x = Dense(1)(x)
x = Activation('linear')(x)
critic = Model(inputs=[action_input, observation_input], outputs=x)
critic.summary()

# Building Keras agent
memory = SequentialMemory(limit=2000, window_length=1)
policy = BoltzmannQPolicy()
random_process = OrnsteinUhlenbeckProcess(size=nb_actions, theta=0.6, mu=0, sigma=0.3)
agent = DDPGAgent(nb_actions=nb_actions, actor=actor, critic=critic, critic_action_input=action_input,
                  memory=memory, nb_steps_warmup_critic=2000, nb_steps_warmup_actor=10000,
                  random_process=random_process, gamma=.99, target_model_update=1e-3)
agent.compile(Adam(lr=1e-3, clipnorm=1.), metrics=['mae'])

Finally, the agent is trained:

import pickle

filename = 'mem20k_heaviside_flattening'
hist = agent.fit(env, nb_steps=10, visualize=False, verbose=2, nb_max_episode_steps=5)
with open('./history_dqn_test_' + filename + '.pickle', 'wb') as handle:
    pickle.dump(hist.history, handle, protocol=pickle.HIGHEST_PROTOCOL)
agent.save_weights('h5f_files/dqn_{}_weights.h5f'.format(filename), overwrite=True)

Now the problem: for the same instance of my environment, the agent always seems to get stuck around the same output value across all episodes:

Image taken from https://imgur.com/kaKhZNF

The cumulative reward is negative because I only allow the agent to receive negative rewards. I used https://github.com/openai/gym/blob/master/gym/envs/robotics/fetch_env.py, which is part of the OpenAI code, as an example. Within one episode, I should get different sets of actions converging towards a (cutoff_final, offset_final) that brings my input step signal close to my desired flat signal, which is clearly not the case. Moreover, I would expect different actions across successive episodes.


1 Answer


I am wondering why the actor and critic networks need an input with an extra dimension, in input_shape=(1,) + env.observation_space.shape

I think GoalEnv was designed with HER (Hindsight Experience Replay) in mind, since it uses the "sub-spaces" inside observation_space to learn from a sparse reward signal (there is a paper on the OpenAI site explaining how HER works). I haven't looked at the implementation, but my guess is that the extra input is needed because HER also processes the "goal" parameters.
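
To illustrate the idea (this is my own rough sketch, not OpenAI's implementation): HER replays stored transitions with desired_goal replaced by an achieved_goal reached later in the same episode and recomputes the reward from those dict sub-spaces, which is exactly why the observation is split into observation / achieved_goal / desired_goal:

import numpy as np

def her_relabel(episode, compute_reward, k=4):
    # episode: list of (obs, action, next_obs) tuples of GoalEnv-style dicts
    # compute_reward: function of (achieved_goal, desired_goal), as in the question
    extra = []
    for t, (obs, action, next_obs) in enumerate(episode):
        # sample k goals that were actually achieved later in the episode
        future = np.random.randint(t, len(episode), size=k)
        for f in future:
            new_goal = episode[f][2]['achieved_goal']
            reward = compute_reward(next_obs['achieved_goal'], new_goal)
            extra.append((obs, action, reward, next_obs, new_goal))
    return extra

As far as I know, keras-rl's DDPGAgent does not do any relabelling like this.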

Since you don't seem to be using HER (it works with any off-policy algorithm, including DQN, DDPG, etc.), you should hand-craft an informative reward function (one that is not binary, e.g. 1 if the goal is achieved and 0 otherwise) and use the base Env class. The reward should be computed inside the step method, because in an MDP the reward is a function like r(s, a, s'), and there you will probably have all the information you need. Hope that helps.
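
A minimal sketch of what I mean, reusing butter_lowpass_filter and the signals from your question (treat it as an outline, not a drop-in replacement):

import gym
import numpy as np
from gym import spaces

class StepSignalDense(gym.Env):
    # plain Env variant: dense reward computed directly inside step()

    def __init__(self, input_signal, sample_rate, desired_signal):
        super().__init__()
        self.initial_signal = input_signal
        self.signal = input_signal.copy()
        self.sample_rate = sample_rate
        self.desired_signal = desired_signal
        max_offset = abs(max(desired_signal.max(), input_signal.max())
                         - min(desired_signal.min(), input_signal.min()))
        self.action_space = spaces.Box(low=np.array([1e-3, -max_offset]),
                                       high=np.array([sample_rate/2 - 0.1, max_offset]),
                                       dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf,
                                            shape=input_signal.shape, dtype=np.float32)

    def step(self, action):
        cutoff, offset = np.clip(action, self.action_space.low, self.action_space.high)
        self.signal = butter_lowpass_filter(self.signal, cutoff, self.sample_rate/2) - offset
        # informative (dense) reward: negative distance to the desired signal
        reward = -np.linalg.norm(self.desired_signal - self.signal)
        return self.signal.copy(), reward, False, {}

    def reset(self):
        self.signal = self.initial_signal.copy()
        return self.signal.copy()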

Answered 2019-11-10T03:14:41.157