我正在一个具有 MultiDiscrete 观察和动作空间的健身房环境中实现基于 stable-baseline3 的 A2C 的 RL 代理。学习时出现以下错误
RuntimeError: Class values must be smaller than num_classes.
这是一个典型的 PyTorch 错误,但我不知道它的起源。我附上我的代码。在代码之前,我解释了这个想法。我们训练一个自定义环境,我们有几台机器(我们首先只训练两台机器),需要在机器坏掉之前决定它们的生产率。动作空间还包括在某个时间距离内安排维护的决定,并且对于每台机器,它决定要维护哪台机器。因此,观察空间是每台机器的消耗状态和计划维护的时间距离(也可以是“未计划的”),而行动空间是每台机器的生产率、每台机器的维护决策和呼叫日程。当总产量超过阈值时给予奖励,负奖励是维护和调度的成本。现在,我知道这是一件大事,我们需要减少这些空间,但实际问题是 PyTorch 的这个错误。我看不出它是从哪里来的。A2C 在观察和动作中都处理多离散空间,但我不知道它的起源。我们使用 MlpPolicy 设置 A2C 算法,并尝试在此环境下训练策略。我附上代码。
from gym import Env
from gym.spaces import MultiDiscrete
import numpy as np
from numpy.random import poisson
import random
from functools import reduce
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense, Flatten
# from tensorflow.keras.optimizers import Adam
from stable_baselines3 import A2C
from stable_baselines3.common.env_checker import check_env
class MaintenanceEnv(Env):
def __init__(self, max_machine_states_vec, production_rates_vec, production_threshold, scheduling_horizon, operations_horizon = 100):
"""
Returns:
self.action_space is a vector with the maximum production rate fro each machine, a binary call-to-maintenance and a binary call-to-schedule
"""
num_machines = len(max_machine_states_vec)
assert len(max_machine_states_vec) == len(production_rates_vec), "Machine states and production rates have different cardinality"
# Actions we can take, down, stay, up
self.action_space = MultiDiscrete(production_rates_vec + num_machines*[2] + [2]) ### Action space is the production rate from 0 to N and the choice of scheduling
# Temperature array
self.observation_space = MultiDiscrete(max_machine_states_vec + [scheduling_horizon+2]) ### Observation space is the 0,...,L for each machine + the scheduling state including "ns" (None = "ns")
# Set start temp
self.state = num_machines*[0] + [0]
# Set shower length
self.operations_horizon = operations_horizon
self.time_to_finish = operations_horizon
self.scheduling_horizon = scheduling_horizon
self.max_states = max_machine_states_vec
self.production_threshold = production_threshold
def step(self, action):
"""
Notes: Schedule state
"""
num_machines = len(self.max_states)
maintenance_distance_index = -1
reward = 0
done = False
info = {}
### Cost parameters
cost_setup_schedule = 5
cost_preventive_maintenance = 10
cost_corrective_maintenance = 50
reward_excess_on_production = 5
cost_production_deficit = 10
cost_fixed_penalty = 10
failure_reward = -10**6
amount_produced = 0
### Errors
if action[maintenance_distance_index] == 1 and self.state[-1] != self.scheduling_horizon + 1: # Case when you set a reparation scheduled, but it is already scheduled. Not possible.
reward = failure_reward ###It should not be possible
done = True
return self.state, reward, done, info
if self.state[-1] == 0:
for pos in range(num_machines):
if action[num_machines + pos] == 1 and self.state[maintenance_distance_index] > 0: ### Case when maintenance is applied, but schedule is not involved yet. Not possible.
reward = failure_reward ### It should not be possible
done = True
return self.state, reward, done, info
for pos in range(num_machines):
if self.state[pos] == self.max_states[pos] and action[pos] > 0: # Case when machine is broken, but it is producing
reward = failure_reward ### It should not be possible
done = True
return self.state, reward, done, info
if self.state[maintenance_distance_index] == 0:
for pos in range(num_machines):
if action[num_machines+pos] == 1 and action[pos] > 0 : ### Case when it is maintenance time but the machines to be maintained keeps working. Not possible
reward = failure_reward ### It should not be possible
done = True
return self.state, reward, done, info
### State update
for pos in range(num_machines):
if self.state[pos] < self.max_states[pos] and self.state[maintenance_distance_index] > 0: ### The machine is in production, state update includes product amount
# self.state[pos] = min(self.max_states[pos] , self.state[pos] + poisson(action[pos] / self.action_space[pos])) ### Temporary: for I delete from the state the result of a poisson distribution depending on the production rate, Poisson is temporary
self.state[pos] = min(self.max_states[pos] , self.state[pos] + action[pos]) ### Temporary: Consumption rate is deterministic
amount_produced += action[pos]
if amount_produced >= self.production_threshold:
reward += reward_excess_on_production * (amount_produced - self.production_threshold)
else:
reward -= cost_production_deficit * (self.production_threshold - amount_produced)
reward -= cost_fixed_penalty
if action[maintenance_distance_index] == 1 and self.state[maintenance_distance_index] == self.scheduling_horizon + 1: ### You call a schedule when the state is not scheduled
self.state[maintenance_distance_index] = self.scheduling_horizon
reward -= cost_setup_schedule
elif self.state[maintenance_distance_index] > 0 and self.state[maintenance_distance_index] <= self.scheduling_horizon: ### You reduced the distance from scheduled maintenance
self.state[maintenance_distance_index] -= 1
for pos in range(num_machines): ### Case when we are repairing the machines and we need to pay the costs of repairment, and set them as new
if action[num_machines+pos] == 1 :
if self.state[pos] < self.max_states[pos]:
reward -= cost_preventive_maintenance
elif self.state[pos] == self.max_states[pos]:
reward -= cost_corrective_maintenance
self.state[pos] = 0
if self.state[maintenance_distance_index] == 0: ### when maintenance have been performed, reset the scheduling state to "not scheduled"
self.state[maintenance_distance_index] = self.scheduling_horizon + 1
### Time threshold
if self.time_to_finish > 0:
self.time_to_finish -= 1
else:
done = True
# Return step information
return self.state, reward, done, info
def render(self):
# Implement viz
pass
def reset(self):
# Reset shower temperature
num_machines = len(self.max_states)
self.state = np.array(num_machines*[0] + [0])
self.time_to_finish = self.operations_horizon
return self.state
def build_model(states, actions):
model = Sequential()
model.add(Dense(24, activation='relu', input_shape=states)) #
model.add(Dense(24, activation='relu'))
model.add(Dense(actions, activation='linear'))
return model
if __name__ == "__main__":
###GLOBAL COSTANTS AND PARAMETERS
NUMBER_MACHINES = 2
FAILURE_STATE_LIMIT = 8
MAXIMUM_PRODUCTION_RATE = 5
SCHEDULING_HORIZON = 4
PRODUCTION_THRESHOLD = 20
machine_states = NUMBER_MACHINES * [4]
failure_states = NUMBER_MACHINES * [FAILURE_STATE_LIMIT]
production_rates = NUMBER_MACHINES * [MAXIMUM_PRODUCTION_RATE]
### Setting environment
env = MaintenanceEnv(failure_states, production_rates, PRODUCTION_THRESHOLD, SCHEDULING_HORIZON)
model = A2C("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)
obs = env.reset()
for i in range(1000):
action, _state = model.predict(obs, deterministic=True)
obs, reward, done, info = env.step(action)
# env.render()
if done:
obs = env.reset()
我觉得这是由于 MultiDiscrete 空间,但我寻求帮助。谢谢 :)