TLDR
My MCTS agent runs locally without errors and wins >40% of its games against the heuristic-driven minimax agent, but it fails the autograder, which must pass before I can submit the project. The autograder throws `IndexError: Cannot choose from an empty sequence`. I'm looking for suggestions on which parts of the code are most likely to raise this exception.
Hi, I'm currently stuck on this project, which I need to clear before the program I'm enrolled in wraps up in two weeks. The task is to implement an agent that plays knight's Isolation (Isolation played with two chess knights) against a heuristic-driven minimax agent. The full implementation details of the game can be found here. For my project, the game is played on a 9 x 11 board using bitboard encoding. My MCTS implementation is straightforward and closely follows the pseudocode provided in this paper (page 6).
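(For anyone unfamiliar with the term: a bitboard simply packs one bit per board cell into a single integer, so occupancy tests and move masks become bitwise operations. The snippet below is only a generic illustration of that idea, not the project's actual encoding, which is defined in the isolation module linked above.)

```python
# Generic bitboard illustration (not the project's exact layout):
# map cell (row, col) on a 9 x 11 board to bit index row * 11 + col.
WIDTH, HEIGHT = 11, 9

def cell_bit(row, col):
    return 1 << (row * WIDTH + col)

board = 0                                   # empty board
board |= cell_bit(4, 5)                     # mark (4, 5) as blocked
is_blocked = bool(board & cell_bit(4, 5))   # True
```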
Essentially, the general MCTS method consists of these 4 parts, each implemented by the following nested functions in my CustomPlayer class (a condensed skeleton of how they fit together follows the list):
- Selection - tree_policy
- Expansion - best_child, expand
- Simulation - default_policy
- Backpropagation - backup_negamax, update_scores
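Per search iteration these pieces are wired together as shown below. This is just a condensed view of the tail of uct_search in the full listing that follows, with the four phases labelled, not a separate implementation:

```python
def uct_search(self, state, statlist, i):
    next_state = tree_policy(state)         # selection + expansion
    if not next_state.terminal_test():
        delta = default_policy(next_state)  # simulation (uniform random rollout)
        backup_negamax(delta)               # backpropagation via update_scores
    return best_child(state, 0)             # best root action found so far
```

The full implementation is below.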
```python
import math
import random
import time
import logging
from copy import deepcopy
from collections import namedtuple

from sample_players import DataPlayer


class CustomPlayer(DataPlayer):
    """ Implement your own agent to play knight's Isolation

    The get_action() method is the only required method for this project.
    You can modify the interface for get_action by adding named parameters
    with default values, but the function MUST remain compatible with the
    default interface.

    **********************************************************************
    NOTES:
    - The test cases will NOT be run on a machine with GPU access, nor be
      suitable for using any other machine learning techniques.

    - You can pass state forward to your agent on the next turn by assigning
      any pickleable object to the self.context attribute.
    **********************************************************************
    """
    def get_action(self, state):
        """ Employ an adversarial search technique to choose an action
        available in the current state calls self.queue.put(ACTION) at least

        This method must call self.queue.put(ACTION) at least once, and may
        call it as many times as you want; the caller will be responsible
        for cutting off the function after the search time limit has expired.

        See RandomPlayer and GreedyPlayer in sample_players for more examples.

        **********************************************************************
        NOTE:
        - The caller is responsible for cutting off search, so calling
          get_action() from your own code will create an infinite loop!
          Refer to (and use!) the Isolation.play() function to run games.
        **********************************************************************
        """
        logging.info("Move %s" % state.ply_count)
        self.queue.put(random.choice(state.actions()))

        i = 1
        statlist = []

        while (self.queue._TimedQueue__stop_time - 0.05) > time.perf_counter():
            next_action = self.uct_search(state, statlist, i)
            self.queue.put(next_action)
            i += 1

    def uct_search(self, state, statlist, i):
        plyturn = state.ply_count % 2
        Stat = namedtuple('Stat', 'state action utility visit nround')

        def tree_policy(state):
            statecopy = deepcopy(state)

            while not statecopy.terminal_test():
                # All taken actions at this depth
                tried = [s.action for s in statlist if s.state == statecopy]
                # See if there's any untried actions left
                untried = [a for a in statecopy.actions() if a not in tried]

                topop = []
                toappend = []

                if len(untried) > 0:
                    next_action = random.choice(untried)
                    statecopy = expand(statecopy, next_action)
                    break
                else:
                    next_action = best_child(statecopy, 1)

                    for k, s in enumerate(statlist):
                        if s.state == statecopy and s.action == next_action:
                            visit1 = statlist[k].visit + 1
                            news = statlist[k]._replace(visit=visit1)
                            news = news._replace(nround=i)

                            topop.append(k)
                            toappend.append(news)
                            break

                    update_scores(topop, toappend)
                    statecopy = statecopy.result(next_action)
            return statecopy

        def expand(state, action):
            """ Returns a state resulting from taking an action from the list of untried nodes """
            statlist.append(Stat(state, action, 0, 1, i))
            return state.result(action)

        def best_child(state, c):
            """ Returns the state resulting from taking the best action.
                c value between 0 (max score) and 1 (prioritize exploration)
            """
            # All taken actions at this depth
            tried = [s for s in statlist if s.state == state]

            maxscore = -999
            maxaction = []
            # Compute the score
            for t in tried:
                score = (t.utility / t.visit) + c * math.sqrt(2 * math.log(i) / t.visit)
                if score > maxscore:
                    maxscore = score
                    del maxaction[:]
                    maxaction.append(t.action)
                elif score == maxscore:
                    maxaction.append(t.action)

            if len(maxaction) < 1:
                logging.error("IndexError: maxaction is empty!")

            return random.choice(maxaction)

        def default_policy(state):
            """ The simulation to run when visiting unexplored nodes. Defaults to uniform random moves """
            while not state.terminal_test():
                state = state.result(random.choice(state.actions()))

            delta = state.utility(self.player_id)
            if abs(delta) == float('inf') and delta < 0:
                delta = -1
            elif abs(delta) == float('inf') and delta > 0:
                delta = 1

            return delta

        def backup_negamax(delta):
            """ Propagates the terminal utility up the search tree """
            topop = []
            toappend = []
            for k, s in enumerate(statlist):
                if s.nround == i:
                    if s.state.ply_count % 2 == plyturn:
                        utility1 = s.utility + delta
                        news = statlist[k]._replace(utility=utility1)
                    elif s.state.ply_count % 2 != plyturn:
                        utility1 = s.utility - delta
                        news = statlist[k]._replace(utility=utility1)

                    topop.append(k)
                    toappend.append(news)

            update_scores(topop, toappend)
            return

        def update_scores(topop, toappend):
            # Remove outdated tuples. Order needs to be in reverse or pop will fail!
            for p in sorted(topop, reverse=True):
                statlist.pop(p)
            # Add the updated ones
            for a in toappend:
                statlist.append(a)
            return

        next_state = tree_policy(state)
        if not next_state.terminal_test():
            delta = default_policy(next_state)
            backup_negamax(delta)

        return best_child(state, 0)
```
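For reference, the score computed inside best_child is the usual UCB1/UCT value, with the iteration counter i standing in for the parent's visit count:

$$\text{score}(j) = \frac{Q_j}{N_j} + c\,\sqrt{\frac{2\ln i}{N_j}}$$

where $Q_j$ is the accumulated utility of child $j$ (t.utility), $N_j$ its visit count (t.visit), and $c$ trades off exploitation against exploration: tree_policy calls best_child with c = 1 while descending the tree, and the final move is chosen with c = 0.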
The lack of syntax highlighting makes the code hard to read here, so feel free to view it on my github. I can run games locally without any problems, and my MCTS agent wins more than 40% of its games against the minimax player while staying under the 150 ms/move limit. However, when I submit my code to the autograder, it is rejected with the `IndexError: Cannot choose from an empty sequence` exception.
From my discussions with the course representative, we believe the error is most likely caused by the use of `random.choice()`. There are 4 instances of it in my implementation:
- Line 39, before the MCTS algorithm, to feed a random move into the queue
- Line 66, to randomly pick one of the untried moves
- Line 114, to randomly pick one action when several best actions are tied on score
- Line 122, to simulate the game with random moves until a terminal state is reached
I'm assuming the game implementation is correct, so calling state.actions() will always return a list of possible moves as long as the state is not terminal. Under that assumption, the only instance that could trigger this exception is item 3: items 1 and 4 simply pick at random from the available actions, and item 2 is guarded by an explicit check (`if len(untried) > 0`) so random.choice() is never fed an empty list. So I added logging to item 3 (even though the exception was never raised locally), and sure enough, after 50 games no exception was caught.
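For anyone who wants to reproduce the diagnosis, the kind of guard I have in mind is a thin wrapper around random.choice() that records which call site received an empty sequence before the exception propagates. The helper name logged_choice and the label strings are my own, for illustration only, not part of the project template:

```python
import logging
import random

def logged_choice(seq, label):
    """random.choice() with call-site logging.

    Records which of the four call sites received an empty sequence
    when the code runs under the autograder; the IndexError itself
    is still raised so the failure mode is unchanged.
    """
    if not seq:
        logging.error("empty sequence passed to random.choice() at %s", label)
    return random.choice(seq)
```

For example, item 3 would become `logged_choice(maxaction, "best_child")`; if the autograder surfaces log output, that would show which site actually fails.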
Apologies for the lengthy post, but I really hope someone out there can spot what I've missed in my implementation.