algorithm - Thompson sampling for a multi-armed bandit with non-binary rewards

Asked 2016-06-19T18:08:12.567

Viewed 1157 times

I use the following lines to update my Beta distribution on each trial and to recommend an arm (I use scipy.stats.beta):

from scipy.stats import beta

# Set in the class constructor: Beta(1, 1) prior, i.e. uniform on [0, 1]
self.prior = (1.0, 1.0)

# Method of the same class
def get_recommendation(self):
    sampled_theta = []
    for i in range(self.arms):
        # Construct the Beta posterior for arm i
        dist = beta(self.prior[0] + self.successes[i],
                    self.prior[1] + self.trials[i] - self.successes[i])
        # Draw one sample from the posterior
        sampled_theta.append(dist.rvs())
    # Return the index of the arm with the largest sampled value
    return sampled_theta.index(max(sampled_theta))

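But currently it only works when the reward is binary (either success or failure). I want to modify it so that it also works for non-binary rewards (e.g. rewards of 2300, 2000, ...). How can I do this?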
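For context, one commonly cited way to keep the Beta posterior when rewards are bounded but not binary is the rescaling trick described by Agrawal and Goyal: map each reward into [0, 1], then flip a coin with that probability and record the coin flip as the success or failure. The sketch below is only an illustration under stated assumptions. The class name `BetaThompsonSampler`, the `update` method, and the bounds `reward_min` and `reward_max` are hypothetical and not part of the code above, and the standard library's `random.betavariate` stands in for `scipy.stats.beta(...).rvs()` so that the block is self-contained. It assumes the rewards have known lower and upper bounds.

```python
import random


class BetaThompsonSampler:
    """Beta-Bernoulli Thompson sampling adapted to bounded real-valued rewards."""

    def __init__(self, arms, reward_min, reward_max, prior=(1.0, 1.0), seed=None):
        if reward_max <= reward_min:
            raise ValueError("reward_max must be greater than reward_min")
        self.arms = arms
        self.reward_min = reward_min  # assumed known lower bound on rewards
        self.reward_max = reward_max  # assumed known upper bound on rewards
        self.prior = prior
        self.successes = [0] * arms
        self.trials = [0] * arms
        self.rng = random.Random(seed)

    def get_recommendation(self):
        # Draw one sample per arm from its Beta posterior and pick the largest.
        sampled_theta = [
            self.rng.betavariate(
                self.prior[0] + self.successes[i],
                self.prior[1] + self.trials[i] - self.successes[i],
            )
            for i in range(self.arms)
        ]
        return sampled_theta.index(max(sampled_theta))

    def update(self, arm, reward):
        # Rescale the reward into [0, 1], clipping values outside the bounds.
        scaled = (reward - self.reward_min) / (self.reward_max - self.reward_min)
        scaled = min(1.0, max(0.0, scaled))
        # Bernoulli trial: count a success with probability equal to `scaled`.
        self.trials[arm] += 1
        if self.rng.random() < scaled:
            self.successes[arm] += 1


if __name__ == "__main__":
    # Hypothetical bounds: rewards are assumed to lie between 0 and 3000.
    sampler = BetaThompsonSampler(arms=3, reward_min=0.0, reward_max=3000.0, seed=42)
    arm = sampler.get_recommendation()
    sampler.update(arm, 2300.0)
    print(arm, sampler.trials, sampler.successes)
```

With these hypothetical bounds, a reward of 2300 counts as a success with probability 2300/3000, so the expected success rate of each arm is its rescaled mean reward. If the rewards are unbounded, a different posterior (for example a Gaussian one) would be needed instead.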


0 answers
