python - Python and HyperOpt: How to do a multi-process grid search?

Question

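I am trying to tune some parameters, and the search space is very large. I have 5 dimensions so far, and that will probably grow to about 10. I think I could get a significant speedup if I could figure out how to multiprocess the search, but I can't find any good way to do it. I am using hyperopt and can't figure out how to make it use more than 1 core. Here is my code with all the irrelevant parts removed: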

from numpy    import random
from pandas   import DataFrame
from hyperopt import fmin, tpe, hp, Trials





def calc_result(x):

    huge_df = DataFrame(random.randn(100000, 5), columns=['A', 'B', 'C', 'D', 'E'])

    total = 0

    # Assume that I MUST iterate
    for idx_and_row in huge_df.iterrows():
        idx = idx_and_row[0]
        row = idx_and_row[1]


        # Assume there is no way to optimize here
        curr_sum = row['A'] * x['adjustment_1'] + \
                   row['B'] * x['adjustment_2'] + \
                   row['C'] * x['adjustment_3'] + \
                   row['D'] * x['adjustment_4'] + \
                   row['E'] * x['adjustment_5']


        total += curr_sum

    # In real life I want the total as high as possible, but the minimizer needs a value to minimize, so negate it
    total_as_neg = total * -1

    print(total_as_neg)

    return total_as_neg


space = {'adjustment_1': hp.quniform('adjustment_1', 0, 1, 0.001),
         'adjustment_2': hp.quniform('adjustment_2', 0, 1, 0.001),
         'adjustment_3': hp.quniform('adjustment_3', 0, 1, 0.001),
         'adjustment_4': hp.quniform('adjustment_4', 0, 1, 0.001),
         'adjustment_5': hp.quniform('adjustment_5', 0, 1, 0.001)}

trials = Trials()

best = fmin(fn        = calc_result,
            space     = space,
            algo      = tpe.suggest,
            max_evals = 20000,
            trials    = trials)

I have 4 cores so far, but I can get basically as many as I need. How can I get hyperopt to use more than 1 core, or is there a library that can multiprocess this?

score 5 · Accepted Answer

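If you have a Mac or Linux (or the Windows Subsystem for Linux), you can add about 10 lines of code to do this in parallel with ray. If you install ray from the latest wheels, you can run your script with the minimal modifications shown below to do a parallel/distributed grid search with HyperOpt. At a high level, it runs fmin with tpe.suggest and creates a Trials object internally, in a parallel fashion.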

from numpy    import random
from pandas   import DataFrame
from hyperopt import fmin, tpe, hp, Trials


def calc_result(x, reporter):  # add a reporter param here

    huge_df = DataFrame(random.randn(100000, 5), columns=['A', 'B', 'C', 'D', 'E'])

    total = 0

    # Assume that I MUST iterate
    for idx_and_row in huge_df.iterrows():
        idx = idx_and_row[0]
        row = idx_and_row[1]


        # Assume there is no way to optimize here
        curr_sum = row['A'] * x['adjustment_1'] + \
                   row['B'] * x['adjustment_2'] + \
                   row['C'] * x['adjustment_3'] + \
                   row['D'] * x['adjustment_4'] + \
                   row['E'] * x['adjustment_5']


        total += curr_sum

    # In real life I want the total as high as possible. No manual negation is
    # needed here: Ray will negate this by itself to feed into HyperOpt.
    reporter(timesteps_total=1, episode_reward_mean=total)


space = {'adjustment_1': hp.quniform('adjustment_1', 0, 1, 0.001),
         'adjustment_2': hp.quniform('adjustment_2', 0, 1, 0.001),
         'adjustment_3': hp.quniform('adjustment_3', 0, 1, 0.001),
         'adjustment_4': hp.quniform('adjustment_4', 0, 1, 0.001),
         'adjustment_5': hp.quniform('adjustment_5', 0, 1, 0.001)}

import ray
import ray.tune as tune
from ray.tune.hpo_scheduler import HyperOptScheduler

ray.init()
tune.register_trainable("calc_result", calc_result)
tune.run_experiments({"experiment": {
    "run": "calc_result",
    "repeat": 20000,
    "config": {"space": space}}}, scheduler=HyperOptScheduler())

score 1 · Accepted Answer

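You can use multiprocessing to run tasks that, by sidestepping Python's Global Interpreter Lock, effectively run concurrently on the multiple processors available.

To run a multiprocessing task, you must instantiate a Pool and have that object execute a map function over an iterable.

The map function simply applies a function to every element of an iterable, such as a list, and returns another list containing the results.

As a search example, this gets all items greater than 5 from a list: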

from itertools import chain
from multiprocessing import Pool

def filter_gt_5(x):
    # Return every item in this chunk that is greater than 5
    return [i for i in x if i > 5]

if __name__ == '__main__':
    a_list = [6, 5, 4, 3, 7, 8, 10, 9, 2]
    with Pool(4) as p:
        # find a better way to split your list.
        lists = p.map(filter_gt_5, [a_list[:3], a_list[3:6], a_list[6:]])
    # this will join the lists in one.
    filtered_list = list(chain(*lists))

In your case, you will have to split up your search space.
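Applied to the question's problem, splitting the search space could look roughly like the sketch below. It is only an illustration under stated assumptions: the names DATA, score and grid_search are hypothetical, the data is a small random stand-in for the real DataFrame, and the grid is a coarse 3 values per dimension rather than the 0.001 step from the question.

```python
import random
from itertools import product
from multiprocessing import Pool

# Hypothetical stand-in data. The fixed seed means every worker process
# rebuilds exactly the same rows if the module is re-imported.
_rng = random.Random(0)
DATA = [[_rng.gauss(0, 1) for _ in range(5)] for _ in range(1000)]


def score(weights):
    """Evaluate one grid point; returns (total, weights), higher is better."""
    total = sum(sum(v * w for v, w in zip(row, weights)) for row in DATA)
    return total, weights


def grid_search(steps=3, processes=4):
    """Score every combination of `steps` values per dimension in parallel."""
    axis = [i / (steps - 1) for i in range(steps)]
    grid = list(product(axis, repeat=5))  # steps**5 candidate points
    with Pool(processes) as pool:
        # chunksize hands each worker several points at once to cut overhead
        results = pool.map(score, grid, chunksize=16)
    return max(results, key=lambda r: r[0])


if __name__ == '__main__':
    best_total, best_weights = grid_search()
    print(best_total, best_weights)
```

Each grid point is independent of the others, so the work spreads across however many cores you give the Pool. Note that this is an exhaustive grid, not the adaptive TPE search that hyperopt performs.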

score 1 · Accepted Answer

You can achieve what you're asking by using SparkTrials() instead of Trials() from hyperopt.

Refer to the documentation here.

SparkTrials API:
SparkTrials may be configured via 3 arguments, all of which are optional:

parallelism

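The maximum number of trials to evaluate concurrently. Greater parallelism allows scale-out testing of more hyperparameter settings. Defaults to the number of Spark executors.

Trade-offs: The parallelism parameter can be set in conjunction with the max_evals parameter in fmin(). Hyperopt will test max_evals total settings for your hyperparameters, in batches of size parallelism. If parallelism = max_evals, then Hyperopt will do random search: it will select all hyperparameter settings to test independently and then evaluate them in parallel. If parallelism = 1, then Hyperopt can make full use of adaptive algorithms like Tree of Parzen Estimators (TPE), which iteratively explore the hyperparameter space: each new hyperparameter setting tested will be chosen based on previous results. Setting parallelism somewhere between 1 and max_evals lets you trade off scalability (getting results faster) against adaptiveness (sometimes getting better models).

Limits: There is currently a hard cap of 128 on parallelism. SparkTrials will also check the cluster's configuration to see how many concurrent tasks Spark will allow; if parallelism exceeds this maximum, SparkTrials will reduce parallelism to this maximum.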
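To make the trade-off concrete, here is a rough back-of-the-envelope helper. It is not part of SparkTrials: plan_trials and cluster_max_tasks are hypothetical names, and treating the run as fixed-size sequential batches is a simplification of the description above.

```python
import math

MAX_PARALLELISM = 128  # hard cap described in the text above


def plan_trials(max_evals, parallelism, cluster_max_tasks):
    """Estimate the effective parallelism and the number of sequential batches.

    cluster_max_tasks is an assumed input: the number of concurrent tasks
    the Spark cluster would allow.
    """
    effective = min(parallelism, MAX_PARALLELISM, cluster_max_tasks, max_evals)
    batches = math.ceil(max_evals / effective)
    return effective, batches


if __name__ == '__main__':
    print(plan_trials(32, 1, 16))         # (1, 32): fully adaptive, slowest
    print(plan_trials(32, 4, 16))         # (4, 8): a middle ground
    print(plan_trials(32, 32, 64))        # (32, 1): behaves like random search
    print(plan_trials(20000, 500, 1000))  # (128, 157): capped at 128
```

The fewer batches there are, the fewer times the adaptive algorithm gets to learn from earlier results before choosing new settings.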

Code snippet:

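from hyperopt import SparkTrials, fmin, hp, tpe, STATUS_OK

# Set parallelism to the number of cores you want to use (4 is a placeholder)
spark_trials = SparkTrials(parallelism=4)

# train, search_space and algo are assumed to be defined elsewhere
best_hyperparameters = fmin(
  fn=train,
  space=search_space,
  algo=algo,
  max_evals=32,
  trials=spark_trials)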

Another useful reference:

score 0 · Accepted Answer

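Just a few side notes on your question. I have also been doing hyperparameter search recently; if you have your own reasons for your approach, please ignore me.

The thing is, you should prefer random search over grid search.

Here is the paper where they made this point.

And here is some explanation: basically, random search is better distributed over the individual sub-features, while grid search is better distributed over the whole feature space, which is why grid search feels like the way to go.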
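One way to see the difference is to count how many distinct values each dimension gets for the same evaluation budget. The sketch below is only an illustration; the helper names are hypothetical and the two-dimensional unit square is an assumed toy search space.

```python
import random
from itertools import product


def distinct_per_dimension(points):
    """Count how many different values were tried along each dimension."""
    return [len({p[d] for p in points}) for d in range(len(points[0]))]


def grid_points(steps, dims):
    """Regular grid on [0, 1] with `steps` values per dimension."""
    axis = [i / (steps - 1) for i in range(steps)]
    return list(product(axis, repeat=dims))


def random_points(n, dims, seed=0):
    """n points drawn uniformly at random from the unit cube."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(dims)) for _ in range(n)]


if __name__ == '__main__':
    grid = grid_points(steps=3, dims=2)  # 9 evaluations
    rand = random_points(n=9, dims=2)    # 9 evaluations
    print(distinct_per_dimension(grid))  # [3, 3]
    print(distinct_per_dimension(rand))  # many more distinct values per axis
```

With 9 evaluations the grid only ever tries 3 values of each parameter, while random sampling tries a different value of each parameter on nearly every evaluation. If only one of the parameters really matters, the random sample explores it far more thoroughly.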

