
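I have started a few trials using ray.tune's PB2 scheduler. They use 8 actors and perturb every 20 steps. Actors 0-6 run without any problems, but actor 7 consistently hits an error during the second 20-step epoch. In the terminal, I get the following message: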

Traceback (most recent call last):  
  File "./tune_pb2.py", line 303, in <module>  
    raise_on_failed_trial=False)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/ray/tune/tune.py", line 411, in run  
    runner.step()  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 572, in step  
    self.trial_executor.on_no_available_trials(self)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/ray/tune/trial_executor.py", line 183, in on_no_available_trials  
    raise TuneError("There are paused trials, but no more pending "
ray.tune.error.TuneError: There are paused trials, but no more pending trials with sufficient resources.

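I am training with 2 GPUs and 2 CPUs, one of each per actor. At the point of failure, actors 0-6 have finished their second epoch and are paused; actor 7 is the only one still running. The error.txt file for that trial contains the following: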

Traceback (most recent call last):  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 755, in _process_trial
    self, trial, flat_result)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/ray/tune/schedulers/pbt.py", line 415, in on_trial_result
    lower_quantile)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/ray/tune/schedulers/pbt.py", line 479, in _perturb_trial
    self._exploit(trial_runner.trial_executor, trial, trial_to_clone)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/ray/tune/schedulers/pbt.py", line 532, in _exploit
    new_config = self._get_new_config(trial, trial_to_clone)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/ray/tune/schedulers/pb2.py", line 357, in _get_new_config
    trial_to_clone.config)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/ray/tune/schedulers/pb2.py", line 174, in explore
    X, y, current, newpoint, bounds, num_f=len(t_r.columns))  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/ray/tune/schedulers/pb2.py", line 83, in select_config
    m = GPy.models.GPRegression(X, y, kernel)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/paramz/parameterized.py", line 58, in __call__
    self.initialize_parameter()  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/paramz/core/parameter_core.py", line 337, in initialize_parameter
    self.trigger_update()  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/paramz/core/updateable.py", line 79, in trigger_update
    self._trigger_params_changed(trigger_parent)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/paramz/core/parameter_core.py", line 134, in _trigger_params_changed
    self.notify_observers(None, None if trigger_parent else -np.inf)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/paramz/core/observable.py", line 91, in notify_observers
    [callble(self, which=which) for _, _, callble in self.observers]  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/paramz/core/observable.py", line 91, in <listcomp>
    [callble(self, which=which) for _, _, callble in self.observers]  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/paramz/core/parameter_core.py", line 508, in _parameters_changed_notification
    self.parameters_changed()  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/GPy/core/gp.py", line 267, in parameters_changed
    self.posterior, self._log_marginal_likelihood, self.grad_dict = self.inference_method.inference(self.kern, self.X, self.likelihood, self.Y_normalized, self.mean_function, self.Y_metadata)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/GPy/inference/latent_function_inference/exact_gaussian_inference.py", line 53, in inference
    K = kern.K(X)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/GPy/kern/src/kernel_slice_operations.py", line 110, in wrap
    ret = f(self, s.X, s.X2, *a, **kw)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/ray/tune/schedulers/pb2_utils.py", line 42, in K
    dists = pairwise_distances(T1, T2, "cityblock")  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/sklearn/metrics/pairwise.py", line 1779, in pairwise_distances
    return _parallel_pairwise(X, Y, func, n_jobs, **kwds)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/sklearn/metrics/pairwise.py", line 1360, in _parallel_pairwise
    return func(X, Y, **kwds)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/sklearn/metrics/pairwise.py", line 781, in manhattan_distances
    X, Y = check_pairwise_arrays(X, Y)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/sklearn/metrics/pairwise.py", line 147, in check_pairwise_arrays
    estimator=estimator)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/sklearn/utils/validation.py", line 645, in check_array
    allow_nan=force_all_finite == 'allow-nan')  
  File "/home/john/anaconda3/envs/python3.7/lib/python3.7/site-packages/sklearn/utils/validation.py", line 99, in _assert_all_finite
    msg_dtype if msg_dtype is not None else X.dtype)  
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Unless I am missing something, the error appears to be raised inside the ray.tune code itself. If my tuning code is relevant, I can provide that as well.
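One way to rule out my own code would be to check whether any value I report to Tune is NaN or infinite, since the traceback shows the finiteness check failing on the array `X` passed to `GPy.models.GPRegression`, which is presumably built from reported results and hyperparameter values. Below is a minimal sketch of such a check, assuming the trainable reports a flat dict of numeric metrics. The helper `find_non_finite` and the example dict are hypothetical and not part of ray.tune:

```python
import math


def find_non_finite(result):
    """Return the keys of a metrics dict whose values are NaN or infinite."""
    bad_keys = []
    for key, value in result.items():
        # Only numeric entries are checked; strings and nested values are skipped.
        if isinstance(value, (int, float)) and not math.isfinite(value):
            bad_keys.append(key)
    return bad_keys


# Hypothetical result dict, standing in for what a trainable might report.
example = {"episode_reward_mean": float("nan"), "lr": 1e-3, "steps": 20}
print(find_non_finite(example))
```

Calling this on each result just before it is reported would show whether a non-finite metric comes from the training code or is introduced later by the scheduler.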

Any help would be greatly appreciated.
