Hyperparameter optimization (HPO) remains one of the most critical and computationally demanding challenges in deploying machine learning models effectively. Unlike model parameters that are learned during training, hyperparameters govern the learning process itself and must be configured prior to training, making their selection both consequential and non-trivial. This review article provides a rigorous comparative analysis of established and emerging HPO techniques, spanning traditional exhaustive methods, probabilistic Bayesian approaches, evolutionary and metaheuristic strategies, gradient-based methods, reinforcement learning-driven search, and neural architecture search frameworks. The paper critically evaluates each family of methods with respect to search efficiency, computational cost, convergence behavior, scalability, and suitability across diverse model classes including support vector machines, random forests, gradient-boosted trees, convolutional neural networks, and recurrent architectures. A structured taxonomy is presented to organize the rapidly growing HPO landscape, accompanied by mathematical formulations of core optimization objectives, acquisition functions, and fitness criteria. Three comparative tables synthesize the literature and provide a side-by-side assessment of method capabilities. The review further addresses challenges unique to high-dimensional search spaces, noisy evaluation functions, and distributed optimization settings. Contemporary trends including AutoML pipelines, federated HPO, green AI, and HPO for large language models are analyzed for their trajectory and open research problems. The findings reveal that no single HPO method universally dominates across all settings; rather, method selection should be guided by problem dimensionality, evaluation budget, and the degree of prior knowledge available. 
Future research directions emphasizing warm-starting, meta-learning, and explainable HPO are identified as essential for advancing the field.
The performance of machine learning (ML) models is profoundly sensitive to hyperparameters: configuration choices that lie outside the scope of gradient-based learning but directly shape its outcome. The learning rate in a neural network, the number of trees in a random forest, the regularization coefficient of a support vector machine, and the depth of a decision tree are all hyperparameters that must be determined before training commences. Even marginal differences in these settings can translate into significant discrepancies in generalization accuracy, training stability, and inference efficiency [1].
Historically, hyperparameter selection was treated as an art form, largely dependent on practitioner intuition, domain expertise, and manual trial-and-error. The exponential growth in model complexity, particularly with deep neural networks, has rendered such manual workflows untenable. Modern architectures may expose tens to hundreds of hyperparameters, and the interaction effects among these parameters create highly non-linear, non-convex, and often discontinuous objective landscapes [2]. Automated hyperparameter optimization has thus evolved from a convenience into a necessity for scalable, reproducible machine learning.
The field of HPO has matured substantially over the past decade. Early systematic approaches (grid search and random search) gave way to probabilistic surrogate models under the Bayesian optimization paradigm [3]. Concurrently, biologically inspired evolutionary algorithms and swarm intelligence methods offered derivative-free alternatives suited to mixed search spaces. More recently, gradient-based meta-learning approaches, reinforcement learning controllers, and differentiable architecture search methods have extended HPO to the domain of neural architecture design itself [4].
Despite this proliferation of methods, the literature lacks a unified comparative framework that situates techniques relative to one another across a consistent set of evaluation criteria. Existing surveys tend to focus on specific subdomains, such as Bayesian optimization [5], evolutionary computation [6], or AutoML [7], without providing the cross-paradigm synthesis that practitioners require to make informed method selection decisions.
This review addresses that gap. We present a comprehensive, critical comparison of HPO techniques organized along a taxonomy that spans the full methodological spectrum. We evaluate methods not merely in terms of reported benchmark performance but with respect to structural properties: exploration-exploitation balance, scalability with respect to dimensionality and evaluation cost, robustness to noisy objectives, and compatibility with distributed computing environments. In doing so, we aim to provide both researchers and practitioners with a principled foundation for navigating the crowded HPO landscape.
The remainder of the paper is organized as follows. Section 2 provides background on ML models and the importance of HPO. Section 3 reviews the relevant literature. Section 4 presents a taxonomy of HPO methods. Sections 5 through 13 discuss specific method families in detail. Section 14 offers a comparative analysis, and Sections 15 through 18 address complexity, challenges, and future directions before concluding in Section 19.
Modern ML models span a wide spectrum of architectures, each characterized by its own hyperparameter space. Support Vector Machines (SVMs) are governed by the regularization constant $C$, the kernel type (linear, RBF, polynomial), and kernel-specific parameters such as $\gamma$ in the RBF kernel. Random Forests require configuration of the number of trees, maximum tree depth, minimum samples per leaf, and feature subsampling ratio [8]. XGBoost and gradient-boosted trees introduce additional complexity: learning rate, subsample ratio, column sampling parameters, and tree regularization coefficients all interact in nuanced ways [9].
Deep Neural Networks (DNNs) expose the largest and most challenging hyperparameter spaces. The learning rate, batch size, optimizer choice, weight initialization strategy, dropout rates, number of layers, and layer width each affect training dynamics, and their effects are neither independent nor monotonic. Convolutional Neural Networks (CNNs) add architectural hyperparameters: kernel sizes, stride, padding, pooling strategies, and the number of filters per layer. Long Short-Term Memory (LSTM) networks require tuning of hidden state dimensionality, sequence length, gradient clipping thresholds, and forget gate biases, often compounded by sensitivity to learning rate schedules [10].
The importance of systematic HPO can be understood through the lens of the generalization gap. A model trained with suboptimal hyperparameters may exhibit significantly higher variance or bias than the same architecture with well-tuned settings, irrespective of the volume of training data or sophistication of the architecture [11]. Empirical studies have consistently demonstrated that HPO can recover several percentage points of classification accuracy that naive defaults leave on the table [1]. In high-stakes applications (medical diagnosis, financial forecasting, autonomous systems), this margin is operationally significant.
The HPO problem is structurally adversarial to standard optimization methods. The objective function (validation performance as a function of hyperparameters) is black-box (no gradient information), expensive to evaluate (each evaluation requires a full model training run), noisy (due to stochastic optimization and data splits), and non-convex with multiple local optima [12]. The search space is frequently mixed-type, combining continuous variables (learning rate), integers (number of layers), and categorical choices (activation function), with conditional dependencies (dropout rate is only relevant if dropout is enabled). High dimensionality further exacerbates the challenge: in deep learning, it is not uncommon to encounter 20–50 meaningful hyperparameters, making the search space astronomically large [13].
The formal study of automated HPO was catalyzed by the work of Bergstra and Bengio [14], who demonstrated that random search outperforms grid search for most practical HPO problems due to its superior coverage of the effective dimensions of the search space. This foundational insight displaced the naive exhaustive paradigm and motivated the development of more sophisticated adaptive methods.
Bayesian optimization with Gaussian Process surrogates was systematically applied to HPO by Snoek et al. [3], whose Spearmint framework set the standard for probabilistic HPO and introduced the use of Expected Improvement as an acquisition function. Subsequent work refined the acquisition function landscape: Hernández-Lobato et al. [15] introduced Predictive Entropy Search, and Wang and Jegelka [49] proposed Max-value Entropy Search (MES) with improved scalability.
The limitations of Gaussian Processes, particularly their cubic computational complexity in the number of observations, motivated tree-structured surrogate models. Bergstra et al. [17] introduced the Tree-structured Parzen Estimator (TPE), which models $p(\lambda \mid y)$ and $p(y)$ separately (with $\lambda$ a hyperparameter configuration and $y$ its observed loss) and scales more gracefully to high-dimensional and conditional search spaces. TPE forms the backbone of the widely used Hyperopt library and has demonstrated strong empirical performance on deep learning benchmarks [18].
Evolutionary approaches to HPO have a longer history, with early genetic algorithm applications to neural network weight and architecture optimization reviewed by Yao [19]. More recently, CMA-ES (Covariance Matrix Adaptation Evolution Strategy) has been applied to HPO by Loshchilov and Hutter [20], demonstrating competitive performance in continuous spaces. Differential Evolution and Particle Swarm Optimization have been applied to SVM and ensemble method tuning with promising results [21].
Population-Based Training (PBT), introduced by Jaderberg et al. [22], represented a paradigm shift by simultaneously training and optimizing hyperparameters across a population of models, allowing schedules, particularly learning rate schedules, to be discovered dynamically rather than fixed at initialization. This approach has proven especially impactful for reinforcement learning and large-scale vision models.
Neural Architecture Search (NAS) extended HPO from scalar hyperparameters to the architecture itself. The seminal NAS work of Zoph and Le [23] used reinforcement learning to search over architecture descriptions, achieving state-of-the-art accuracy at extraordinary computational cost. Subsequent efficiency improvements (ENAS [24], DARTS [25], and One-Shot NAS [26]) reduced the search cost by orders of magnitude through weight sharing and differentiable relaxation of the architecture choice.
The AutoML paradigm formalized the pipeline from raw data to deployed model as a unified optimization problem. Auto-WEKA [27], Auto-sklearn [28], and H2O AutoML integrate HPO with algorithm selection under the CASH (Combined Algorithm Selection and Hyperparameter optimization) framework. Recent work has further extended AutoML to handle neural architecture design, data preprocessing, and feature engineering jointly [29].
Multi-objective HPO, where accuracy is jointly optimized with model size, inference latency, or energy consumption, has received growing attention [30]. Pareto-front exploration under constrained budgets, combined with surrogate modeling, has emerged as a principled framework for resource-aware model deployment [31].
The HPO landscape can be organized into a hierarchical taxonomy based on the nature of the search strategy and the information used to guide successive evaluations. Figure 1 presents this taxonomy.
In addition to traditional, Bayesian, evolutionary, and learning-based approaches, recent work has introduced diversity-aware HPO methods that explicitly aim to improve coverage of the search space. ART-HPO is one such method, adapting adaptive random testing to reduce clustering among evaluated hyperparameter configurations [16]. This category is especially relevant when the evaluation budget is limited and broad exploration of the search space is desired.
Figure 1: Hierarchical taxonomy of HPO methods, organized by search paradigm and information model.
Figure 2 presents the general HPO workflow, where the optimizer repeatedly proposes candidate hyperparameters, evaluates model performance, and updates its search strategy until a stopping criterion is met, after which the best configuration is returned.
Figure 2: General HPO workflow applicable to all optimizer types
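The HPO problem can be formally stated as follows. Let $\Lambda$ denote the hyperparameter search space, a product space of continuous, integer, and categorical domains:
$$\Lambda = \Lambda_1 \times \Lambda_2 \times \cdots \times \Lambda_d,$$
where $d$ is the number of hyperparameters. The goal is to find:
$$\lambda^* = \arg\min_{\lambda \in \Lambda} \mathcal{L}\big(\mathcal{M}_{\lambda}(D_{\text{train}}),\, D_{\text{val}}\big),$$
where $\mathcal{M}_{\lambda}(D_{\text{train}})$ denotes a model trained with hyperparameters $\lambda$ on training set $D_{\text{train}}$, and $\mathcal{L}$ is the validation loss evaluated on $D_{\text{val}}$. The evaluation function $f(\lambda) = \mathcal{L}\big(\mathcal{M}_{\lambda}(D_{\text{train}}), D_{\text{val}}\big)$ is treated as a black-box oracle, as no closed-form gradient with respect to $\lambda$ is generally available.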
Grid Search (GS) discretizes each hyperparameter dimension into a finite set of candidate values and evaluates all combinations. For $d$ hyperparameters each with $k$ candidate values, GS requires $k^d$ evaluations, an exponential growth that renders it computationally infeasible as $d$ or $k$ grows [14]. Despite its simplicity and exhaustiveness within the discretized space, GS suffers from the curse of dimensionality: for a 10-dimensional space with 5 values per dimension, $5^{10} \approx 9.8 \times 10^{6}$ evaluations are required.
The practical use of GS is therefore restricted to low-dimensional problems, coarse initial exploration, or final refinement around a known good region. Its advantage lies in full reproducibility and ease of parallelization, as evaluations are independent. However, GS wastes resources on unimportant dimensions: if only 2 of 10 hyperparameters materially affect performance, GS still exhaustively evaluates all combinations of the remaining 8 [14].
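The exponential growth and the wasted effort on unimportant dimensions can be made concrete with a minimal sketch. The function names, the toy loss, and the grid values below are illustrative assumptions, not taken from any cited work.

```python
import itertools

def grid_size(values_per_dim):
    """Number of configurations a full grid must evaluate."""
    total = 1
    for k in values_per_dim:
        total *= k
    return total

def grid_search(objective, grid):
    """Evaluate every combination in `grid` (dict: name -> list of values)."""
    names = list(grid)
    best_cfg, best_loss = None, float("inf")
    for combo in itertools.product(*(grid[n] for n in names)):
        cfg = dict(zip(names, combo))
        loss = objective(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss

# Hypothetical toy loss: only `lr` matters, `momentum` is irrelevant,
# yet the grid still spends 3x the evaluations covering it.
def toy_loss(cfg):
    return (cfg["lr"] - 0.1) ** 2

toy_grid = {"lr": [0.001, 0.01, 0.1, 1.0], "momentum": [0.0, 0.5, 0.9]}
best_cfg, best_loss = grid_search(toy_loss, toy_grid)
ten_dim_cost = grid_size([5] * 10)  # 5 values in each of 10 dimensions
```

Here the 12 grid evaluations test only 4 distinct learning rates, which is the inefficiency that motivates random search.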
Bergstra and Bengio [14] demonstrated analytically and empirically that Random Search (RS) is more efficient than GS when only a subset of hyperparameters is important. By independently and uniformly sampling from continuous distributions rather than a discrete grid, RS covers the effective dimensions more densely per evaluation. For a budget of $n$ evaluations in a $d$-dimensional space, the best value found along any single dimension follows the distribution of the maximum of $n$ uniform samples, whereas GS covers only $n^{1/d}$ distinct values per dimension.
RS remains a competitive baseline against which more sophisticated methods must justify their additional complexity. It is trivially parallelizable, requires no surrogate model, and carries no risk of surrogate model misspecification. Its primary limitation is the absence of adaptive refinement: RS does not learn from previous evaluations, making it inefficient for problems requiring more than a few dozen evaluations.
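A minimal random search sketch follows, assuming a box-constrained continuous space and a toy loss; all names are hypothetical. It records the history so that per-dimension coverage can be inspected.

```python
import random

def random_search(objective, space, n_trials, seed=0):
    """Sample each hyperparameter independently and uniformly from its range."""
    rng = random.Random(seed)
    best_cfg, best_loss = None, float("inf")
    history = []
    for _ in range(n_trials):
        cfg = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        loss = objective(cfg)
        history.append((cfg, loss))
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss, history

# Hypothetical space and loss: only `lr` affects the result.
space = {"lr": (0.0, 1.0), "momentum": (0.0, 1.0)}
best_cfg, best_loss, history = random_search(
    lambda cfg: (cfg["lr"] - 0.1) ** 2, space, n_trials=16
)
distinct_lr_values = len({cfg["lr"] for cfg, _ in history})
```

With 16 trials, random search tests 16 distinct learning rates, whereas a 4 x 4 grid with the same budget would test only 4.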
A related limitation of Random Search is that independently sampled configurations may cluster in some regions while leaving other parts of the hyperparameter space unexplored. ART-HPO addresses this issue by adapting adaptive random testing principles to hyperparameter optimization, encouraging newly selected configurations to be more spatially diverse from previously evaluated ones [16]. In this sense, ART-HPO can be viewed as a diversity-aware extension of random search rather than a surrogate-based optimizer.
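The diversity idea can be illustrated with the fixed-size-candidate-set rule from adaptive random testing: draw several random candidates and keep the one farthest from everything evaluated so far. This is a sketch of that general principle under assumed names and defaults, not the ART-HPO procedure of [16], whose details may differ.

```python
import math
import random

def art_sample(evaluated, bounds, n_candidates=10, rng=random):
    """Pick the random candidate whose nearest evaluated point is farthest away."""
    candidates = [
        [rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_candidates)
    ]
    if not evaluated:
        return candidates[0]  # nothing to be diverse from yet

    def distance_to_nearest(candidate):
        return min(math.dist(candidate, point) for point in evaluated)

    return max(candidates, key=distance_to_nearest)

# Usage: build a spread-out set of 20 configurations in the unit square.
rng = random.Random(0)
bounds = [(0.0, 1.0), (0.0, 1.0)]
points = []
for _ in range(20):
    points.append(art_sample(points, bounds, rng=rng))
```

The extra cost is a distance computation per candidate, which is negligible next to a model training run.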
Bayesian Optimization (BO) is the dominant paradigm for sample-efficient HPO. It constructs a probabilistic surrogate model of the objective function and uses an acquisition function to determine the next candidate configuration to evaluate by trading off exploration of uncertain regions against exploitation of known good regions [5].
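The standard BO framework models the objective as a sample from a Gaussian Process:
$$f(\lambda) \sim \mathcal{GP}\big(m(\lambda),\, k(\lambda, \lambda')\big),$$
where $m(\lambda)$ is the mean function (often set to zero) and $k(\lambda, \lambda')$ is a covariance kernel, typically the Matérn 5/2 kernel:
$$k_{5/2}(\lambda, \lambda') = \sigma_f^2 \left(1 + \frac{\sqrt{5}\, r}{\ell} + \frac{5 r^2}{3 \ell^2}\right) \exp\!\left(-\frac{\sqrt{5}\, r}{\ell}\right), \qquad r = \lVert \lambda - \lambda' \rVert,$$
with signal variance $\sigma_f^2$ and length scale $\ell$. After observing $n$ evaluations $\mathcal{D}_n = \{(\lambda_i, y_i)\}_{i=1}^{n}$, the posterior predictive distribution at a new point $\lambda$ is Gaussian with mean $\mu_n(\lambda)$ and variance $\sigma_n^2(\lambda)$ obtained via standard GP conditioning formulae.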
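The Expected Improvement (EI) acquisition function, introduced for HPO by Snoek et al. [3], is defined as:
$$\mathrm{EI}(\lambda) = \mathbb{E}\big[\max\big(f_{\text{best}} - f(\lambda),\, 0\big)\big],$$
where $f_{\text{best}}$ is the best (lowest) observed value. Under the GP posterior with mean $\mu(\lambda)$ and standard deviation $\sigma(\lambda)$, this has a closed form:
$$\mathrm{EI}(\lambda) = \big(f_{\text{best}} - \mu(\lambda)\big)\, \Phi(z) + \sigma(\lambda)\, \phi(z), \qquad z = \frac{f_{\text{best}} - \mu(\lambda)}{\sigma(\lambda)},$$
where $\Phi$ and $\phi$ denote the standard normal CDF and PDF, respectively.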
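The Upper Confidence Bound (UCB) acquisition:
$$\mathrm{UCB}(\lambda) = \mu(\lambda) + \kappa\, \sigma(\lambda),$$
with $\kappa > 0$ controlling the exploration-exploitation trade-off. This form applies when the objective is maximized (e.g., validation accuracy); when a loss is minimized, the analogous rule selects the minimizer of the lower confidence bound $\mu(\lambda) - \kappa\, \sigma(\lambda)$. Max-value Entropy Search (MES) [49] selects points that maximally reduce uncertainty about the unknown optimal value $f^{*}$:
$$\alpha_{\mathrm{MES}}(\lambda) = H\big[p(y \mid \mathcal{D}_n, \lambda)\big] - \mathbb{E}_{f^{*}}\Big[H\big[p(y \mid \mathcal{D}_n, \lambda, f^{*})\big]\Big],$$
where $H[\cdot]$ denotes differential entropy and $\mathcal{D}_n$ the observations collected so far.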
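The closed-form EI and the confidence-bound rule are cheap to evaluate once a surrogate supplies a posterior mean and standard deviation. The sketch below assumes a minimization setting and takes those two quantities as plain numbers; the function and argument names (`mu`, `sigma`, `f_best`, `kappa`) are illustrative.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for minimization, given the posterior mean and std."""
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)  # no uncertainty left at this point
    z = (f_best - mu) / sigma
    return (f_best - mu) * normal_cdf(z) + sigma * normal_pdf(z)

def lower_confidence_bound(mu, sigma, kappa=2.0):
    """Minimization counterpart of UCB: smaller values are more promising."""
    return mu - kappa * sigma

# A point predicted no better than the incumbent still has positive EI
# as long as the surrogate is uncertain about it.
ei_uncertain = expected_improvement(mu=0.0, sigma=1.0, f_best=0.0)
ei_certain = expected_improvement(mu=0.0, sigma=0.0, f_best=0.0)
```

The contrast between `ei_uncertain` and `ei_certain` shows how the `sigma` term rewards exploration of poorly understood regions.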
The TPE, introduced by Bergstra et al. [17], bypasses the GP entirely by modeling the density of hyperparameter configurations as:
$$p(\lambda \mid y) = \begin{cases} \ell(\lambda) & \text{if } y < y^{*}, \\ g(\lambda) & \text{if } y \ge y^{*}, \end{cases}$$
where $\ell(\lambda)$ is the density of “good” configurations, $g(\lambda)$ is the density of “bad” configurations, and $y^{*}$ is the $\gamma$-quantile of observed losses. TPE then selects the $\lambda$ that maximizes the ratio $\ell(\lambda)/g(\lambda)$, in which EI is monotonically increasing. TPE natively handles conditional search spaces through its tree-structured factorization, making it particularly effective for neural architecture HPO [18].
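The good/bad density split can be sketched for a single continuous hyperparameter. This is a simplified illustration with fixed-bandwidth Gaussian kernels and assumed defaults; it is not Hyperopt's implementation, which uses adaptive Parzen estimators and handles tree-structured spaces.

```python
import math
import random

def gaussian_kde(points, bandwidth):
    """Return a 1-D Gaussian kernel density estimate over `points`."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

    return density

def tpe_suggest(history, gamma=0.25, n_candidates=24, bandwidth=0.1, rng=random):
    """Suggest the next value from (config, loss) pairs; needs at least 2 pairs."""
    ordered = sorted(history, key=lambda pair: pair[1])
    # Split at the gamma-quantile of losses, keeping both groups non-empty.
    n_good = min(len(ordered) - 1, max(1, math.ceil(gamma * len(ordered))))
    good = [x for x, _ in ordered[:n_good]]
    bad = [x for x, _ in ordered[n_good:]]
    good_density = gaussian_kde(good, bandwidth)
    bad_density = gaussian_kde(bad, bandwidth)
    # Draw candidates from the "good" density, then rank by the density ratio.
    candidates = [rng.gauss(rng.choice(good), bandwidth) for _ in range(n_candidates)]
    return max(candidates, key=lambda x: good_density(x) / (bad_density(x) + 1e-12))

# Hypothetical history: 11 evenly spaced trials of a loss minimized at 0.3.
history = [(i / 10, (i / 10 - 0.3) ** 2) for i in range(11)]
suggestion = tpe_suggest(history, rng=random.Random(0))
```

The suggestion lands near the best observed trials and away from the poor ones, without fitting any model of the loss itself.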
Sequential Model-based Algorithm Configuration (SMAC) [32] employs a Random Forest surrogate instead of a GP or TPE model, making it scalable to high-dimensional categorical and mixed-type spaces. SMAC has demonstrated strong performance on algorithm configuration benchmarks and is widely used in the AutoML literature.
Genetic Algorithms (GA) encode hyperparameter configurations as chromosomes and apply selection, crossover, and mutation operators to evolve a population over generations. The fitness function is the validation performance, which for a loss-minimization objective can be written as:
$$F(\lambda) = -\,\mathcal{L}\big(\mathcal{M}_{\lambda}(D_{\text{train}}),\, D_{\text{val}}\big),$$
where $\lambda$ is the hyperparameter configuration encoded by the chromosome and $\mathcal{L}$ is the validation loss of the model trained with it.
GAs are well-suited for mixed discrete-continuous search spaces and impose no assumptions on the objective landscape’s smoothness. The primary limitation is the large number of objective evaluations required to maintain a sufficiently diverse population, typically $P \times G$, where $P$ is the population size and $G$ is the number of generations [19]. Population-Based Training [22] can be viewed as a modern parallel GA variant that incorporates Lamarckian evolution (inheriting trained weights) for significant efficiency gains.
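A compact GA over real-valued chromosomes illustrates the operators and the evaluation count. The choices below (tournament selection, uniform crossover, Gaussian mutation, one elite per generation) are assumptions made for the sketch; many other operator combinations are in use.

```python
import random

def genetic_search(fitness, bounds, pop_size=20, generations=30,
                   mutation_rate=0.2, seed=0):
    """Maximize `fitness`; returns (best individual, best fitness, evaluations)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    best, best_fit, n_evals = None, float("-inf"), 0

    def tournament(scores, k=3):
        picks = rng.sample(range(pop_size), k)
        return pop[max(picks, key=lambda i: scores[i])]

    for gen in range(generations):
        scores = [fitness(ind) for ind in pop]
        n_evals += pop_size
        top = max(range(pop_size), key=lambda i: scores[i])
        if scores[top] > best_fit:
            best, best_fit = pop[top][:], scores[top]
        if gen == generations - 1:
            break  # do not breed children that would never be evaluated
        children = [pop[top][:]]  # elitism: carry the best individual over
        while len(children) < pop_size:
            a, b = tournament(scores), tournament(scores)
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            for d, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    step = rng.gauss(0.0, 0.1 * (hi - lo))
                    child[d] = min(hi, max(lo, child[d] + step))
            children.append(child)
        pop = children
    return best, best_fit, n_evals

# Hypothetical fitness: negative squared distance to (0.3, 0.7).
best, best_fit, n_evals = genetic_search(
    lambda v: -((v[0] - 0.3) ** 2 + (v[1] - 0.7) ** 2), [(0.0, 1.0), (0.0, 1.0)]
)
```

With the defaults, the run consumes exactly 20 x 30 = 600 fitness evaluations, each of which would be a full training run in a real HPO setting.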
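Particle Swarm Optimization (PSO) maintains a swarm of $N$ particles, each with position $x_i$ and velocity $v_i$. The update rule is:
$$v_i^{t+1} = \omega\, v_i^{t} + c_1 r_1 \big(p_i - x_i^{t}\big) + c_2 r_2 \big(g - x_i^{t}\big), \qquad x_i^{t+1} = x_i^{t} + v_i^{t+1},$$
where $p_i$ is the particle’s personal best, $g$ is the global best, $\omega$ is the inertia weight, $c_1$ and $c_2$ are cognitive and social coefficients, and $r_1, r_2 \sim U(0, 1)$ are random factors drawn at each update. PSO has been applied to SVM hyperparameter tuning [21] and CNN architecture search, demonstrating competitive accuracy with lower computational overhead than GAs owing to its simple velocity-update mechanism.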
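The velocity and position updates translate almost line for line into code. The sketch below minimizes a toy loss inside box bounds; the coefficient defaults (`w=0.7`, `c1=c2=1.5`) and the clamping of positions to the bounds are assumptions chosen for illustration.

```python
import random

def pso_minimize(loss, bounds, n_particles=15, n_iters=40,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `loss` with a basic global-best particle swarm."""
    rng = random.Random(seed)
    dim = len(bounds)
    x = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p_best = [xi[:] for xi in x]
    p_loss = [loss(xi) for xi in x]
    g_idx = min(range(n_particles), key=lambda i: p_loss[i])
    g_best, g_loss = p_best[g_idx][:], p_loss[g_idx]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # inertia + pull toward personal best + pull toward global best
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (p_best[i][d] - x[i][d])
                           + c2 * r2 * (g_best[d] - x[i][d]))
                x[i][d] = min(hi, max(lo, x[i][d] + v[i][d]))
            current = loss(x[i])
            if current < p_loss[i]:
                p_best[i], p_loss[i] = x[i][:], current
                if current < g_loss:
                    g_best, g_loss = x[i][:], current
    return g_best, g_loss

# Hypothetical loss with its minimum at (0.3, 0.7).
g_best, g_loss = pso_minimize(
    lambda p: (p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2, [(0.0, 1.0), (0.0, 1.0)]
)
```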
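Differential Evolution (DE) generates trial vectors by adding scaled difference vectors to population members:
$$v_i = x_{r_1} + F \cdot \big(x_{r_2} - x_{r_3}\big),$$
where $x_{r_1}, x_{r_2}, x_{r_3}$ are three distinct, randomly chosen population members and $F$ is the scale factor, followed by crossover with the target vector $x_i$ at crossover rate $CR$. DE is particularly effective for continuous hyperparameter spaces with strong inter-parameter correlations and has been applied to XGBoost and deep learning HPO with favorable convergence properties [33].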
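A sketch of the common DE/rand/1/bin variant shows the mutation, binomial crossover, and greedy selection steps. The defaults for the scale factor `F` and crossover rate `CR` are conventional starting values assumed here, not recommendations from [33].

```python
import random

def differential_evolution(loss, bounds, pop_size=15, generations=40,
                           F=0.8, CR=0.9, seed=0):
    """Minimize `loss` with DE/rand/1/bin inside box bounds."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    losses = [loss(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            others = [j for j in range(pop_size) if j != i]
            r1, r2, r3 = rng.sample(others, 3)
            j_rand = rng.randrange(dim)  # guarantees one mutated coordinate
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if rng.random() < CR or d == j_rand:
                    value = pop[r1][d] + F * (pop[r2][d] - pop[r3][d])
                    value = min(hi, max(lo, value))
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_loss = loss(trial)
            if trial_loss <= losses[i]:  # greedy one-to-one replacement
                pop[i], losses[i] = trial, trial_loss
    best = min(range(pop_size), key=lambda i: losses[i])
    return pop[best], losses[best]

# Hypothetical loss with its minimum at (0.3, 0.7).
de_best, de_loss = differential_evolution(
    lambda p: (p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2, [(0.0, 1.0), (0.0, 1.0)]
)
```

Because each trial vector replaces only its own target, the population never gets worse, at the price of one evaluation per member per generation.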
Ant Colony Optimization (ACO) builds solutions probabilistically on a discretized pheromone graph. Its extension to continuous domains (ACOR) deposits pheromone on Gaussian kernels centered at previous solutions. ACO has found application in feature selection combined with HPO, and in configuring decision tree ensembles, where the discrete nature of tree parameters aligns well with ACO’s graph-based formulation [34].
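DARTS [25] reformulates the discrete architecture search problem as a continuous optimization by introducing architecture parameters $\alpha$ that represent mixture weights over candidate operations:
$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\big(\alpha_o^{(i,j)}\big)}{\sum_{o' \in \mathcal{O}} \exp\big(\alpha_{o'}^{(i,j)}\big)}\, o(x),$$
where $\mathcal{O}$ is the set of candidate operations on the edge between nodes $i$ and $j$. The architecture $\alpha$ and model weights $w$ are jointly optimized via bilevel optimization:
$$\min_{\alpha}\; \mathcal{L}_{\text{val}}\big(w^{*}(\alpha),\, \alpha\big) \quad \text{s.t.} \quad w^{*}(\alpha) = \arg\min_{w}\; \mathcal{L}_{\text{train}}(w, \alpha).$$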
This allows gradient descent over architecture parameters, reducing NAS costs from thousands of GPU days to a few GPU days. However, DARTS is known to collapse toward parameter-free operations (skip connections) under certain conditions, a problem addressed by subsequent variants [35].
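The continuous relaxation itself is small: each edge outputs a softmax-weighted sum of its candidate operations, and the final architecture keeps the operation with the largest architecture parameter. The scalar toy operations below are hypothetical stand-ins for convolutions, pooling, and skip connections, and the bilevel training loop is omitted.

```python
import math

def softmax(alphas):
    shift = max(alphas)  # subtract the max for numerical stability
    exps = [math.exp(a - shift) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, ops, alphas):
    """Softmax-weighted mixture of candidate operations on one edge."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))

def discretize(op_names, alphas):
    """After search, keep only the operation with the largest alpha."""
    return op_names[max(range(len(alphas)), key=lambda i: alphas[i])]

# Hypothetical scalar operations standing in for real network layers.
op_names = ["skip", "zero", "double"]
ops = [lambda x: x, lambda x: 0.0, lambda x: 2.0 * x]

uniform_output = mixed_op(2.0, ops, [0.0, 0.0, 0.0])  # equal mixture weights
chosen = discretize(op_names, [0.1, -1.0, 0.8])
```

Because `mixed_op` is differentiable in `alphas`, an autodiff framework can update them by gradient descent, which is what makes the search cheap.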
Gradient-based HPO methods such as DrMAD and T1-T2 optimization [36] compute approximate gradients of validation loss with respect to hyperparameters by differentiating through the training process using implicit differentiation or truncated backpropagation. These methods are efficient for continuous hyperparameters but require careful treatment of numerical stability and are difficult to apply to discrete or categorical hyperparameter spaces.
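RL-based HPO frames the search as a sequential decision-making problem. The NAS-RL framework of Zoph and Le [23] trains an RNN controller that outputs architecture tokens; each sampled architecture is trained to convergence, and its validation accuracy is used as a reward signal for REINFORCE:
$$\nabla_{\theta} J(\theta) \approx \frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T} \nabla_{\theta} \log P\big(a_t \mid a_{1:t-1};\, \theta\big)\, \big(R_k - b\big),$$
where $\theta$ are the controller parameters, $a_t$ is the $t$-th architecture token, $m$ is the number of architectures sampled per update, $R_k$ is the validation accuracy of the $k$-th sampled architecture, and $b$ is a baseline that reduces gradient variance.
This formulation is general but notorious for requiring tens of thousands of GPU hours for convergence. More efficient variants such as ENAS [24], which shares weights across architectures, reduced the cost by approximately $1000\times$.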
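The policy-gradient update can be sketched with a controller reduced to a single categorical choice. This is a deliberately minimal assumption: the softmax policy over a few options replaces the RNN controller, and a lookup table of rewards replaces training each sampled architecture.

```python
import math
import random

def softmax(logits):
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_search(reward_fn, n_choices, steps=2000, lr=0.2, seed=0):
    """REINFORCE with a moving-average baseline over one categorical decision."""
    rng = random.Random(seed)
    theta = [0.0] * n_choices  # policy logits
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(n_choices), weights=probs)[0]
        reward = reward_fn(action)
        advantage = reward - baseline
        # For a softmax policy: d log pi(action) / d theta_k = 1[k == action] - probs[k]
        for k in range(n_choices):
            indicator = 1.0 if k == action else 0.0
            theta[k] += lr * advantage * (indicator - probs[k])
        baseline = 0.9 * baseline + 0.1 * reward
    return softmax(theta)

# Hypothetical validation accuracies for three candidate architectures.
accuracies = [0.2, 0.9, 0.5]
final_probs = reinforce_search(lambda a: accuracies[a], n_choices=3)
preferred = max(range(3), key=lambda k: final_probs[k])
```

Each call to `reward_fn` corresponds to a full training run in NAS-RL, which is why the sample count dominates the cost.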
Population-Based Training (PBT) [22] frames HPO as an online learning problem: a population of agents trains simultaneously, and those performing poorly are periodically replaced by perturbed copies of high-performing agents. PBT can discover hyperparameter schedules (not just fixed values) and has been applied successfully to reinforcement learning and generative model training.
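The replacement step is often described as exploit (copy a strong member) followed by explore (perturb its hyperparameters). The sketch below assumes each member is a dictionary with `weights`, `lr`, and `score` fields and uses truncation selection with multiplicative perturbation; the field names and the 0.8/1.2 factors are illustrative.

```python
import random

def pbt_step(population, rng, bottom_frac=0.25, perturb=(0.8, 1.2)):
    """Replace the worst members with perturbed copies of the best ones."""
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    n_cut = max(1, int(len(ranked) * bottom_frac))
    top, bottom = ranked[:n_cut], ranked[-n_cut:]
    for member in bottom:
        parent = rng.choice(top)
        member["weights"] = list(parent["weights"])         # exploit
        member["lr"] = parent["lr"] * rng.choice(perturb)   # explore
        # member["score"] is stale until the member trains further.
    return population

# Hypothetical population of four partially trained models.
population = [
    {"weights": [0.1, 0.2], "lr": 0.010, "score": 0.91},
    {"weights": [0.3, 0.1], "lr": 0.100, "score": 0.62},
    {"weights": [0.0, 0.5], "lr": 0.001, "score": 0.75},
    {"weights": [0.9, 0.9], "lr": 1.000, "score": 0.30},
]
pbt_step(population, random.Random(0))
```

Calling `pbt_step` between training intervals lets a member's learning rate change over time, which is how schedules rather than fixed values emerge.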
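Real-world deployments frequently require simultaneous optimization of multiple, often conflicting objectives: accuracy versus inference latency, model size versus robustness, training cost versus generalization. Multi-objective HPO seeks the Pareto-optimal front:
$$\mathcal{P} = \big\{ \lambda \in \Lambda \;:\; \nexists\, \lambda' \in \Lambda \ \text{such that}\ f_i(\lambda') \le f_i(\lambda)\ \forall i \ \text{and}\ f_j(\lambda') < f_j(\lambda)\ \text{for some } j \big\},$$
where $\Lambda$ is the hyperparameter search space and $f_1, \dots, f_M$ are the objectives, each expressed so that lower is better.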
Multi-objective evolutionary algorithms such as NSGA-II and MOEA/D are naturally suited to this problem and have been combined with Bayesian surrogates in frameworks such as PESMO [37] and MOBO (Multi-Objective Bayesian Optimization). Hardware-aware NAS methods including MNasNet and Once-for-All apply multi-objective optimization to jointly minimize inference latency on target hardware alongside accuracy [38]. Energy-aware HPO has emerged as a response to the growing computational footprint of large model training, penalizing high-FLOPs configurations even when they achieve marginally superior accuracy [30].
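Extracting the non-dominated set from a batch of evaluated configurations takes only a few lines. The sketch assumes every objective is to be minimized; the (error, latency) pairs are made-up values for illustration.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` everywhere and strictly better somewhere."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(points):
    """Return the points not dominated by any other point (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (validation error, inference latency in ms) per configuration.
results = [(0.10, 50), (0.08, 120), (0.12, 40), (0.11, 60), (0.08, 130)]
front = pareto_front(results)
```

Here `(0.11, 60)` is dropped because `(0.10, 50)` is better on both objectives, and `(0.08, 130)` is dropped because `(0.08, 120)` matches its error at lower latency. The quadratic scan is adequate for the few hundred evaluations typical of HPO.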
NAS represents the logical extension of HPO to the complete architectural design problem. Early NAS work [23] required hundreds of GPUs running for weeks to discover competitive image classification architectures. The subsequent development of one-shot methods, wherein all candidate architectures share a single supernetwork with shared weights, dramatically reduced this cost. ENAS [24] achieved NAS in approximately 12 GPU hours, while DARTS [25] completed its search in a few GPU days on CIFAR-10.
Progressive NAS (PNAS) [39] employs a sequential model-based search that expands the architecture from simple to complex, using a learned predictor to rank candidate cells. Hardware-aware NAS has extended these methods to constrained deployment environments, discovering architectures specifically optimized for mobile, FPGA, and edge hardware.
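AutoML formalizes the Combined Algorithm Selection and Hyperparameter Optimization (CASH) problem:
$$A^{*}_{\lambda^{*}} \in \arg\min_{A^{(j)} \in \mathcal{A},\; \lambda \in \Lambda^{(j)}} \mathcal{L}\big(A^{(j)}_{\lambda},\, D_{\text{train}},\, D_{\text{val}}\big),$$
where $\mathcal{A} = \{A^{(1)}, \dots, A^{(K)}\}$ is the set of candidate algorithms, $\Lambda^{(j)}$ is the hyperparameter space of algorithm $A^{(j)}$, and $\mathcal{L}$ is the validation loss of algorithm $A^{(j)}$ configured with $\lambda$, trained on $D_{\text{train}}$ and evaluated on $D_{\text{val}}$.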
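The defining feature of CASH is that the algorithm choice is itself a top-level categorical hyperparameter, with each algorithm's own parameters conditional on it. The sketch below solves a toy instance by random search; the search space, the loss function, and all names are hypothetical, and integer-valued parameters are sampled as floats for brevity.

```python
import random

# Hypothetical per-algorithm spaces: algorithm -> {hyperparameter: (low, high)}.
CASH_SPACE = {
    "svm": {"C": (0.01, 100.0), "gamma": (1e-4, 1.0)},
    "random_forest": {"n_trees": (10.0, 500.0), "max_depth": (2.0, 20.0)},
}

def sample_configuration(space, rng):
    """Pick an algorithm, then sample only that algorithm's hyperparameters."""
    algorithm = rng.choice(sorted(space))
    params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space[algorithm].items()}
    return algorithm, params

def cash_random_search(loss_fn, space, n_trials, seed=0):
    rng = random.Random(seed)
    best = (None, None, float("inf"))
    for _ in range(n_trials):
        algorithm, params = sample_configuration(space, rng)
        loss = loss_fn(algorithm, params)
        if loss < best[2]:
            best = (algorithm, params, loss)
    return best

# Hypothetical validation loss standing in for training and evaluating a model.
def toy_loss(algorithm, params):
    if algorithm == "svm":
        return 0.2 + 0.001 * abs(params["C"] - 1.0)
    return 0.3

best_algorithm, best_params, best_loss = cash_random_search(toy_loss, CASH_SPACE, 50)
```

Systems such as Auto-sklearn replace the random sampler with a model-based optimizer over the same conditional structure.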
Auto-sklearn [28] solves the CASH problem using SMAC with meta-learning warm-starts and ensemble construction. It has achieved competitive performance across hundreds of benchmark datasets. Optuna [40] introduced define-by-run API semantics, enabling dynamic search spaces that adapt to the current trial’s results. Ray Tune [41] provides a distributed HPO framework with support for Hyperband, PBT, and BO backends. SMAC3 provides a mature implementation of Random Forest-based Bayesian optimization with strong theoretical guarantees. Hyperopt implements TPE and is widely deployed in industrial ML pipelines.
The ecosystem of HPO software has matured considerably. Table 1 synthesizes existing literature surveying these frameworks and their applications.
Table 1: Comparative Review of Existing Literature on HPO Techniques and Applications
| Reference | Year | Method | Model(s) Tuned | Dataset(s) | Key Contribution |
|---|---|---|---|---|---|
| Bergstra & Bengio [14] | 2012 | Random Search vs Grid Search | DNNs | MNIST, CIFAR | Showed RS outperforms GS for high-dim HPO |
| Snoek et al. [3] | 2012 | GP-based BO (Spearmint) | DNNs, SVMs | MNIST, DBN | Formal BO for HPO; introduced EI acquisition |
| Bergstra et al. [17] | 2013 | TPE (Hyperopt) | DNNs | Multiple | Tree-structured Parzen estimator for conditional spaces |
| Hutter et al. [32] | 2011 | SMAC | SAT solvers, SVMs | Algorithm benchmarks | RF surrogate for CASH; meta-learning warm-start |
| Zoph & Le [23] | 2017 | NAS-RL | CNN | CIFAR-10, PTB | RL-based NAS, 800 GPU days |
| Liu et al. [25] | 2019 | DARTS | CNN, RNN | CIFAR-10, PTB | Differentiable architecture search |
| Pham et al. [24] | 2018 | ENAS | CNN | CIFAR-10 | Weight sharing for 1000x speedup |
| Jaderberg et al. [22] | 2017 | PBT | DNNs | Multiple | Parallel online HPO with adaptation |
| Feurer et al. [28] | 2015 | Auto-sklearn | Multiple classifiers | OpenML | CASH + meta-learning + ensembling |
| Falkner et al. [42] | 2018 | BOHB | DNNs | Multiple | Bayesian BO + Hyperband multi-fidelity |
| Akiba et al. [40] | 2019 | Optuna | General ML | Multiple | Define-by-run API, pruning, TPE |
| Liaw et al. [41] | 2018 | Ray Tune | Deep learning | Multiple | Distributed HPO with Hyperband and PBT |
| White et al. [35] | 2021 | DARTS stabilization | CNNs | NAS benchmarks | Collapse prevention in differentiable NAS |
| Loshchilov & Hutter [20] | 2016 | CMA-ES for HPO | DNNs | CIFAR | Competitive evolution strategy for HPO |
| Real et al. [43] | 2019 | Aging Evolution | NAS | ImageNet | Tournament selection for scalable NAS |

To provide a systematic comparison, we evaluate HPO methods across six dimensions: (1) search efficiency (quality of configurations found per evaluation), (2) computational cost (wall-clock time and resources required), (3) convergence speed (evaluations to reach near-optimal performance), (4) scalability (behavior in high-dimensional spaces), (5) robustness (sensitivity to noise and initialization), and (6) exploration-exploitation balance (ability to avoid premature convergence).
Table 2: Comparison of HPO Methods Across Key Evaluation Criteria
| Method | Search Efficiency | Computational Cost | Convergence Speed | Scalability (High-dim) | Robustness | Exploration-Exploitation | Deep Learning Suitability |
|---|---|---|---|---|---|---|---|
| Grid Search | Very Low | Very High | Slow | Very Poor | High | Exploration only | Poor |
| Random Search | Low–Medium | Low | Moderate | Moderate | High | Exploration only | Moderate |
| Gaussian Process BO | High | Medium | Fast | Poor (d > 20) | High | Tunable (EI/UCB) | Moderate |
| TPE (Hyperopt) | High | Low–Medium | Fast | Good | Medium | EI-based | Good |
| SMAC (RF surrogate) | High | Medium | Fast | Good | Medium | EI-based | Good |
| Genetic Algorithm | Medium | High | Slow–Medium | Medium | Medium | Controllable | Moderate |
| PSO | Medium | Medium | Medium | Medium | Medium | Inertia-controlled | Moderate |
| Differential Evolution | Medium–High | Medium | Medium | Medium | High | F-controlled | Moderate |
| CMA-ES | High | Medium | Fast | Medium | High | Covariance-based | Good |
| DARTS | High | Low (GPU grad) | Very Fast | High | Low | Gradient bias | Excellent |
| NAS-RL | Very High | Very High | Very Slow | High | Low | Policy entropy | Excellent |
| PBT | High | Medium | Fast (online) | Medium | High | Population diversity | Excellent |
| BOHB | Very High | Medium | Fast | Good | High | BO + multi-fidelity | Excellent |
| Hyperband | Medium | Low | Fast | Good | High | Bandit arms | Very Good |
| ART-HPO | Medium–High | Low–Medium | Moderate | Good | High | Diversity-first exploration | Good |

SVM tuning involves a small number of continuous hyperparameters ($C$, $\gamma$, degree), making GP-based BO and TPE highly effective in fewer than 50 evaluations. Random search is competitive for binary kernels. GA-based methods have been applied to multi-kernel SVMs with success [21].
Random Forest and XGBoost tuning involve moderate-dimensional spaces (6–12 hyperparameters) with important categorical choices. SMAC and TPE excel here due to their handling of conditional and categorical variables. PBT has shown promise for XGBoost with learning rate warm-up schedules.
Deep Neural Networks present the most challenging HPO scenario. BOHB [42] is among the strongest general-purpose methods, combining the sample efficiency of Bayesian optimization with the computational savings of successive halving. PBT is preferred when training dynamics matter (e.g., cyclical learning rates). DARTS and ENAS are the methods of choice when the architecture itself is variable.
Table 3: Advantages, Limitations, and Practical Applications of Major HPO Techniques
| Technique / Framework | Core Algorithm | Key Advantages | Key Limitations | Best Applications |
|---|---|---|---|---|
| Grid Search | Exhaustive enumeration | Simple, reproducible, fully parallel | Exponential complexity, wasteful | Small search spaces, final grid refinement |
| Random Search | Uniform sampling | Simple, parallel, competitive baseline | No adaptation, sample-inefficient | First-pass exploration, high-dim spaces |
| Spearmint / GP-BO | GP + EI | Sample-efficient, principled uncertainty | GP cost, poor scaling | Low-dim, expensive evaluations (e.g., SVM) |
| Hyperopt / TPE | Tree-structured Parzen | Conditional spaces, scalable, fast | No explicit exploration guarantee | DNNs, NLP, conditional pipelines |
| SMAC3 | RF surrogate + EI | Mixed types, robust, strong CASH support | Higher overhead than TPE | Algorithm configuration, AutoML |
| Optuna | TPE + CMA-ES | Define-by-run, pruning, distributed | Newer, less theoretical backing | General ML, PyTorch, LightGBM |
| Ray Tune | Distributed BO/PBT | Scalable, multi-GPU, backend-agnostic | Complexity, infrastructure overhead | Large-scale deep learning |
| BOHB | BO + Hyperband | Best of BO + multi-fidelity, robust | Requires fidelity proxy | DNNs, NAS, resource-constrained settings |
| Auto-sklearn | SMAC + meta-learning | End-to-end pipeline, ensemble output | Slow on new datasets without meta-data | Tabular data AutoML |
| NAS-RL (NASNet) | RL controller | State-of-the-art architectures | 800+ GPU days, impractical | Architecture discovery research |
| DARTS | Gradient-based NAS | Fast, differentiable, low cost | Operation collapse, instability | CNN/RNN architecture search |
| PBT | Evolutionary + RL | Online schedule learning, efficient | Non-reproducible, stochastic | Reinforcement learning, GAN training |
| PSO | Swarm intelligence | Derivative-free, conceptually simple | Premature convergence, many evaluations | Continuous HPO, ensemble tuning |
| Differential Evolution | Population + mutation | Robust, good for correlated parameters | Many evaluations, manual tuning of $F$, $CR$ | XGBoost, continuous spaces |

The computational complexity of HPO methods varies dramatically. Grid Search has complexity $O(k^d)$, where $k$ is the grid resolution per dimension and $d$ is the number of hyperparameters, which is clearly intractable for modern neural networks. Random Search has complexity $O(n)$, where $n$ is the evaluation budget, making it trivially scalable. GP-based BO incurs $O(n^3)$ cost per surrogate update due to GP regression, limiting exact GPs to at most a few thousand observations in practice; sparse GP approximations and scalable kernels extend this range but introduce approximation error [5].
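TPE relies on KDE-based density estimation, whose cost grows far more slowly with the number of observations than GP regression, making it tractable for thousands of trials. SMAC with Random Forest surrogates has roughly $O(T\, n \log n)$ training and $O(T \log n)$ prediction cost (where $T$ is the number of trees and $n$ the number of observations), providing a good balance of scalability and accuracy [32].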
Evolutionary methods scale as $O(P \cdot G \cdot c_{\text{eval}})$, where $P$ is the population size, $G$ is the number of generations, and $c_{\text{eval}}$ is the cost of a single evaluation. Their primary bottleneck is the large number of expensive function evaluations required to evolve effective populations. Multi-fidelity methods such as Hyperband and BOHB address evaluation cost directly by allocating minimal resources to poor-performing configurations, achieving significant total speedups at the cost of fidelity assumptions.
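The resource allocation behind Hyperband can be illustrated with successive halving, its inner loop: evaluate all configurations on a small budget, keep the best fraction, and multiply the budget for the survivors. The sketch assumes a hypothetical `evaluate(config, budget)` that returns a loss and whose low-budget ranking is informative about the final ranking, which is exactly the fidelity assumption mentioned above.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Keep the best 1/eta of configurations each round, growing the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    total_cost = 0
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        total_cost += budget * len(survivors)
        survivors = ranked[:max(1, len(survivors) // eta)]
        budget *= eta
    return survivors[0], total_cost

# Hypothetical learning rates and a toy loss that improves with budget
# while preserving the ranking between configurations.
learning_rates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0]

def toy_evaluate(lr, budget):
    return abs(lr - 0.1) + 1.0 / budget

best_lr, total_cost = successive_halving(learning_rates, toy_evaluate)
full_cost = len(learning_rates) * 9  # every configuration at the top budget of 9
```

In this toy run the search spends 18 budget units, compared with 81 units if all nine configurations were trained at the top budget.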
In the context of distributed HPO, Ray Tune enables asynchronous parallel evaluation on multi-GPU clusters, effectively reducing wall-clock time by a factor approaching the number of parallel workers. Distributed BOHB extends this to the Bayesian setting through asynchronous surrogate updates [41].
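The asynchronous pattern described here can be illustrated with the Python standard library alone. The sketch below does not reproduce the Ray Tune API; `parallel_search` and `toy_objective` are hypothetical names, and the thread pool merely stands in for a pool of GPU workers.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Optional, Tuple


def parallel_search(
    candidates: Iterable[float],
    objective: Callable[[float], float],
    max_workers: int = 4,
) -> Tuple[Optional[float], float]:
    """Evaluate candidates concurrently and track the best result as each one finishes."""
    best_config, best_loss = None, float("inf")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(objective, c): c for c in candidates}
        # as_completed yields results in finish order, not submission order.
        for future in as_completed(futures):
            loss = future.result()
            if loss < best_loss:
                best_config, best_loss = futures[future], loss
    return best_config, best_loss


def toy_objective(learning_rate: float) -> float:
    """Hypothetical stand-in for a training run; minimized at 0.1."""
    return (learning_rate - 0.1) ** 2


best_lr, best_loss = parallel_search([0.001, 0.01, 0.1, 1.0], toy_objective)
print(best_lr, best_loss)
```

Because results are consumed as they complete, a slow trial does not block faster ones, which is what lets wall-clock time shrink with the number of workers. Real training runs would typically execute in separate processes or on separate machines rather than in threads.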
Most HPO methods degrade in high-dimensional spaces. GP-based BO suffers from the curse of dimensionality in both the surrogate fitting and acquisition function optimization steps. Additive models of the objective, in which performance decomposes as a sum over subsets of hyperparameters, can partially mitigate this, but the decomposition must be inferred from data [13]. Subspace selection methods that identify an effective low-dimensional manifold have shown promise but remain an active research area.
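One common way to write the additive assumption is shown below. The notation is an assumption made for illustration: $M$ groups, index sets $A_j$, and component kernels $k_j$ are not symbols defined elsewhere in this review.

```latex
\begin{align}
f(\mathbf{x}) &= \sum_{j=1}^{M} f_j\!\left(\mathbf{x}_{A_j}\right),
\qquad A_j \subseteq \{1,\dots,d\}, \quad A_i \cap A_j = \emptyset \;\; (i \neq j), \\
k(\mathbf{x},\mathbf{x}') &= \sum_{j=1}^{M} k_j\!\left(\mathbf{x}_{A_j},\mathbf{x}'_{A_j}\right).
\end{align}
```

Each component $f_j$ depends only on a small group of hyperparameters, so both surrogate fitting and acquisition optimization can operate on low-dimensional pieces. The difficulty noted above is that the groups $A_j$ are unknown in advance.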
Validation performance estimates are noisy due to random initialization, data shuffling, and mini-batch stochasticity. GP-based BO can model noise explicitly, but surrogate fitting under high noise requires careful regularization. Non-stationarity, where the objective landscape changes as the optimization progresses (e.g., due to learning rate warmup), violates the stationary kernel assumptions of standard GPs [12].
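Explicit noise modeling in a GP surrogate is usually expressed by adding a noise variance to the kernel matrix. The standard form, assuming a zero-mean prior and homoscedastic Gaussian noise with variance $\sigma_n^2$ (symbols introduced here for illustration), is:

```latex
\begin{align}
y_i &= f(\mathbf{x}_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma_n^2), \\
\mu(\mathbf{x}_*) &= \mathbf{k}_*^{\top} \left(K + \sigma_n^2 I\right)^{-1} \mathbf{y}, \\
\sigma^2(\mathbf{x}_*) &= k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^{\top} \left(K + \sigma_n^2 I\right)^{-1} \mathbf{k}_*,
\end{align}
```

where $K$ is the kernel matrix over observed configurations and $\mathbf{k}_*$ is the vector of kernel values between the query point $\mathbf{x}_*$ and the observations. The term $\sigma_n^2 I$ acts as the regularizer mentioned above: a larger noise variance smooths the posterior mean and keeps it from interpolating noisy validation scores.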
Warm-starting HPO using knowledge from related datasets can dramatically reduce the required evaluation budget. However, identifying the appropriate notion of dataset similarity and ensuring that transferred knowledge is helpful rather than misleading remain non-trivial [28]. Task2Vec and Dataset2Vec representations have been proposed as meta-feature encoders, but their generalization across domain shifts is not well characterized.
The HPO literature suffers from inconsistent evaluation protocols: different papers use different budgets, hardware, and validation methodologies, making direct comparisons unreliable. HPOBench [44] provides a standardized benchmarking suite with tabular results from pre-computed grids, enabling fair comparison under identical computational budgets.
Federated learning introduces additional complexity to HPO: the hyperparameter landscape depends on the data distribution across clients, which is heterogeneous and unavailable to a central server. Federated HPO must navigate the communication overhead of transmitting model performance results alongside gradients, and must handle non-IID data distributions that make standard surrogate models unreliable [45].
The field of explainable HPO seeks to quantify the contribution of individual hyperparameters to model performance through tools such as functional ANOVA (fANOVA) decomposition [46] and related sensitivity analyses. Understanding which hyperparameters matter most, and under what conditions, enables practitioners to prune the search space intelligently and builds trust in the optimization process. This direction intersects with the broader push for interpretable AI.
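The idea behind this kind of variance decomposition can be sketched on a small, fully evaluated grid: the importance of a hyperparameter is the share of total performance variance explained by its marginal (main) effect. This is a simplified illustration and not the published fANOVA procedure, which computes marginals from a surrogate model; `main_effect_importance`, `toy_objective`, and the grid values are assumptions.

```python
from itertools import product
from statistics import mean, pvariance
from typing import Callable, Dict, List


def main_effect_importance(
    axes: Dict[str, List[float]],
    objective: Callable[..., float],
) -> Dict[str, float]:
    """Fraction of total variance explained by each hyperparameter's main effect."""
    names = list(axes)
    # Evaluate the objective on the full Cartesian grid.
    grid = [dict(zip(names, values)) for values in product(*axes.values())]
    losses = [objective(**cfg) for cfg in grid]
    total_variance = pvariance(losses)
    if total_variance == 0:
        return {name: 0.0 for name in names}

    importance = {}
    for name in names:
        # Marginal mean: average the loss over all other hyperparameters.
        marginals = [
            mean(loss for cfg, loss in zip(grid, losses) if cfg[name] == value)
            for value in axes[name]
        ]
        importance[name] = pvariance(marginals) / total_variance
    return importance


def toy_objective(lr: float, dropout: float) -> float:
    """Hypothetical loss dominated by the learning rate."""
    return 10.0 * lr + dropout


scores = main_effect_importance(
    {"lr": [0.0, 1.0], "dropout": [0.0, 1.0]}, toy_objective
)
print(scores)
```

For a purely additive objective such as this one the main effects account for all of the variance; any remainder in a real problem is attributable to interactions between hyperparameters.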
Training large language models (LLMs) such as GPT-4, LLaMA, and Mistral involves hyperparameter choices (learning rate schedules, warmup steps, batch size, gradient clipping) at a scale where a single evaluation can cost millions of GPU hours. Standard HPO methods, which rely on many repeated evaluations, are impractical to apply directly at this scale. Scaling laws [47], empirical power-law relationships between model size, compute, data volume, and loss, partially address this by enabling extrapolation from small-scale experiments. However, the interaction between architectural choices and optimization hyperparameters at frontier scale remains poorly understood and represents a high-impact open problem.
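The extrapolation step can be illustrated with a single-variable power law fitted by least squares in log-log space. This is a minimal sketch under strong assumptions: the loss follows $L = a N^{-b}$ exactly, with no irreducible-loss term and no dependence on data or compute, and the sample values are synthetic rather than measured.

```python
import math
from typing import List, Tuple


def fit_power_law(sizes: List[float], losses: List[float]) -> Tuple[float, float]:
    """Fit loss = a * size ** (-b) by linear regression on log-transformed data."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a: float, b: float, size: float) -> float:
    """Extrapolate the fitted power law to a new model size."""
    return a * size ** (-b)


# Synthetic small-scale runs generated from a = 5.0, b = 0.3 (illustrative only).
small_sizes = [1e6, 1e7, 1e8]
small_losses = [5.0 * s ** -0.3 for s in small_sizes]

coefficient, exponent = fit_power_law(small_sizes, small_losses)
print(coefficient, exponent, predict_loss(coefficient, exponent, 1e10))
```

In practice the fit is only as trustworthy as the assumption that the same power law holds beyond the measured range, which is precisely the uncertainty noted above for frontier-scale training.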
The environmental cost of HPO is substantial: searching over large spaces with expensive evaluations contributes significantly to the carbon footprint of ML research [30]. Carbon-aware HPO frameworks that incorporate energy consumption as an explicit optimization objective, schedule evaluations during low-carbon-intensity periods, and apply aggressive early stopping represent an important emerging research direction. Multi-fidelity methods such as BOHB are early steps in this direction, but energy-aware acquisition functions and hardware-specific efficiency modeling remain open problems.
Deploying ML models on edge devices imposes strict constraints on model size, latency, and energy. Hardware-aware NAS and multi-objective HPO methods that incorporate device-specific benchmarks are essential for this domain. Federated HPO, where hyperparameter search is conducted across distributed, privacy-sensitive data, requires communication-efficient protocols and robustness to client heterogeneity and dropout [45].
Meta-learning for HPO leverages performance data from previous tasks to initialize surrogate models with informed priors, dramatically accelerating search on new tasks. The RGPE (Ranking-Weighted Gaussian Process Ensemble) model [48] combines predictions from task-specific GPs weighted by their performance ranking, providing a principled warm-start. Learning to optimize (L2O) frameworks train the HPO algorithm itself as a recurrent network across a distribution of tasks, potentially enabling amortized HPO at inference time.
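A sketch of the ranking-based weighting, as it is commonly presented for RGPE, is given below. The notation ($\mathcal{D}_t$ for the $n_t$ target-task observations, $S$ posterior samples $f_i^{(s)}$ per model, and $\oplus$ for exclusive or) is assumed here for illustration, and details such as tie-breaking and the leave-one-out treatment of the target model are omitted.

```latex
\begin{align}
\mathcal{L}(f, \mathcal{D}_t) &= \sum_{j=1}^{n_t} \sum_{k=1}^{n_t}
  \mathbb{1}\!\left[ \big(f(\mathbf{x}_j) < f(\mathbf{x}_k)\big) \oplus \big(y_j < y_k\big) \right], \\
w_i &= \frac{1}{S} \sum_{s=1}^{S}
  \mathbb{1}\!\left[ i = \arg\min_{i'} \mathcal{L}\big(f_{i'}^{(s)}, \mathcal{D}_t\big) \right], \\
\bar{\mu}(\mathbf{x}) &= \sum_{i} w_i \, \mu_i(\mathbf{x}).
\end{align}
```

The ranking loss counts pairs of target observations that a model orders incorrectly, so a model from a previous task receives a large weight only if it ranks configurations on the new task well, regardless of the absolute scale of its predictions.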
For practitioners with very tight evaluation budgets, zero-shot HPO, which selects a single configuration based on meta-features without any evaluations, is an emerging paradigm. Portfolio methods that maintain a fixed set of configurations with strong average performance across a task distribution represent one approach; meta-learned default configurations derived from extensive offline search represent another.
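Portfolio construction can be sketched as a greedy selection over an offline table of results: at each step, add the configuration that most reduces the average best-achievable loss across tasks. The function name, the configuration labels, and the loss values below are hypothetical, and greedy selection is one simple strategy among several possible ones.

```python
from typing import Dict, List


def greedy_portfolio(losses: Dict[str, List[float]], size: int) -> List[str]:
    """Pick configurations greedily; losses[name][t] is the loss of a config on task t."""
    num_tasks = len(next(iter(losses.values())))
    chosen: List[str] = []
    best_per_task = [float("inf")] * num_tasks

    while len(chosen) < min(size, len(losses)):
        def portfolio_loss(name: str) -> float:
            # Mean over tasks of the best loss if this config joined the portfolio.
            combined = (min(b, v) for b, v in zip(best_per_task, losses[name]))
            return sum(combined) / num_tasks

        candidate = min((n for n in losses if n not in chosen), key=portfolio_loss)
        chosen.append(candidate)
        best_per_task = [min(b, v) for b, v in zip(best_per_task, losses[candidate])]
    return chosen


# Hypothetical offline results: three configurations evaluated on three tasks.
offline_losses = {
    "config_a": [0.1, 0.9, 0.5],
    "config_b": [0.9, 0.2, 0.5],
    "config_c": [0.4, 0.4, 0.4],
}
portfolio = greedy_portfolio(offline_losses, size=2)
print(portfolio)
```

The first pick is the configuration with the best average performance, which is effectively a meta-learned default; later picks are chosen for how well they complement it on tasks where the default is weak.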
This review has provided a comprehensive comparative analysis of hyperparameter optimization techniques spanning traditional exhaustive methods, probabilistic Bayesian frameworks, evolutionary and metaheuristic approaches, gradient-based architecture search, reinforcement learning controllers, multi-objective optimization, and AutoML pipelines. The analysis reveals several overarching conclusions.
First, no single HPO method universally dominates. The appropriate method depends critically on the dimensionality of the search space, the cost of objective evaluation, the nature of the hyperparameter types (continuous, integer, categorical, conditional), and the computational budget available. For low-dimensional, expensive evaluations, GP-based Bayesian optimization remains the gold standard. For high-dimensional, moderate-cost settings, TPE and SMAC provide an effective balance of scalability and efficiency. For deep learning at scale, BOHB and PBT represent the state of the practice.
Second, multi-fidelity methods represent one of the most important practical advances of the past decade. By intelligently allocating evaluation budgets based on intermediate performance, methods like Hyperband and BOHB can achieve order-of-magnitude speedups over single-fidelity approaches, with little loss in solution quality when low-fidelity performance is predictive of final performance.
Third, the AutoML paradigm has democratized HPO by packaging sophisticated optimization within end-to-end pipelines that non-experts can apply directly. However, the black-box nature of AutoML reduces interpretability and complicates debugging when pipelines fail.
Fourth, critical open problems remain at the frontier of HPO research: scaling to LLMs, federated and privacy-preserving settings, energy-aware optimization, and meta-learning for rapid adaptation to new tasks. These directions will define the trajectory of the field over the coming decade.
As machine learning continues its expansion into high-stakes domains and resource-constrained environments, the importance of principled, efficient, and interpretable HPO will only grow. This review aims to serve as a navigational reference for researchers and practitioners entering or advancing within this foundational area of the machine learning ecosystem.