A couple of months ago, we saw how MLFlow [1] could help us track our machine learning experiments [2]. The tool provides a nice interface so we can easily compare the different experiments we want, see them in tabular form, and have different kinds of interactive plots.
However, if you have to manage multiple experiments, particularly when doing pure research, I'm fairly confident you have an extra problem to face. It's not just keeping track of the experiments but also dealing with the different configurations required to test different hypotheses, with lots of parameters to explore.
It is one thing to work on a project in which the parameters to check are more or less standard and limited, which is the case for most industrial applications, vs. working on a project where there's a lot of room for exploration, which is what usually happens in research (both academic and industry-oriented).
Hydra [3] is a tool that can help us manage this scenario of multiple configurations involving a large number of parameters. It provides a framework that makes it easy to configure complex applications, such as machine learning experiments. In this article, we will explore Hydra, what it brings to the researcher's table as a tool, and when we should use it in our daily job.
Wait, why do I need Hydra?
You don't unless you do. The truth of the matter is that Hydra adds an extra layer of complexity and code to be maintained in your workflow, so you should use it when it is really needed. And when is it really needed? Let's dive into an example to figure this out.
When writing a training script in Python, an easy solution to deal with different parameters is usually argparse, with an entry for each parameter we want to change across experiments. When there are fewer than roughly ten parameters, this is usually the best solution.
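A minimal sketch of this approach (the script and parameter names here are hypothetical):

# train.py (hypothetical sketch of the argparse approach; parameter names are made up)
import argparse

parser = argparse.ArgumentParser(description="Train a model with a handful of parameters")
parser.add_argument("--input", required=True, help="Path to the training data")
parser.add_argument("--learning-rate", type=float, default=0.01)
parser.add_argument("--max-depth", type=int, default=6)
args = parser.parse_args()

print(f"Training with lr={args.learning_rate} and max_depth={args.max_depth}")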
When dealing with more than ten parameters, however, it is better to move to a configuration file. The advantage of the configuration file is that you can just add new information to it with little to no changes to the code, unlike with argparse. For example, suppose we want to be able to configure every parameter of an XGBoost model [4]; trying to expose each parameter in the CLI via argparse is a nightmare, as some of its models have more than 25 parameters. However, a YAML configuration file can have a section specifically for the model parameters that can hold all or some of them. Then, using a YAML parser, we can load the configuration into a dictionary and pass it to the model as kwargs. A much simpler solution when dealing with a large number of parameters.
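For instance, a minimal sketch of this approach, assuming a hypothetical config.yaml with a model section holding XGBoost parameters, could look like this:

# train_xgb.py (hypothetical sketch; assumes a config.yaml with a "model" section of XGBoost parameters)
import yaml
from xgboost import XGBClassifier

with open("config.yaml") as fh:
    config = yaml.safe_load(fh)  # the whole configuration as a plain dictionary

# pass only the model section as keyword arguments to the classifier
model = XGBClassifier(**config["model"])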
Nonetheless, the problem doesn't end there. When working in machine learning, a common strategy, especially when you are looking for the best configuration available, is to do a grid or random hyperparameter search [5]. This requires not only one configuration file but multiple ones (i.e., one per possible combination of hyperparameters). This is the case where Hydra can help us simplify the creation of multiple configurations.
What can Hydra do?
Some of the characteristics that make Hydra stand out are the following:
Hierarchical configuration composed of multiple sources.
The configuration can be specified and overridden from the command line.
The application can be run locally or remotely.
The app can be run multiple times with different arguments using a single command.
Although we will explore only the first two items in this article, you can read Hydra's documentation [6] to explore all the possibilities offered by the tool.
Installing Hydra
To install Hydra, you can create a Python virtual environment and install the package hydra-core. We will also install Scikit Learn and Pandas:
$ python -m venv hydra-venv
$ source ./hydra-venv/bin/activate
(hydra-venv) $ pip install hydra-core scikit-learn pandas
That should install the latest version of Hydra that is available.
Basic Hydra Application
A basic Hydra application is defined by the decorator @hydra.main, which takes the path to the base configuration file, a YAML file, and passes its content to the decorated function as a DictConfig, a dictionary-like object holding the configuration. The library behind this is OmegaConf [7], which has its own set of interesting features that you can use in your work as well:
# experiment.py
import hydra
from omegaconf import DictConfig, OmegaConf
@hydra.main(config_path='conf', config_name='config', version_base=None)
def main(cfg: DictConfig):
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
We run the experiment like a regular Python script:
(hydra-venv) $ python ./experiment.py
The config_path parameter is the directory where the configuration file is stored, while the config_name parameter is the name of the configuration file without the extension. These two parameters can be omitted, in which case all the configuration is expected to come from the CLI. The extra parameter version_base is not required, but it's recommended to set it explicitly; Hydra uses it for backward compatibility [8]. Finally, the function the decorator is applied to must take a parameter, in this case cfg, where the configuration will be stored and from which it can be accessed in the application. If we use the parameters config_path and config_name, as in the example above, we are expected to have a directory structure like this:
.
├── conf
│   └── config.yaml
└── experiment.py
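For this toy example, any valid YAML will do; a minimal, purely hypothetical ./conf/config.yaml could contain something like:

# ./conf/config.yaml (hypothetical content for the toy example)
message: Hello, Hydra!
learning_rate: 0.01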
If we run the previous program, it will simply print the content of the configuration file.
Hydra Output Directory
Since Hydra is built around the idea of having multiple runs with different configurations, it avoids the problem of needing to specify a new output directory for each run: it creates a directory per run and executes the code within that output directory [9]. This output directory stores the Hydra output for the run (configurations, logs, etc.), and by default each experiment runs under the directory ./outputs/YYYY-MM-DD/HH-MM-SS/.
The output directory of the run has a hidden directory named .hydra that holds three configuration files:
config.yaml: A dump of the user-specified configuration.
hydra.yaml: A dump of the Hydra configuration.
overrides.yaml: The command line overrides used. We will see more about these shortly.
It is important to note that, in order for any text to be logged into this directory, we must use Python's logging [10]:
# experiment.py
import hydra
import logging
from omegaconf import DictConfig, OmegaConf
logger = logging.getLogger(__name__)
@hydra.main(config_path='conf', config_name='config', version_base=None)
def main(cfg: DictConfig):
    logger.info(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
Now, after we run the experiment, the log file will have the configuration YAML written in it.
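As an illustration, the output of a single run would end up in a structure roughly like the following (the timestamped directory names will differ, and by default the log file takes the name of the script):

outputs
└── 2025-01-01
    └── 12-00-00
        ├── .hydra
        │   ├── config.yaml
        │   ├── hydra.yaml
        │   └── overrides.yaml
        └── experiment.log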
Running an Experiment
So far, we have only shown a toy example of Hydra printing the configuration it reads when it runs. A more complex example would be to have it do some actual experimentation. We modify the experiment script:
# experiment.py
import hydra
import logging
import pandas as pd
from omegaconf import DictConfig, OmegaConf
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
logger = logging.getLogger(__name__)
@hydra.main(config_path='conf', config_name='config', version_base=None)
def main(cfg: DictConfig):
    logger.info(f"Loading dataset from {cfg.input}")
    data = pd.read_csv(cfg.input)
    train_data = data.loc[data['Split'] == 'train'].iloc[:, 2:].values
    train_target = data.loc[data['Split'] == 'train', 'Quality'].values
    eval_data = data.loc[data['Split'] == 'test'].iloc[:, 2:].values
    eval_target = data.loc[data['Split'] == 'test', 'Quality'].values
    logger.info("Training classification model")
    clf = LogisticRegression(**cfg.model).fit(train_data, train_target)
    eval_preds = clf.predict(eval_data)
    logger.info("Classification results:\n" + classification_report(eval_target, eval_preds))


if __name__ == "__main__":
    main()
To run the previous script, you need to download the CSV file with the data from the Wine dataset [11] that I compiled in a GitHub Gist [12].
The configuration file (in ./conf/config.yaml) is the following:
input: ./wines-data.csv
model:
  penalty: l2
  solver: liblinear
  C: 1000
Now, we can run the experiment. The results will be logged both into the console and the output directory.
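The command is the same one we used before; the input path is already set in the configuration file:

(hydra-venv) $ python ./experiment.py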
Overriding the Configuration from the CLI
Part of Hydra's power lies in its ability to change configurations without the need to modify the configuration file. We can use the command line interface (CLI) to do it. There are three operations in that regard:
Override an existing parameter.
Append a new parameter.
Upsert a parameter (i.e., override if the parameter exists or create if it doesn't).
When running the command, we refer to the parameter to change, insert, or upsert via its dotted path. To insert a new value that is not present in the original configuration file, we must prepend +, and to upsert a value, we must prepend ++. Since, in the previous code, we use the model section of the configuration directly as a kwargs dictionary for the logistic regression classifier [13], we can add any value that is a valid parameter of it:
(hydra-venv) $ python ./experiment.py \
    model.penalty=l1 \
    +model.max_iter=10000 \
    ++model.solver=saga \
    ++model.random_state=42
In the previous run, we changed the value of the penalty from l2 to l1, added the new parameter max_iter, changed the solver to saga using upsert, and, also via upsert, added the parameter random_state. If we inspect the configuration file from the output directory of this last run, we can see the updated configuration with all the overrides applied:
input: ./wines-data.csv
model:
  penalty: l1
  solver: saga
  C: 1000
  max_iter: 10000
  random_state: 42
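For reference, the .hydra/overrides.yaml file of the same run simply records the raw CLI overrides as a list, roughly like this:

- model.penalty=l1
- +model.max_iter=10000
- ++model.solver=saga
- ++model.random_state=42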
Advanced Configuration File
Hydra supports more complex values than just strings and numbers in its configuration files, making it an extremely good tool when dealing with a large number of configuration paths. The flexibility of Hydra's configuration comes, among other things, from three properties:
Required Configurations: Sometimes, we need to require some specific configuration because there isn't a suitable default value. For example, we may want the user of the experiment script to provide the path to the input file. These configurations are marked with a special value of three question marks: ???.
Value Interpolation: Some configuration values might be dependent on another value in the same configuration. In those cases, we can use interpolation to access the required value. We apply it via the special syntax for interpolation that is composed of a dollar sign and the path to the value between braces: ${path.to.config}.
Resolvers: These define Python functions that will be evaluated at runtime, while the experiment script executes. They are a very powerful tool that gives extreme flexibility and should be treated with caution. I particularly use them along with Python's eval function to gain more flexibility, as in the sketch after this list. Of course, this should be done only during the training of a model, as putting it in production code would pose a security threat.
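As a minimal sketch of these three properties using OmegaConf directly, outside of Hydra (the configuration keys here are made up for illustration):

# omegaconf_demo.py (hypothetical sketch of required values, interpolation, and resolvers)
from omegaconf import OmegaConf

# register a resolver named 'eval' that runs Python's eval on its argument
OmegaConf.register_new_resolver('eval', lambda x: eval(x))

cfg = OmegaConf.create("""
input:
  data: ???                           # required configuration
  random_seed: 42
model:
  C: ${eval:1 / 1e-3}                 # resolver, evaluated when the value is accessed
  random_state: ${input.random_seed}  # value interpolation
""")

print(OmegaConf.is_missing(cfg.input, 'data'))  # True: the value must be provided
print(cfg.model.C)                              # 1000.0
print(cfg.model.random_state)                   # 42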
Moreover, Hydra allows us to have a sort of submodule of configurations spread across multiple files and directories. Let's expand our previous example to show the possibilities.
Suppose we have the following directory structure:
.
├── conf
│   ├── config.yaml
│   └── model
│       ├── logreg.yaml
│       └── svm.yaml
├── experiment.py
└── wines-data.csv
Now, there have been a few changes in the configuration files. We have a base configuration file, which is ./conf/config.yaml, and two different configuration files depending on the model we want to run: ./conf/model/logreg.yaml, for logistic regression, and ./conf/model/svm.yaml, for support vector machines [14].
First, the configuration files for the two models are pretty similar:
# logreg.yaml
module: ${eval:LogisticRegression}
params:
  penalty: l2
  solver: liblinear
  C: ${eval:1 / 1e-3}
  random_state: ${input.random_seed}
# svm.yaml
module: ${eval:LinearSVC}
params:
  penalty: l2
  loss: hinge
  C: ${eval:1 / 1e-3}
  random_state: ${input.random_seed}
Each of the two has the name of the module that it will use in the experiment script and its parameters. The name of the module and the C parameter are wrapped in a resolver, eval, that we will define in the experiment script; it runs Python's eval function to evaluate the string into a Python value. In the case of the module, that value is the class of the model (i.e., LogisticRegression or LinearSVC). The C parameter, which is the inverse of the regularization strength, evaluates to a float value, the result of a division. There's also random_state, a case of value interpolation: it looks up the path input.random_seed, which is part of the base configuration file:
# config.yaml
defaults:
  - _self_
  - model: logreg

input:
  data: ???
  random_seed: 42
The configuration file changed slightly. The defaults section determines which subconfiguration will be used; in this case, it is the logreg configuration for the model. The _self_ entry determines the position of config.yaml in the final composed configuration: if it's placed after the model configuration, any configuration that appears in config.yaml will override the configuration in logreg.yaml. The input entry was changed from a single value into a section with two values. The data value is expected to be overridden with the path to the input file, while the random_seed value is used in the interpolation of each of the model config files; having it here avoids having to override it in each config file separately.
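For instance, swapping the order as sketched below would make config.yaml take precedence over the model files, since later entries in the defaults list override earlier ones:

# config.yaml (alternative ordering: values in config.yaml would override logreg.yaml)
defaults:
  - model: logreg
  - _self_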
The experiment script also changed a little:
# experiment.py
import hydra
import logging
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from omegaconf import DictConfig, OmegaConf
logger = logging.getLogger(__name__)
@hydra.main(config_path='conf', config_name='config', version_base=None)
def main(cfg: DictConfig):
    OmegaConf.register_new_resolver('eval', lambda x: eval(x))
    logger.info(f"Loading dataset from {cfg.input.data}")
    data = pd.read_csv(cfg.input.data)
    train_data = data.loc[data["Split"] == "train"].iloc[:, 2:].values
    train_target = data.loc[data["Split"] == "train", "Quality"].values
    eval_data = data.loc[data["Split"] == "test"].iloc[:, 2:].values
    eval_target = data.loc[data["Split"] == "test", "Quality"].values
    logger.info("Training classification model")
    clf = cfg.model.module(**cfg.model.params).fit(train_data, train_target)
    eval_preds = clf.predict(eval_data)
    logger.info("Classification results:\n" + classification_report(eval_target, eval_preds))


if __name__ == "__main__":
    main()
We register a resolver for eval that evaluates the string into a Python value, as we said before. Also, since the model can be either LogisticRegression or LinearSVC, we import both into the script so that there aren't any errors when the configuration is evaluated. This time, we declare the classifier by calling cfg.model.module, which resolves to one of the two available classes. If we run the experiment, it will require us to provide the path to the dataset file:
(hydra-venv) $ python ./experiment.py input.data=./wines-data.csv
Beyond that required value, we can override the defaults the same way we did before. We can even use interpolation with resolvers directly from the CLI:
(hydra-venv) $ python ./experiment.py \
    input.data=./wines-data.csv \
    model=svm \
    model.params.C='${eval:1/1e-4}'
If we inspect the final configuration file for this last run, we get the following:
input:
  data: ./wines-data.csv
  random_seed: 42
model:
  module: ${eval:LinearSVC}
  params:
    penalty: l2
    loss: hinge
    C: ${eval:1/1e-4}
    random_state: ${input.random_seed}
Closing Remarks
In this article, we briefly explored Hydra, explaining why and when it is useful and giving examples of how to use it, considering all the power and flexibility it offers when working on research projects. It was a brief introduction to the library, with the idea of giving you yet another tool for your daily work in data science, machine learning research, or engineering.
In an upcoming entry in this blog, we will see how I combine Hydra with the power of MLFlow to set up an experimentation framework that has served me excellently for the past couple of years.
References
[1] MLFlow: ML and GenAI made simple. https://mlflow.org/
[2] Cardellino, C. 2024. Keeping Track of Experiments with MLFlow. Transcendent AI. https://www.transcendent-ai.com/post/keeping-track-of-experiments-with-mlflow
[3] Hydra. A framework for elegantly configuring complex applications. https://hydra.cc/
[4] XGBoost Documentation. https://xgboost.readthedocs.io/en/stable/
[5] Scikit Learn Documentation. Tuning the hyper-parameters of an estimator. https://scikit-learn.org/stable/modules/grid_search.html
[6] Hydra Documentation. https://hydra.cc/docs/intro/
[7] OmegaConf Documentation. https://omegaconf.readthedocs.io/
[8] Hydra Documentation. version_base. https://hydra.cc/docs/upgrades/version_base/
[9] Hydra Documentation. Output/Working directory. https://hydra.cc/docs/tutorials/basic/running_your_app/working_directory/
[10] Python Documentation. Logging facility for Python. https://docs.python.org/3/library/logging.html
[11] Aeberhard, S., Coomans, D. and De Vel, O., 1994. Comparative analysis of statistical pattern recognition methods in high dimensional settings. Pattern Recognition, 27(8), pp.1065-1077.
[12] Wine dataset from UCI repository, in CSV format, split into train/test/validation. https://gist.github.com/crscardellino/d29014d8cd1605da17b895ef42ff71ff
[13] Scikit Learn Documentation. Logistic Regression. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
[14] Scikit Learn Documentation. Support Vector Machines. https://scikit-learn.org/stable/modules/svm.html