
Keeping Track of Experiments with MLFlow

Updated: Apr 4

When working in Machine Learning and AI, either with traditional techniques or with deep learning, keeping track of experiments is an essential part of the process, both in industry and in academic research.

Looking for the best set of hyperparameters (what the technical jargon usually refers to as hyperparameter optimization, or "HPO") so that your model gets the best results (because it generalizes better, wins the competition, reaches the state of the art, etc.) is a crucial part of the job. It can become a hefty burden when there's no easy way to follow what has been done and which hyperparameters work best for the task at hand.

The most straightforward approach to keeping track of hyperparameter optimization is logging the results into a text file (e.g., writing the set of hyperparameters and the resulting values of the metrics we are trying to optimize). However, this technique is exceptionally simplistic. In my experience (this was basically the only option I had while working on my Ph.D. thesis), it becomes unmanageable as soon as the number of experiments starts to grow. That happens quickly even with something simple yet powerful like grid search [1], where the number of experiments grows exponentially with the number of hyperparameters.
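To make that growth concrete, here is a minimal sketch that enumerates a grid by hand with the standard library. The search space and the `grid` helper are made up for illustration; they are not part of any library.

```python
from itertools import product

# Hypothetical search space: 3 hyperparameters with 3 candidate values each.
search_space = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "hidden_size": [32, 64, 128],
    "weight_decay": [0.0, 1e-6, 1e-4],
}


def grid(space):
    """Return every combination of hyperparameter values as a list of dicts."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


combinations = grid(search_space)
print(len(combinations))  # 3 * 3 * 3 = 27 experiments

# Adding a fourth hyperparameter with 3 values triples the number of runs.
search_space["dropout"] = [0.0, 0.1, 0.5]
print(len(grid(search_space)))  # 3 ** 4 = 81 experiments
```

With only four hyperparameters we are already at 81 result lines to compare by eye in a text file.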

A couple of years ago, Google came up with a better solution than the traditional way of logging everything to text files (or, even worse, to the console) in the form of TensorBoard [2]. This tool has become a de facto standard in many machine learning applications and libraries, to the point that even PyTorch [3], the framework that became the direct competitor of TensorFlow [4] (the library TensorBoard was originally built on top of), has a built-in wrapper to use TensorBoard in a more "native" way [5].

The problem with this last solution is that, as TensorBoard was designed as a complementary library to TensorFlow, the full potential of the library is difficult to grasp from the PyTorch perspective. Moreover, if we are using more traditional machine learning techniques, standard in the industry, that don't even require deep learning frameworks (e.g., the scikit-learn library [6], or specialized libraries like Annoy [7] for approximate nearest neighbors search or FastText [8] for text classification and embeddings), the whole TensorBoard library can become overkill. So, what is my weapon of choice over TensorBoard?



Since 2019, my go-to solution for keeping track of experiments has been MLFlow [9]. This solution not only offers tools for keeping track of experiments (either locally or remotely), making comparison of results extremely easy [10], but it also provides tools for model deployment and management.

In this article, I will show you the basics of MLFlow and how to add it to your workflow to keep experimentation on track. I will also share some personal ways of working that I use in my daily experimentation cycle to make better use of MLFlow when doing hyperparameter optimization. Finally, to tie this up with my previous post [11], I will show you how you can use MLFlow alongside PyTorch Lightning [12] to keep track of experiments using the deep learning framework.

Environment Set Up

Let's set up an environment for dealing with MLFlow first. We'll be creating a simple Python Virtual Environment, and we'll install a couple of tools to deal with MLFlow:

$ python -m venv venv
$ source ./venv/bin/activate
(venv) $ pip install -U mlflow scikit-learn lightning torchvision

We installed the MLFlow library, but we added a couple more libraries that will serve us as well: Scikit Learn for traditional machine learning (plus it has a large set of metrics for Machine Learning), Lightning for the model we'll be building, and finally Torchvision [13] which gives us access to the MNIST dataset [14] which we are going to use for our example training.

The Basics of MLFlow

Before showing how to combine MLFlow with Lightning, which does require a few extra steps, I wanted to show you the basics of MLFlow using the SVM classifier [15] from Scikit Learn and the classic Iris dataset [16]:

import mlflow
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.svm import SVC

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data['data'], data['target'],
    train_size=0.8, random_state=42
)

grid = [
    {"kernel": ["linear"], "C": [0.1, 1]},
    {"kernel": ["rbf", "poly"], "gamma": [1, 10]}
]

for parameters in ParameterGrid(grid):
    # Each parameter combination is tracked as its own MLFlow run
    with mlflow.start_run():
        mlflow.log_params(parameters)
        clf = SVC(**parameters).fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        mlflow.log_metrics({
            "accuracy": accuracy_score(y_test, y_pred),
            "f1_score_macro": f1_score(y_test, y_pred, average="macro")
        })

The previous snippet runs a simple Grid Search using ParameterGrid [17] and tries three different kernels ("linear", "rbf", and "poly") with some specific parameters (in the case of the linear kernel, it varies the regularization strength with the "C" parameter, and for the other two kernels it changes the kernel coefficient with the "gamma" parameter).

For each set of parameters, it tracks an MLFlow run [18], during which it logs the parameters (a dictionary of parameter names and parameter values) and, after training the model, the model evaluation metrics.

Since we didn't touch the default configuration from MLFlow, the script will create a directory inside the same directory where it was run: mlruns. In this directory, all our experiments will be stored. To access MLFlow UI, we simply run:

(venv) $ mlflow ui

We run this command in the same directory where mlruns is, and then go to http://localhost:5000 to check our results. If everything has gone as expected, you should see something like this:

MLFlow UI showing the runs of the SVM classification experiments over the Iris Dataset.

This is the primary UI from MLFlow, and it shows the 6 experiment runs that were done in the previous script. As you can see, the UI is pretty intuitive:

  • On the left, we have the list of experiments. Since we didn't set up an experiment (neither by name nor by ID), our runs were assigned to the "Default" experiment. An experiment is made up of different runs (or iterations) that share some specification but differ in their parameter values. Even though you can technically run everything within the same experiment, it's helpful to group runs by what they have in common. For example, an experiment can gather the different configurations tried over an evaluation dataset (e.g., validation vs. test data) or over a type of model (e.g., generative vs. discriminative models).

  • For each run, you have a table with the run name (an identifier useful for easily determining what run did what; by default, it gives some random name, but it can be changed), when it was created, and the duration, among some other metadata of the run.

If we access other columns (we can show and hide them with the "Columns" button), we can display the different parameters and metrics. We can then order the results by one of the metrics and obtain something like this:

MLFlow runs comparisons of parameters and metrics.

Here, we decided to show only Metrics and Parameters (hiding most of the other attributes) and use a descending order of the f1-score macro average to get the best run. Because of the simplicity of the problem, of course, in this case, most of the configurations gave a perfect score on the test data. The parameters were the ones logged for each run, and since not all the runs had access to all the parameters (e.g., the "linear" kernel didn't have access to the "gamma" parameter, and the other kernels didn't have access to the "C" parameter), some of the parameters appear partially in the Table.

Another view we can check is the Chart view, with the button next to the "Table" view:

MLFlow Chart View. It shows a horizontal bar plot for each of the experiments.

This view showcases a plot comparing the different runs we had for the two metrics we logged. Since, in this case, we only logged each metric at the end, the plot is a horizontal bar plot.

Finally, if we select the different runs by clicking the checkbox to the left of each row and then "Compare" them, we have another view with more charts for comparing them:

MLFlow runs comparison showing a scatterplot using the type of kernel as the x-axis and the F1-score as the y-axis.

By default, the comparison UI shows a Parallel Coordinate Plot. However, I prefer either the Scatter Plot or the new Box Plot. In this case, I selected the Scatter Plot, with the x-axis being the type of Kernel and the y-axis being the f1-Score macro average. It is easy to see that the only kernel with a configuration that doesn't reach the perfect score is the "poly" kernel.

This view also shows the parameters for each of the six runs we have here (highlighting in yellow the ones that differ between runs), as well as the metrics. It's a useful view if we want more detail or if the Charts from the main view are not enough for our analysis.

MLFlow + Lightning

We have seen the basics of MLFlow, such as how to set up a simple experiment, track the runs, and analyze them in the UI. However, this is merely scratching the surface of what MLFlow can do, and I really recommend the MLFlow documentation [19] for a more thorough understanding of its mechanics.

The goal of this article isn't to substitute their excellent documentation but to give you some insight into things that it doesn't cover in depth. One point that I think is worth checking out (not only because it's a compelling combination, but also because the documentation on its usage is pretty thin) is how to use MLFlow as the Logger in a PyTorch Lightning training experiment, along with some of the techniques I find helpful when using MLFlow as the Logger [20].

The Libraries

We will be using MLFlow for tracking the experiments, PyTorch Lightning to build the model (with the help of the Neural Networks module of Torch), and Scikit Learn for running some evaluation metrics:

import mlflow
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning.pytorch as pl

from itertools import chain
from lightning.pytorch.callbacks import EarlyStopping
from lightning.pytorch.loggers import MLFlowLogger
from sklearn import metrics
from torch.optim import Adam
from torch.utils import data
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

The Dataset

As we said before, we are going to use a classic deep learning dataset, MNIST, for automatic handwritten digit recognition. We'll use the TorchVision library's version, which provides training and test data. We will use 80% of the training data for training and the remaining 20% for validation (and early stopping):

train_dataset = MNIST(os.getcwd(), download=True,
                      transform=ToTensor(), train=True)
test_dataset = MNIST(os.getcwd(), download=True,
                     transform=ToTensor(), train=False)

train_set_size = int(len(train_dataset) * 0.8)
valid_set_size = len(train_dataset) - train_set_size
seed = torch.Generator().manual_seed(42)
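train_dataset, validation_dataset = data.random_split(
    train_dataset,
    [train_set_size, valid_set_size],
    generator=seed
)

# NOTE: the DataLoader arguments are missing from the original listing;
# the batch size and the shuffling below are assumptions.
batch_size = 64
train_dataloader = data.DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = data.DataLoader(test_dataset, batch_size=batch_size)
validation_dataloader = data.DataLoader(
    validation_dataset, batch_size=batch_size)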

The Model

The model will be a simple multilayer perceptron with a single hidden layer since it provides more than enough power for the simple MNIST problem. We define it as a Lightning Module and set the steps for training, validation, and prediction, as well as the optimizer configuration:

class MNISTClassifier(pl.LightningModule):
    def __init__(self, hidden_size=64, lr=1e-3, weight_decay=1e-6):
        super().__init__()
        # Stores hidden_size, lr and weight_decay in self.hparams
        self.save_hyperparameters()
        self.model = nn.Sequential(
            nn.Linear(28*28, hidden_size),
            # NOTE: the activation is missing from the original listing;
            # ReLU is an assumption
            nn.ReLU(),
            nn.Linear(hidden_size, 10)
        )

    def forward(self, x):
        # Flattens each image, (batch, 1, 28, 28) -> (batch, 28*28),
        # and inputs it to the model
        return self.model(x.view(x.size(0), -1))

    def _loss(self, x, y, **kwargs):
        # Shared by the training and validation steps
        loss = F.cross_entropy(self(x), y)
        self.log(value=loss, **kwargs)
        return loss

    def training_step(self, batch, batch_idx):
        return self._loss(*batch, name="train_loss",
                          prog_bar=True, on_epoch=True)

    def validation_step(self, batch, batch_idx):
        return self._loss(*batch, name="validation_loss",
                          on_epoch=True, on_step=False)

    def predict_step(self, batch, batch_idx, dataloader_idx=0):
        logits = self(batch[0])
        return torch.argmax(logits, dim=-1)

    def configure_optimizers(self):
        optimizer = Adam(self.parameters(),
                         lr=self.hparams.lr,
                         weight_decay=self.hparams.weight_decay)
        return optimizer

The call to the save_hyperparameters() method at initialization saves all the arguments of the init function into the hparams attribute, which we can access later (e.g., when configuring the optimizer). We define a method for the loss that both the training and validation steps can use. Passing on_epoch=True to log tells the logger to also record the value aggregated at the end of each epoch. For the validation step, we additionally pass on_step=False to avoid saving the loss at each step, since a single step (i.e., the loss over a single batch) tells us little.

The Experiment Run

The last part of the process is to run the experiment. In this case, we won't be running a Grid Search like for Scikit Learn, but we will run the same experiment twice, only changing the size of the hidden layer (we'll use 64 and 32):

tracking_uri = "./mlruns"
experiment_name = "Lightning MNIST"
hidden_size = 64
epochs = 5
early_stopping = 2

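mlflow.set_tracking_uri(tracking_uri)
mlflow.set_experiment(experiment_name)

with mlflow.start_run() as run:
    # Point the Lightning logger to the run opened by this block
    logger = MLFlowLogger(
        experiment_name=experiment_name,
        tracking_uri=tracking_uri,
        run_name=run.info.run_name,
        run_id=run.info.run_id
    )
    mlflow.log_params({
        "epochs": epochs,
        "early_stop": early_stopping
    })

    classifier = MNISTClassifier(hidden_size=hidden_size)

    # NOTE: the arguments of EarlyStopping and Trainer are missing from the
    # original listing; the ones below are assumptions.
    early_stopping = EarlyStopping(
        monitor="validation_loss", patience=early_stopping
    )
    trainer = pl.Trainer(
        max_epochs=epochs, logger=logger, callbacks=[early_stopping]
    )
    trainer.fit(classifier, train_dataloader, validation_dataloader)

    y_true = [instance[1] for instance in test_dataset]
    y_pred = list(chain(*[
        bp.tolist() for bp in
        trainer.predict(model=classifier, dataloaders=test_dataloader)
    ]))

    acc = metrics.accuracy_score(
        y_true, y_pred
    )
    f1_score = metrics.f1_score(
        y_true, y_pred, average='macro'
    )
    mlflow.log_metrics({
        "accuracy": acc,
        "f1_score": f1_score
    })

    with open("/tmp/classification_report.txt", "wt") as fh:
        print(metrics.classification_report(y_true, y_pred), file=fh)
    mlflow.log_artifact("/tmp/classification_report.txt")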

As you can see, we still make use of start_run() from MLFlow, but this time we set up a couple of things before it. By default, the MLFlowLogger uses "lightning_logs" as the experiment name. We override that and configure the MLFlowLogger to match the run of the block: the directory where everything is stored ("mlruns", passed as the "tracking_uri"), the name of the experiment, and the name and the ID of the run. We then use this logger in the Trainer instead of the TensorBoard Logger.

When logging the parameters, we only log "epochs" and "early_stop". This is because they aren't hyperparameters of the model itself; the model's hyperparameters (hidden_size, lr, and weight_decay) are logged when the model calls save_hyperparameters().

After the model is trained, we get the predictions. Since the prediction step is done in batches, we have to concatenate those batches to flatten the list. That's why we use the chain function from itertools.
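The flattening step on its own looks like this. The nested list is a made-up stand-in for the per-batch output of the prediction step, already converted to plain lists:

```python
from itertools import chain

# Stand-in for the prediction output: one list of predicted labels per batch.
batched_predictions = [[7, 2, 1], [0, 4, 1], [4, 9]]

# chain.from_iterable walks the batches one after another; it is equivalent
# to chain(*batched_predictions) without unpacking the outer list.
y_pred = list(chain.from_iterable(batched_predictions))
print(y_pred)  # [7, 2, 1, 0, 4, 1, 4, 9]
```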

Finally, we log the metrics; in this case, we log the same metrics of accuracy and f1-score macro average, but in the end, we create a temporary file where we print the classification report. We can log this classification report as an MLFlow artifact [21], something that is very powerful and can become very handy.

After this, we can rerun the MLFlow UI, and we can see we have a new experiment with the two runs (you have to run the script twice, changing the value of hidden_size):

MLFlow UI with the results of 2 different runs of the PyTorch Lightning Model for the MNIST dataset.

Since MNIST is a more complex dataset than the Iris dataset, the results are more nuanced. If we go to the Charts view, we will also see some other charts in comparison with the Iris dataset solution:

MLFlow Charts view for the MNIST Dataset runs. It shows the two logged metrics of Accuracy and F1-score, but it also shows the losses logged by the PyTorch Lightning Module.

Again, we have our two horizontal bar plots with accuracy and the f1-score macro average for the two runs, but we also have a couple of line plots. You can ignore the "epochs" one; it is created because the MLFlowLogger logs the epoch as a metric at some point. The interesting ones are the other three. We have line plots showing how the training loss decreases along the run, both at each step (this is at batch level, which makes it more irregular) and at epoch level, and we also have the validation loss at epoch level for each of the two runs.

If we click one of the runs, expand the list of metrics on the run info page, and click on the train_loss_epoch metric, we will see a line plot with the metric's progression. We can then add the validation metric to the "Y-Axis" on the left to compare train and validation losses in the same plot for the same run:

Train and Validation losses progression comparison for a single run in MLFlow.

This view is handy when analyzing overfitting problems in the model. Finally, if we go back to the page of the run and expand the "Artifacts" section, we can see the classification report that we logged as an artifact in the final part of the script:

MLFlow single run view with the logged artifacts for the run.

Even though, in this case, the example is quite simplistic, the artifacts allow us to record any type of file as part of the run. In particular, the UI supports displaying text, images, and CSV files directly, which can help with things like showing some plots that are not possible in MLFlow (e.g., the heatmap of a confusion matrix).
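As a sketch of that idea, the snippet below writes a confusion matrix to a CSV file using only the standard library. The `confusion_matrix_csv` helper, the toy labels, and the file name are all made up for illustration.

```python
import csv
import os
import tempfile
from collections import Counter


def confusion_matrix_csv(y_true, y_pred, labels, path):
    """Write a confusion matrix (rows: true label, columns: predicted) as CSV."""
    counts = Counter(zip(y_true, y_pred))
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["true/pred", *labels])
        for true_label in labels:
            writer.writerow(
                [true_label, *(counts[(true_label, pred)] for pred in labels)]
            )
    return path


# Toy labels standing in for the MNIST test set and the model predictions.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
report_path = confusion_matrix_csv(
    y_true, y_pred, labels=[0, 1, 2],
    path=os.path.join(tempfile.mkdtemp(), "confusion_matrix.csv"),
)
with open(report_path) as fh:
    print(fh.read())
```

Inside an active run, the resulting file could then be passed to mlflow.log_artifact(report_path), in the same way as the classification report.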

Final Thoughts

MLFlow provides a compelling and versatile way to keep track of the experimentation process. In this article, I barely scratched the surface of what is possible with it. In my experience, using it alongside Lightning has made my job much more manageable. I can keep track of experiments, and I have a much easier time analyzing them (without having to waste a lot of time building plots). I am sharing my experience so you can also benefit from this winning combination.


[1] Scikit-Learn. "Tuning the hyper-parameters of an estimator".

[2] TensorBoard: TensorFlow's visualization toolkit.

[5] PyTorch. "How to use TensorBoard with PyTorch".

[6] Scikit Learn: Machine Learning in Python.

[7] Spotify. Annoy (Approximate Nearest Neighbors Oh Yeah).

[8] FastText: Library for efficient text classification and representation learning.

[9] MLFlow: ML and GenAI made simple.

[11] Cristian Cardellino. "Fine-Tuning Hugging Face Language Models with Pytorch Lightning". Transcendant AI.

[12] Lightning AI. PyTorch Lightning Documentation.

[14] Deng, L. (2012). The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6), 141–142.

[15] Scikit-Learn. Support Vector Classification.

[16] Fisher, R. A. (1988). Iris. UCI Machine Learning Repository.

[19] MLFlow: A Tool for Managing the Machine Learning Lifecycle.
