
Deploying Machine Learning Models with FastAPI and Docker


One of the most important things when working with machine learning models in the industry is the possibility of serving them in a production environment. After all, what's the point of having a perfect model that nobody can benefit from?

When working with machine learning models, one standard solution is to build a REST API around them and serve them through it. Luckily, in the Python community, there are several options to work with to serve a model. In this article, we will explore how to set up an environment to serve a model via a REST API, covering the basic setup, the configurations, and the tools needed to ease the deployment process using FastAPI [1] and Docker [2].


[Image: FastAPI and Docker logos]

The Tools


Before starting with the implementation details, let me explain why I chose these tools for building and deploying machine learning models.


Why FastAPI?


The Python community offers several different tools for writing web applications. Generally, when I need to work on something that requires some form of UI, I go with Django [3]. However, even in those cases, I still prefer to serve the models with something that has better built-in capabilities for writing REST APIs; even though libraries such as DRF [4] do give Django that possibility, I find them a little complex, and they require too much configuration. Another tool that is very powerful but far more straightforward to set up and configure than Django is Flask [5]; I find it very good when some form of simple UI is needed, because it lets you seamlessly integrate the UI with the REST APIs with the help of a library like flask-smorest [6]. I have written entire demos for projects I work on using these tools [7].

However, when writing a pure REST API service that doesn't require any form of UI or even a DB (except for caching purposes), there is no better solution than FastAPI. It is straightforward to set up and get running with minimal configuration. This, alongside ASGI servers like uvicorn [8] and perhaps some configuration to serve many minimal applications behind an nginx [9] proxy, is a combination that has made my life as a machine learning engineer much simpler.


Why Docker?


Docker is my de facto tool for deployment. I wouldn't recommend it for setting up anything that requires CUDA capabilities (for example, a deep learning model that needs a GPU for inference), because that configuration is quite cumbersome [10]; but the truth is that models requiring a GPU are usually better managed through dedicated cloud services. When working on a self-managed platform and serving some models, especially with limited resources, there are usually more cost-efficient solutions, and most of them don't require a GPU. I recommend reading my previous article on practical machine learning [11], where I explain in more detail why it's essential to start with more classical models before jumping to the latest technologies, which usually don't translate well to real-world scenarios that handle lots of requests (unless you are in big tech).


Environment Setup


The Repository


The code for this article is available under MIT license in a GitHub repository [12]. You can clone it:

$ git clone https://github.com/crscardellino/basic-ml-deploy.git
$ cd basic-ml-deploy

The repository has the following structure:

.
├── config
│   ├── config.yaml
│   └── nginx
│       └── default.conf
├── docker-compose.yml
├── LICENSE
├── README.md
└── word_vectors
    ├── Dockerfile
    ├── __init__.py
    ├── main.py
    └── requirements.txt

The application lives in the word_vectors directory, and the configurations are stored in the config directory.


The Model


The first step is to create a FastAPI application to serve our model. But which model? Since we are not interested in model training itself, only in model serving, I decided to build a simple REST API wrapper around a FastText [13] word vectors model, particularly the vectors trained for English on Common Crawl [14]:

$ mkdir models/  # We save our models in a subdirectory of the cloned repository
$ curl -L https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz -o ./models/cc.en.300.bin.gz
$ gunzip ./models/cc.en.300.bin.gz

We should now have the model in the root of our repository:

.
├── config
│   ├── config.yaml
...
├── models
    └── cc.en.300.bin
...

The Packages


In the past couple of weeks, I have found uv from Astral [15] to be a real life-changer when it comes to deploying Python projects. It's a package manager written in Rust that is extremely fast at installing Python packages and works as a drop-in replacement for pip. Even though it isn't yet fully compatible with pip, I find it great for most of my packaging needs; if it falls short, I can always fall back to the classic pip:

$ uv venv
$ source .venv/bin/activate
(.venv) $ uv pip install fastapi fasttext-wheel omegaconf uvicorn  # you can also use the requirements.txt file inside the word_vectors directory if you want to have the same version as the repository code

The previous command installs all the libraries we are going to use. Besides FastAPI and FastText, I also chose to install OmegaConf [16], a library I find useful for managing configurations through YAML files, and uvicorn [8] for serving the application.
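For reference, here is a minimal sketch of how OmegaConf handles configurations (the keys mirror the config.yaml file we define later):

from omegaconf import OmegaConf

# Create a configuration programmatically, or load the same structure
# from a YAML file with OmegaConf.load("./config/config.yaml")
config = OmegaConf.create(
    {"model_location": "./models/cc.en.300.bin", "model_precision": 5}
)
print(config.model_location)   # attribute-style access to the keys
print(config.model_precision)  # 5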


The REST API


The REST API is the main application of this project. It comprises the functions and classes that make up the FastAPI module, and it's defined in the main.py file inside the word_vectors directory.


The GlobalConfig Class


The first class is the GlobalConfig class, which is instantiated as the config object:


@dataclass
class GlobalConfig:
    """
    Class to handle the configuration globally.
    The class is instantiated during the startup event of a FastAPI app.
    It is a Python dataclass.

    Attributes
    ----------
        config: DictConfig
            An OmegaConf's DictConfig that is loaded from a YAML file.
            Holds a list of key -> value pairs with the configuration
            for the API.
        model: fasttext.FastText._FastText | None
            The FastText model.
    """
    config: DictConfig = OmegaConf.create()
    model: fasttext.FastText._FastText | None = None


config = GlobalConfig()

This is a global variable, and even though global variables are generally not a good idea, we internally treat this one as a singleton to have access to it throughout the whole application. Since the idea of FastAPI is to facilitate the creation of simple REST API wrappers, we should usually keep everything within the same module, in a single file. If, at some point, we need to manage something more complex than a simple wrapper API, we might be better served by a framework like Flask or Django.

It is a Python data class [17] with two attributes: the configuration, an OmegaConf DictConfig object, and the FastText model. Since we have to instantiate it before we have access to the configuration file and the model, we give it a couple of default values that we will later override in the lifespan function.


The Arguments Class


Next, we have the WordVectorArgs class, a subclass of Pydantic's [18] BaseModel:

class WordVectorArgs(BaseModel):
    """
    Arguments for the FastText's API call.
    It is based on a Pydantic's BaseModel.

    Attributes
    ----------
        text: str
            The text to be transformed by the FastText model.
    """
    text: str

    model_config = {
        "json_schema_extra": {
            "examples": [
                {
                    "text": "deploying with fastapi",
                }
            ]
        }
    }

Pydantic is a data validation library that checks that the data matches its type annotations. FastAPI uses it to deserialize and validate the JSON payload of the HTTP request. In this case, there is a single argument: the text to be transformed into word vectors by the FastText model. The model_config attribute sets an example for the OpenAPI documentation generated by the FastAPI application.
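As a quick illustration of that validation (a sketch, assuming Pydantic v2 and the WordVectorArgs class above in scope), instantiating the model without its required field raises a ValidationError:

from pydantic import ValidationError

args = WordVectorArgs(text="deploying with fastapi")
print(args.text)  # deploying with fastapi

try:
    WordVectorArgs()  # "text" is required, so this raises
except ValidationError as error:
    print(error)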

If you are wondering why I didn't use a Pydantic model for the GlobalConfig class as well, it's because Pydantic doesn't have a built-in way to validate the DictConfig or the FastText model. We would have to write that validation ourselves or allow arbitrary types, which defeats the original purpose; since we handle those objects internally and don't need that validation, it isn't justified.


The Lifespan Function


One thing we have to manage in a FastAPI application is the lifecycle of the models in the application flow. FastAPI offers a solution for this with Lifespan Events, which are defined by an asynchronous function that holds the startup and shutdown logic. The example provided by the FastAPI documentation [19] works well for our scenario:

@asynccontextmanager
async def lifespan(app: FastAPI):
    """
    Lifespan function that runs as startup and shutdown function for FastAPI.
    """
    config_path = os.getenv("FASTTEXT_ML_CONFIG", "./config/config.yaml")
    logger.info(f"Loading configuration for FastText App from {config_path}.")
    config.config = OmegaConf.load(config_path)
    logger.info("Loading models for FastText App.")
    start = time.time()
    config.model = fasttext.load_model(config.config.model_location)
    end = time.time()
    logger.info(f"Models loaded, time elapsed: {end-start}")
    yield
    # The part after the yield is executed before shutdown, if we need to do
    # some memory cleaning or resource liberation

The lifespan function gets the configuration path from an environment variable and loads the configuration from there. The simple configuration file is defined like this:

model_location: ./models/cc.en.300.bin
model_precision: 5

It's simple but good enough for this demonstration: it has the path to the model file and a parameter with the precision to use when returning the word vectors.

Once the configuration is loaded into the config attribute, the FastText model is loaded into the model attribute of the config object. Finally, the function yields, giving control back to the FastAPI application. If we had to manage extra steps before shutting down the application, they would go after the yield instruction.


The FastAPI Object and the Router


After defining the lifespan, we can proceed to instantiate the FastAPI object. Furthermore, we can add a Router object that will enable us to manage the entire application under a subpath from the root. This feature is handy when we need to handle multiple applications in the same environment, such as services in a docker-compose orchestration:

app = FastAPI(
    title="FastText API",
    description="FastText API to generate word and sentence embeddings",
    version="0.1",
    lifespan=lifespan,
    openapi_url="/api/fasttext/docs/openapi.json",
    docs_url="/api/fasttext/docs/",
)
router = APIRouter(prefix="/api/fasttext")

# Route definitions

app.include_router(router)

We will serve our API under the /api/fasttext subpath. We will cover the route definitions in the next sections; they need to be written before the router is included in the FastAPI application. We also define the path to the OpenAPI documentation under the same subpath as the router.
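For instance, here is a sketch of how a second, hypothetical model could live in the same application under its own subpath (it assumes the app and APIRouter objects from the module above):

# Hypothetical second router, e.g., for a sentiment model
other_router = APIRouter(prefix="/api/sentiment")


@other_router.get("/health")
def sentiment_health() -> str:
    # Health endpoint for the second service
    return "ok"


app.include_router(other_router)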


The Health Route


First, we have the health route. Defining it isn't required, but it is good practice: the Docker daemon uses it to check that the container is healthy, i.e., that it's working properly and hasn't crashed.

@router.get("/health")
def health() -> str:
    """
    Lets Docker know that the container is working.

    Returns
    -------
    str
        The "ok" string.
    """
    return "ok"

The return value isn't important; it just returns a 200 response to tell the Docker daemon that the container is currently running.
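Once the application is running (see the development server section below), we can check the endpoint manually:

$ curl http://localhost:3000/api/fasttext/health
"ok"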


The Word Vector Route


This is the main route of our application. It's in charge of applying the model's magic:

@router.post("/word_vectors")
def word_vector(args: WordVectorArgs) -> dict[str, list[float]]:
    """
    Returns the vector for each word in the text.

    Parameters
    ----------
        args: WordVectorArgs
            The args for the POST request (JSON) as part of the payload.

    Returns
    -------
        dict[str, list[float]]
            A mapping between each unique word in the text and its
            corresponding vector.

    Raises
    ------
        HTTPException
            A 400 error if there are no words.
    """
    # Check the input isn't empty
    if len(args.text.strip()) == 0:
        error_message = "The request is empty"
        logger.info(error_message)
        raise HTTPException(status_code=400, detail=error_message)

    output = {}
    for word in sorted(set(args.text.strip().split())):
        word_vector = config.model.get_word_vector(word).tolist()
        if config.config.model_precision is not None:
            word_vector = [
                round(d, config.config.model_precision) for d in word_vector
            ]
        output[word] = word_vector

    return output

The function is called after a POST request to the /api/fasttext/word_vectors path. It takes the WordVectorArgs argument populated with the text attribute. The function itself is simple: it checks that the text isn't empty and returns a 400 error if it is. Otherwise, it returns the vector for each unique word in the text as a dictionary, which the server serializes as a JSON object.
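For the example payload, the response looks roughly like this (the values are illustrative; each vector has 300 dimensions in the Common Crawl model):

{
  "deploying": [0.0012, -0.0345, 0.0567, ...],
  "fastapi": [-0.0213, 0.0098, 0.0441, ...],
  "with": [0.0156, -0.0072, 0.0314, ...]
}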


Running a Development Server


Now that we have the code ready, we can run a simple development server with uvicorn and the --reload flag. From the root of the project:


(.venv) $ uvicorn word_vectors.main:app --host=0.0.0.0 --port=3000 --reload

This will run the uvicorn server and reload it every time there's a change in the code. We can request the vectors with curl:

$ curl -X 'POST' \
  'http://localhost:3000/api/fasttext/word_vectors' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "text": "deploying with fastapi"
}'

We can also visit the FastAPI documentation at http://localhost:3000/api/fasttext/docs, where the example defined before should be ready for testing.
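The same request can be made from Python, for example with the requests package (assuming it is installed, since it isn't part of the project's requirements):

import requests

# POST the text to the running development server
response = requests.post(
    "http://localhost:3000/api/fasttext/word_vectors",
    json={"text": "deploying with fastapi"},
)
vectors = response.json()
print(sorted(vectors))           # ['deploying', 'fastapi', 'with']
print(len(vectors["fastapi"]))   # 300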


Deploying with Docker


After our application is working in a development environment, the next step is to build a Docker image to host it and a docker-compose file to orchestrate it.


The Dockerfile


In the application directory, word_vectors, there is a Dockerfile that builds an image for hosting the word_vectors application on top of the official Python 3.10 image. This is an extract of its most essential parts:

FROM python:3.10-bullseye

# General Installations [...]

# Install uv
ENV VIRTUAL_ENV=/usr/local
ADD --chmod=755 https://astral.sh/uv/install.sh /install.sh
RUN /install.sh && rm /install.sh

RUN mkdir /word_vectors

WORKDIR /word_vectors

COPY ./word_vectors/requirements.txt requirements.txt

RUN /root/.cargo/bin/uv pip install --system --no-cache-dir --upgrade pip setuptools wheel
RUN /root/.cargo/bin/uv pip install --system --no-cache-dir --upgrade -r requirements.txt

# Cleaning [...]

First, the file installs all the OS libraries necessary to install the Python packages we need (skipped in the extract). Then, it installs the requirements using uv [15], which, as we saw before, saves a lot of time thanks to its fast dependency resolution. Finally, also skipped in the extract, all the cached files are cleaned up to keep the image small.
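Although the docker-compose orchestration in the next section handles the build, the image can also be built on its own from the root of the project:

$ docker build -f word_vectors/Dockerfile -t word_vectors .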


The docker-compose.yml File


The other important file in this process is the docker-compose.yml that will run the orchestration:

version: '3.7'

services:
  word_vectors:
    build:
      context: .
      dockerfile: ./word_vectors/Dockerfile
    container_name: word_vectors
    command: uvicorn word_vectors.main:app --host=0.0.0.0 --port=3000
    expose:
      - 3000
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:5000/api/fasttext/health"]
      interval: 3m
      timeout: 30s
      retries: 3
      start_period: 3m
    restart: always
    volumes:
      - ./:/word_vectors
  nginx:
    image: nginx:latest
    container_name: word_vectors_nginx
    ports:
      - 8080:8080
    volumes:
      - ./config/nginx:/etc/nginx/conf.d
      - ./logs:/var/log/nginx
    depends_on:
      - word_vectors

There are two services in this file: the application and nginx, which will be our reverse proxy. Why use nginx here? For this particular example, there isn't much to gain from adding it. But suppose we want to have two very different services revolving around REST APIs for models. There are different ways to serve them: sometimes we can serve them from the same FastAPI application, but more often than not we need to serve them as separate applications. Perhaps one is more complex and requires a Flask app, or we may need a Django app for the UI that queries the models on the backend. For all these scenarios, the best option is to manage them behind an nginx reverse proxy.

The word_vectors service builds from the Dockerfile we defined before, internally exposes port 3000 where uvicorn is listening, and runs as its command the same uvicorn invocation we used before (without the reload option). It also has a health check that pings the health endpoint we created, and it mounts the root of the project as a volume.
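Once the orchestration is running (see Building and Deploying below), we can verify that the health check passes after the start period:

$ docker inspect --format='{{.State.Health.Status}}' word_vectors
healthy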


The Nginx Service


The nginx service pulls the latest nginx image, maps port 8080 to the container's port 8080, and mounts two volumes: the logs directory, created in the root of the project to store all of nginx's access and error logs, and our nginx configuration for the Docker container, mapped to nginx's default configuration directory:

upstream word_vectors {
    ip_hash;
    server word_vectors:3000;
}

server {
    location /api/fasttext/ {
        proxy_pass http://word_vectors/api/fasttext/;
    }

    listen 8080;
    server_name localhost;
    client_max_body_size 50M;
}

A simple configuration file that reverse proxies requests to the word_vectors application. The ip_hash directive routes requests from the same client IP to the same upstream server, which becomes relevant if the service is ever scaled to multiple replicas.


Building and Deploying


The final step is to build and deploy the model:

$ docker compose build
$ docker compose up -d

We can check the logs for the word_vectors application:

$ docker compose logs -f word_vectors

We can follow the Nginx access and error logs in the logs directory created when mounting the volumes of the Nginx service.

To access the documentation and run the same example from before, we go to http://localhost:8080/api/fasttext/docs/ or the port we set up in the docker-compose file.
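We can also run the same request from before through the nginx proxy:

$ curl -X 'POST' \
  'http://localhost:8080/api/fasttext/word_vectors' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{"text": "deploying with fastapi"}'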


Final Remarks


In this article, we explained how to set up and deploy almost any machine learning model with the help of FastAPI and Docker. The code is available on GitHub under the MIT license for you to download and share. Thanks for reading, and I hope this article gave you the basic setup you need to deploy your own models.


References

[1] FastAPI framework, high performance, easy to learn, fast to code, ready for production. https://fastapi.tiangolo.com/

[2] Docker. Accelerate how you build, share, and run applications. https://www.docker.com/

[3] Django. The web framework for perfectionists with deadlines. https://www.djangoproject.com/

[4] Django Rest Framework. https://www.django-rest-framework.org/
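
[5] Flask. A simple framework for building complex web applications. https://flask.palletsprojects.com/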

[6] flask-smorest: Flask/Marshmallow based REST API framework. https://flask-smorest.readthedocs.io/en/latest/

[8] uvicorn. An ASGI web server, for Python. https://www.uvicorn.org/

[9] nginx. HTTP and reverse proxy server. https://nginx.org/en/

[11] Cardellino, C. (2024). "Practical Machine Learning for Industry". https://www.transcendent-ai.com/post/practical-machine-learning-for-industry
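
[12] Cardellino, C. basic-ml-deploy. GitHub repository. https://github.com/crscardellino/basic-ml-deploy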

[13] fastText. Library for efficient text classification and representation learning. https://fasttext.cc/

[14] Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., Joulin, A. (2018). Advances in Pre-Training Distributed Word Representations. Proceedings of LREC 2018.

[15] Marsh, C. (2024). uv: Python packaging in Rust. https://astral.sh/blog/uv
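
[16] OmegaConf. A flexible, hierarchical configuration system for Python. https://omegaconf.readthedocs.io/

[17] Python documentation. dataclasses — Data Classes. https://docs.python.org/3/library/dataclasses.html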

[18] Pydantic. Data validation tool. https://docs.pydantic.dev/latest/

[19] FastAPI. Lifespan Events. https://fastapi.tiangolo.com/advanced/events/
