
Introduction to Reinforcement Learning

Updated: Jun 13

Machine learning (ML) enables computers to learn from data and make intelligent decisions, and it drives innovation across many sectors. Among the various approaches in ML, reinforcement learning (RL) stands out because it learns which actions to take by interacting with an environment. In this article, we explore the principles and mechanisms of reinforcement learning and point to some of its real-world applications.

ML Taxonomies

First, it is essential to understand how RL fits into ML techniques. One way to categorize them is by how they learn. The primary types are supervised, unsupervised, and reinforcement learning, each with its own learning approach. Let’s summarize them to better understand their differences.

Machine Learning Taxonomies. Source [5].

Supervised Learning

Supervised learning is the most common type. It involves training a model on a labeled dataset, meaning that each training example is paired with an output label. The goal is for the model to learn a mapping from inputs to outputs that can be used to predict the labels of new, unseen data. Within supervised learning, there are two main subcategories: classification and regression.


Classification

In classification, the model learns to predict discrete labels. Some examples include:

  • Image Classification: This involves assigning a label to an image from a set of predefined categories. For instance, determining whether an image contains a cat or a dog.

  • Diagnosis: In the medical field, classification models can be used to diagnose diseases based on patient data and medical imaging.

  • Fraud Detection: Financial institutions use classification models to identify fraudulent transactions by analyzing patterns in transaction data.

A representation of a neural network for classification


Regression

Regression, on the other hand, is used for predicting continuous values. It aims to predict a numerical value based on input features. Common applications include:

  • Weather Prediction: Using historical weather data to forecast future weather conditions.

  • Market Prediction: Predicting stock prices or market trends based on historical financial data.

  • Population Growth: Estimating future population growth based on current demographic data.

Regression example over synthetic data
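To make regression concrete, here is a minimal sketch that fits a straight line to synthetic data with ordinary least squares. The data (a line y = 2x + 1 with a little noise) and the helper name fit_line are assumptions made up for this illustration, not the data behind the figure above.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic data (assumption): y = 2x + 1 plus small Gaussian noise.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.1) for x in xs]

slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The fitted slope and intercept should land close to the values used to generate the data, which is exactly the "learn a mapping from labeled examples" idea of supervised learning.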

Unsupervised Learning

Unsupervised learning involves training a model on data without labeled responses. The model tries to learn the underlying structure of the data, which makes this type of learning useful when the aim is to explore the data and identify patterns rather than to make specific predictions. Key techniques in unsupervised learning include clustering and dimensionality reduction.


Clustering

Clustering is the task of grouping a set of objects so that objects in the same group are more similar to each other than to those in other groups. Examples:

  • Recommendation Systems: Clustering can be used to group similar items together, enhancing the effectiveness of recommendation algorithms.

  • Customer Segmentation: Businesses use clustering to segment customers into different groups based on purchasing behavior, which helps in targeted marketing strategies.

Clustering example with synthetic data.
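As a rough illustration of clustering, the sketch below runs k-means (Lloyd's algorithm) on one-dimensional data. The customer spend figures, the starting centers, and the function name kmeans_1d are all hypothetical; real segmentation would use more features and a library implementation.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centers."""
    centers = list(centers)
    groups = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: each center moves to the mean of its group.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical monthly spend of eight customers (made-up numbers).
spend = [12, 15, 14, 11, 95, 102, 98, 110]
centers, groups = kmeans_1d(spend, centers=[0, 50])
print(centers)
print(groups)
```

No labels are given: the two groups (low spenders and high spenders) emerge only from how close the values are to each other.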

Dimensionality Reduction

Dimensionality reduction reduces the number of variables under consideration by deriving a smaller set of principal variables that retains most of the information. This is useful for visualization and noise reduction. Applications include:

  • Data Visualization: Reducing data dimensions to visualize complex datasets in two or three dimensions.

  • Data Structure Exploration: Simplifying data to explore its structure and underlying patterns more easily.

  • Noise Elimination: Removing noise from data to improve the performance of machine learning models.

A world map is a dimensionality reduction of the Earth
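In the same spirit as the map, the following sketch reduces 2-D points to a single coordinate by projecting them onto their direction of greatest variance (the first principal component), found here with power iteration. The sample points and function names are assumptions for illustration only.

```python
import math

def first_principal_axis(points, iterations=100):
    """Return the mean and the unit direction of greatest variance for 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2x2 covariance matrix.
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Power iteration: multiply by the covariance matrix and renormalize.
    vx, vy = 1.0, 1.0
    for _ in range(iterations):
        vx, vy = sxx * vx + sxy * vy, sxy * vx + syy * vy
        norm = math.hypot(vx, vy)
        vx, vy = vx / norm, vy / norm
    return (mx, my), (vx, vy)

def project(points, mean, axis):
    """Map each 2-D point to one coordinate along the axis."""
    return [(p[0] - mean[0]) * axis[0] + (p[1] - mean[1]) * axis[1] for p in points]

# Made-up points lying roughly along the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
mean, axis = first_principal_axis(points)
coords = project(points, mean, axis)
print(axis)
print(coords)
```

Each point is now described by one number instead of two, and because the points were nearly collinear, very little information is lost.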

Reinforcement Learning

Finally, the star of the show: reinforcement learning trains agents that learn to make decisions by performing actions in an environment so as to maximize some notion of cumulative reward. Unlike supervised and unsupervised learning, reinforcement learning focuses on learning from the consequences of actions rather than from a fixed dataset. Some use cases include:

  • Real-Time Decision Making: It can be used in scenarios where decisions need to be made in real-time, such as autonomous driving.

  • Gaming: It's widely used in developing game-playing AI that can learn and improve strategies over time.

  • Skill Acquisition: Automated systems use RL to acquire new skills through trial and error.

  • Robotics: It enables robots to learn complex tasks by interacting with their environment and optimizing their actions based on feedback.

Put simply, supervised and unsupervised learning resemble learning by studying, while reinforcement learning resembles learning from experience. Let’s dive deeper into RL to see why.

How Reinforcement Learning works

Let's begin by understanding the fundamental components involved in reinforcement learning. 


The agent is the decision-maker. It could be anything from a robot navigating a room to a software program playing a game. The agent has the following key characteristics:

  • Sensors to Observe the Environment: The agent is equipped with sensors that allow it to perceive the state of its surroundings. These sensors gather information that is crucial for the agent to understand its current situation and make informed decisions.

  • Capability to Perform Actions: The agent can take various actions that influence the state of its environment. These actions are the primary means through which the agent interacts with and alters its surroundings.

  • Immediate Numerical Rewards: Each action taken by the agent results in an immediate numerical reward. This reward serves as feedback, indicating how beneficial or detrimental the action was in achieving the desired outcome. The objective of the agent is to learn a policy, which is a strategy for choosing actions that maximize the cumulative reward over time.

By continually interacting with its environment, receiving feedback in the form of rewards, and adjusting its actions accordingly, the agent gradually improves its decision-making strategy. This process, fundamental to reinforcement learning, allows the agent to develop sophisticated behaviors that can solve complex problems in dynamic and uncertain environments.

How the agent interacts with the environment
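The interaction loop described above (observe, act, receive a reward, repeat) can be sketched in a few lines. The CorridorEnv class, its reset/step methods, and both policies are hypothetical names invented for this example; they do not come from any particular library.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: cells 0..4 in a row, goal at cell 4."""

    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        """Apply action (-1 = left, +1 = right); return (state, reward, done)."""
        self.position = min(4, max(0, self.position + action))
        done = self.position == 4
        reward = 1.0 if done else 0.0
        return self.position, reward, done

def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total reward collected."""
    state = env.reset()                          # observe the initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # the agent decides
        state, reward, done = env.step(action)   # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward

rng = random.Random(0)

def random_policy(state):
    return rng.choice([-1, 1])

def always_right(state):
    return 1

print(run_episode(CorridorEnv(), always_right))
print(run_episode(CorridorEnv(), random_policy))
```

Note that the reward only arrives when the goal is reached, so every earlier step returns 0: a small instance of delayed feedback.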

The primary task of an agent in reinforcement learning is to learn an effective strategy or policy that maximizes cumulative rewards over time. This policy guides the agent in selecting actions based on the current state of the environment. We can think of the strategy as a function defined as follows:

𝜋: S → A
Strategy formulation

Rewards often come with delays. For example, winning a game provides a reward only at the end, not for each move. This delay in rewards means the agent must learn to discern which actions in a sequence were beneficial and contributed to the ultimate reward. This task is challenging because the agent must trace back from the final outcome to identify and reinforce the key actions that led to success. The agent must effectively attribute the delayed rewards to the correct actions taken throughout the process. This is known as the credit assignment problem [1]. 

Another crucial aspect of reinforcement learning is exploration. Unlike supervised learning, where the training data is fixed, in reinforcement learning the training data is determined by the actions the agent chooses. This means the agent must explore a variety of actions to gather sufficient data about their consequences. Exploration is essential for discovering new strategies and improving the agent's performance. A common framework for formalizing these challenges is the Markov Decision Process (MDP), which is explained in the next section.

Markov Decision Process

MDP is a way to model the environment in which the agent operates. It provides a formal structure for defining the problem and consists of the following components:

Set of States (S)

The set of states, denoted as S, represents all possible situations or configurations that the environment can be in. Each state provides the agent with the necessary context to make decisions. For example, in a robotic navigation task, a state might include the robot's position and the locations of obstacles.

Set of Actions (A)

The set of actions, denoted as A, includes all possible actions the agent can take. These actions are the means by which the agent interacts with and influences the environment. For instance, in a game, actions might include moves such as "up," "down," "left," and "right."

Transition Function (T)

The transition function, denoted as T, describes how the environment changes in response to the agent's actions. Formally, it is a mapping from the current state and action to a probability distribution over the next states:

 T: S × A × S → [0, 1], where T(s, a, s') = P(s' | s, a)

This function captures the dynamics of the environment, indicating the likelihood of reaching a particular next state given the current state and action.
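One simple way to picture such a transition function is as a lookup table from (state, action) pairs to a probability distribution over next states. The two states, two actions, and probabilities below are invented purely for illustration.

```python
import random

# Hypothetical example: T[(state, action)] maps each next state to its probability.
T = {
    ("A", "go"):   {"B": 0.8, "A": 0.2},  # the move succeeds 80% of the time
    ("A", "stay"): {"A": 1.0},
    ("B", "go"):   {"A": 0.8, "B": 0.2},
    ("B", "stay"): {"B": 1.0},
}

def sample_next_state(state, action, rng):
    """Draw the next state from the distribution T assigns to (state, action)."""
    distribution = T[(state, action)]
    next_states = list(distribution)
    weights = [distribution[s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]

# Every row of T must be a valid probability distribution.
for distribution in T.values():
    assert abs(sum(distribution.values()) - 1.0) < 1e-9

rng = random.Random(0)
print([sample_next_state("A", "go", rng) for _ in range(10)])
```

A deterministic environment is the special case where each distribution puts probability 1 on a single next state.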

Reward Function (R)

The reward function, denoted as R, assigns a numerical reward to each state-action pair: 

R: S × A → ℝ

This function provides immediate feedback to the agent about the desirability of taking a specific action in a given state. The goal of the agent is to maximize the cumulative reward over time, guiding it towards more favorable actions.

By defining an environment as an MDP, reinforcement learning algorithms can systematically approach the problem of learning optimal policies. This structured approach allows the agent to understand the consequences of its actions, learn from feedback, and improve its decision-making capabilities. In the next sections, we will explore how these components come together to enable the agent to learn and adapt through the reinforcement learning process.

Grid World

To illustrate how reinforcement learning operates within the framework of an MDP, consider a simple example known as Grid World. This example helps to clarify the concepts by providing a tangible scenario in which an agent interacts with its environment.

The environment is represented as a grid with a finite number of states and actions. For simplicity, let's consider a Grid World with 6 states arranged in a 2x3 grid. Each cell in the grid represents a state the agent can occupy.


Set of States

The set of states S in this Grid World consists of the 6 distinct positions the agent can occupy within the grid.


Set of Actions

The set of actions A includes the possible moves the agent can make: left (←), up (↑), right (→), down (↓). However, not all actions are available in every state. For example, if the agent is in a state on the edge of the grid, it cannot move beyond the boundary.

  • From the top-left corner, moving up or left is not possible.

  • From the bottom-right corner, moving down or right is not possible.
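The states and the boundary rule for this 2x3 grid can be sketched as follows. Representing a state as a (row, column) tuple with (0, 0) in the top-left corner is an assumption of this example, as are the names STATES, MOVES, and available_actions.

```python
ROWS, COLS = 2, 3

# States are (row, column) pairs; (0, 0) is the top-left corner.
STATES = [(r, c) for r in range(ROWS) for c in range(COLS)]

# Each action is a (row change, column change) offset.
MOVES = {"left": (0, -1), "up": (-1, 0), "right": (0, 1), "down": (1, 0)}

def available_actions(state):
    """Return the actions that keep the agent inside the grid."""
    row, col = state
    return [
        name
        for name, (d_row, d_col) in MOVES.items()
        if 0 <= row + d_row < ROWS and 0 <= col + d_col < COLS
    ]

print(len(STATES))
print(available_actions((0, 0)))  # top-left corner
print(available_actions((1, 2)))  # bottom-right corner
```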

Transition Function

The transition function T defines the rules for moving between states. For example, if the agent is in the middle of the grid and takes the action to move right (→), the transition function would update the agent's state to the cell immediately to the right.

Reward Function

The reward function R assigns values to state-action pairs, guiding the agent towards certain behaviors. For instance, moving towards a goal state might yield a positive reward, while moving into an obstacle or boundary might yield a negative reward or zero reward.  
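A minimal sketch of a deterministic transition function and a reward function for the 2x3 grid is shown below. The goal position (top-right cell), the reward values (+1, -1, 0), and the choice to leave the agent in place when it hits a boundary are all assumptions for illustration; the article does not fix them.

```python
ROWS, COLS = 2, 3
GOAL = (0, 2)  # assumption: the goal is the top-right cell
MOVES = {"left": (0, -1), "up": (-1, 0), "right": (0, 1), "down": (1, 0)}

def transition(state, action):
    """Deterministic T: the neighboring cell, or the same cell if the move leaves the grid."""
    d_row, d_col = MOVES[action]
    row, col = state[0] + d_row, state[1] + d_col
    if 0 <= row < ROWS and 0 <= col < COLS:
        return (row, col)
    return state

def reward(state, action):
    """R(s, a): +1 for stepping into the goal, -1 for hitting a boundary, 0 otherwise."""
    next_state = transition(state, action)
    if next_state == state:
        return -1.0
    if next_state == GOAL:
        return 1.0
    return 0.0

print(transition((0, 1), "right"), reward((0, 1), "right"))
print(transition((0, 0), "left"), reward((0, 0), "left"))
```

Because this grid is deterministic, the transition function returns a single next state instead of a probability distribution.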


Policy

A policy, denoted as 𝜋, is a strategy that the agent uses to decide which actions to take in different states. Formally, it is a mapping from states to actions:

𝜋: S → A

It determines the agent's behavior in each state.

Value Function

The value function, denoted as V𝜋(s), represents the expected cumulative reward the agent can achieve starting from state s and following policy 𝜋 thereafter. It is defined as:

Value function definition
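Written out in the standard discounted form (an assumption here, chosen to be consistent with the discount factor γ and the reward function R used in this article), the definition reads:

```latex
V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} \, R\bigl(s_t, \pi(s_t)\bigr) \;\middle|\; s_0 = s \right]
```

Here s_t is the state reached after t steps when the agent starts in s and always picks the action 𝜋(s_t); the expectation averages over the randomness of the transitions.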

The discount factor, denoted as γ, determines the importance of future rewards. It satisfies 0 ≤ γ < 1. A value close to 0 places more emphasis on immediate rewards, meaning the agent prioritizes short-term gains. Conversely, a value close to 1 gives more weight to future rewards, encouraging the agent to consider long-term benefits. The discount factor thus balances the agent's focus between short-term and long-term rewards, shaping its strategy and decision-making process.
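A quick calculation shows the effect of the discount factor. In this sketch the reward sequence is made up: nothing for three steps, then a reward of 10 on the fourth step.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical episode: the only reward (10) arrives on the fourth step.
rewards = [0, 0, 0, 10]

for gamma in (0.1, 0.5, 0.9):
    print(gamma, round(discounted_return(rewards, gamma), 4))
```

With γ = 0.1 the delayed reward is worth about 0.01, with γ = 0.5 it is worth 1.25, and with γ = 0.9 about 7.29, so a short-sighted agent barely notices a reward that a far-sighted one still values highly.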

In the following figure, you can see a specific case of this Grid World. The strategy was generated randomly, and the value shown in each cell of the value function grid is the expected cumulative discounted reward obtained if you start in that cell and follow the strategy 𝜋. Each value was calculated through the value function.

Grid World example
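Values like the ones in the figure can be computed by repeatedly applying the value function definition until the numbers stop changing (iterative policy evaluation). The sketch below does this for a hand-picked deterministic policy; the goal cell, the +1 reward for entering it, γ = 0.9, and the policy itself are assumptions and do not reproduce the figure.

```python
GAMMA = 0.9
GOAL = (0, 2)  # assumption: terminal goal in the top-right cell
MOVES = {"left": (0, -1), "up": (-1, 0), "right": (0, 1), "down": (1, 0)}

# Hypothetical deterministic policy for the five non-goal cells.
policy = {
    (0, 0): "right",
    (0, 1): "right",
    (1, 0): "up",
    (1, 1): "right",
    (1, 2): "up",
}

def step(state, action):
    """Move one cell; entering the goal pays +1, every other move pays 0."""
    d_row, d_col = MOVES[action]
    next_state = (state[0] + d_row, state[1] + d_col)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def evaluate_policy(policy, sweeps=50):
    """Iterative policy evaluation: V(s) <- r + gamma * V(s') under the policy."""
    values = {state: 0.0 for state in policy}
    values[GOAL] = 0.0  # no further reward once the goal is reached
    for _ in range(sweeps):
        for state, action in policy.items():
            next_state, reward = step(state, action)
            values[state] = reward + GAMMA * values[next_state]
    return values

values = evaluate_policy(policy)
for state in sorted(values):
    print(state, round(values[state], 3))
```

Cells one step from the goal get a value of 1, cells two steps away get 0.9, and the bottom-left cell, three steps away under this policy, gets 0.81: each extra step costs one factor of γ.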

The Agent's Objective

The agent's primary goal is to learn the optimal value function V*, which results from following the optimal policy 𝜋*. This optimal value function represents the highest possible expected cumulative reward that the agent can achieve starting from any given state and following the best possible strategy.

To learn the optimal policy, the agent also utilizes the action-value function, denoted as Q(s, a). This function represents the maximum expected gain (or reward) achievable starting from state s and taking action a, followed by the optimal policy 𝜋* thereafter. In the next figure you can see how these functions are formally defined:

Optimal value function and policy definitions along with function Q definition.
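In standard notation, and as an assumption about what the figure shows, these definitions can be written as follows, where P(s' | s, a) is the probability that the transition function T assigns to next state s':

```latex
\begin{aligned}
V^{*}(s) &= \max_{\pi} V^{\pi}(s) \\
\pi^{*} &= \arg\max_{\pi} V^{\pi}(s) \quad \text{for all } s \in S \\
Q(s, a) &= R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, V^{*}(s') \\
V^{*}(s) &= \max_{a \in A} Q(s, a)
\end{aligned}
```

The last line is what makes Q useful: once Q is known, the best value of a state is simply the best Q-value among its actions.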

By learning Q(s, a), the agent can determine the best action to take in any given state to maximize its cumulative reward.

Q-Learning Algorithm

Q-Learning is a popular model-free algorithm: the agent does not need to know the transition function T or the reward function R in advance and learns directly from experience. It aims to learn the optimal action-value function Q(s, a), from which the optimal policy follows. Here’s how the algorithm works:

Initialize Q(s, a) arbitrarily (e.g., all zeros)
Repeat (for each episode):
    Initialize state s
    Repeat (until state s is terminal):
        Choose action a in state s (e.g., ε-greedy with respect to Q)
        Execute action a
        Observe reward r and new state s'
        # Update Q-value
        Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)]
        # Update state
        s ← s'
# Define policy
π(s) = argmax_a Q(s, a)


Where:

-- α is the learning rate (0 < α ≤ 1), which determines how much new information overrides the old estimate.

-- γ is the discount factor, which determines how much future rewards count.

-- max_a' Q(s', a') is the largest estimated Q-value available in the next state s'.
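The pseudocode can be turned into a small runnable sketch on the 2x3 Grid World. Several details here are assumptions rather than part of the algorithm as stated: the goal in the top-right cell with reward +1, episodes starting in the bottom-left cell, moves into a boundary leaving the agent in place, ε-greedy action selection, and the particular values of α, γ, and ε.

```python
import random

ROWS, COLS = 2, 3
GOAL = (0, 2)  # assumption: terminal goal in the top-right cell
MOVES = {"left": (0, -1), "up": (-1, 0), "right": (0, 1), "down": (1, 0)}
STATES = [(r, c) for r in range(ROWS) for c in range(COLS)]

def step(state, action):
    """Deterministic move; hitting a boundary leaves the state unchanged."""
    d_row, d_col = MOVES[action]
    row, col = state[0] + d_row, state[1] + d_col
    next_state = (row, col) if 0 <= row < ROWS and 0 <= col < COLS else state
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in MOVES}  # initialize Q(s, a)
    for _ in range(episodes):
        state = (1, 0)  # every episode starts in the bottom-left cell
        while state != GOAL:
            if rng.random() < epsilon:
                action = rng.choice(list(MOVES))  # explore
            else:
                # Exploit the best-known action, breaking ties at random.
                best = max(q[(state, a)] for a in MOVES)
                action = rng.choice([a for a in MOVES if q[(state, a)] == best])
            next_state, reward = step(state, action)
            # Q-learning update; the goal's Q-values stay 0, so it adds nothing.
            best_next = max(q[(next_state, a)] for a in MOVES)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q = q_learning()
# Greedy policy: in each non-goal state pick the action with the largest Q-value.
policy = {s: max(MOVES, key=lambda a: q[(s, a)]) for s in STATES if s != GOAL}
for state in sorted(policy):
    print(state, policy[state])
```

The agent never reads the transition or reward rules directly; it only calls step and learns from the (state, action, reward, next state) samples it experiences, which is what "model-free" means in practice.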

Policy Derivation

Define Optimal Policy 𝜋: Once the Q-values are learned, derive the optimal policy 𝜋(s) by selecting the action that maximizes the Q-value for each state s:

𝜋(s) = argmax_a Q(s, a)
Deriving the optimal policy

In summary, this algorithm enables the agent to learn the optimal policy by iteratively updating the Q-values based on the rewards received and the expected future rewards. By following this process, the agent can effectively navigate its environment, learning which actions yield the highest cumulative rewards and ultimately achieving optimal performance in its tasks.

Finally, if you want a practical example to play with, you will find a very good one here [4]. By changing the parameters α, γ, and ε (epsilon, which controls how often the agent explores instead of following the best-known action), you will get different Q-Learning outputs!
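To get a feel for epsilon before trying the interactive example, here is a minimal sketch of ε-greedy action selection, a common way to implement this trade-off (whether the linked example uses exactly this rule is an assumption). The Q-values below are made up.

```python
import random

def epsilon_greedy(action_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-known one."""
    actions = list(action_values)
    if rng.random() < epsilon:
        return rng.choice(actions)                       # explore
    return max(actions, key=lambda a: action_values[a])  # exploit

# Hypothetical Q-values for a single state (made-up numbers).
action_values = {"left": 0.2, "up": 0.0, "right": 0.8, "down": 0.1}
rng = random.Random(0)

for epsilon in (0.0, 0.1, 0.5):
    picks = [epsilon_greedy(action_values, epsilon, rng) for _ in range(1000)]
    share = picks.count("right") / len(picks)
    print(f"epsilon={epsilon}: chose 'right' in {share:.0%} of 1000 picks")
```

With ε = 0 the agent always takes the best-known action; as ε grows, it spends more of its choices trying the alternatives.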


Conclusion

In this article, we have provided an introductory overview of reinforcement learning, explaining its fundamental concepts and illustrating them with the Grid World example. We discussed the importance of learning optimal policies and value functions and introduced Q-Learning, one of the key algorithms. However, this is just the beginning. To fully appreciate the depth and power of RL, it is essential to delve deeper into the exploration-exploitation dilemma [3], which concerns how an agent balances exploring new actions against exploiting actions it already knows to be rewarding. Additionally, contemporary RL has advanced significantly with the integration of deep learning, resulting in more robust and scalable implementations capable of solving highly complex problems. Future articles will cover these aspects and show how deep reinforcement learning is being applied in various fields.


References

[1] Minsky, M. (1961). Steps toward artificial intelligence. Proceedings of the IRE, 49(1), 8-30.

[2] Gimelfarb, M., Sanner, S., & Lee, C. G. (2020). ε-BMC: A Bayesian ensemble approach to epsilon-greedy exploration in model-free reinforcement learning. arXiv preprint arXiv:2007.00869.

[5] Maru, M., & Swarndeep, S. (2019). A Novel Approach for Improving Breast Cancer Risk Prediction using Machine Learning Algorithms: A Survey.

