Introduction to Reinforcement Learning — Part 1

Adesh Gautam
8 min read · Jun 22, 2018


In this series, we’ll see what reinforcement learning is, how it differs from other kinds of learning, the components of RL and more. I hope you’ll enjoy it along the way. So let’s start!

(GIF: an AI agent playing Breakout. This is amazing!)

What is Reinforcement Learning (RL)?

Reinforcement learning is the science of decision making.

You have probably enjoyed playing games like Call of Duty, Battlefield etc. and even some classics like Mario and Pac-Man. While playing games, we repeat the same loop over and over:

Observe the situation → Take an action/decision → Evaluate whether the action was good or bad.

This repetition helps us build a memory of which actions were good and which were bad, and that is how we get better and better at the game. And not only in games: this loop helps us with pretty much everything, like driving, playing football, or a baby learning to walk.

Reinforcement learning is used to solve reward-based problems. The agent learns by trial and error and tries to collect the maximum possible reward by performing actions in the environment, as the sketch below shows.
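To make this loop concrete, here is a minimal sketch in Python. The ToyEnv environment and the random action choice are made up purely for illustration; they are not from any particular RL library.

```python
import random

class ToyEnv:
    """A made-up 1-D world: start at 0, reach +3 for reward +10, fall to -3 for -10."""

    def reset(self):
        self.position = 0
        return self.position                      # initial observation

    def step(self, action):                       # action is -1 (left) or +1 (right)
        self.position += action
        if self.position == 3:
            return self.position, +10, True       # observation, reward, episode done
        if self.position == -3:
            return self.position, -10, True
        return self.position, -1, False           # small penalty for every extra step

env = ToyEnv()
observation = env.reset()
total_reward, done = 0, False
while not done:
    action = random.choice([-1, +1])              # observe -> act (randomly, for now)
    observation, reward, done = env.step(action)  # the environment responds
    total_reward += reward                        # evaluate: accumulate the reward signal
print("Total reward this episode:", total_reward)
```

A real agent would replace the random choice with something learned from experience; the loop itself stays the same.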

RL is the intersection of various fields :

  1. Engineering: Optimal Control (finding the optimal way to control a system).
  2. Mathematics: Operations Research (deals with the application of advanced analytical methods to help make better decisions).
  3. Economics: Game Theory, Utility Theory etc.
  4. Neuroscience: The human brain has a remarkable ability for adapting to environmental changes.
Source: David Silver’s UCL course on RL

RL vs Supervised and Unsupervised learning

In supervised learning, the training data and the labels are given, and the aim is to learn the mapping from the data to the labels. In unsupervised learning, only the training data is given, and the aim is to model its underlying structure and learn more about the data. But in RL:

  1. There is no supervisor. Learning is based on trial and error.
  2. There is a feedback signal: the reward (which can be instantaneous or delayed).
  3. The learning process is sequential, that is, time matters.
  4. The agent’s actions influence the data it sees next, so the data keeps changing.
  5. Learning by doing.

RL maps situations to actions so as to maximise a scalar reward.

The RL problem

We’ll now go through the fundamentals that build up the RL problem.

Rewards: The reward is a scalar feedback signal (a single number) that tells us how well the agent is doing, and the agent tries to maximise the total reward accumulated over time.

Examples:

i) In a shooting game, the score changes by +1 when you hit an enemy, +10 when an enemy dies, -1 when you get shot and -10 when you die.

ii) A dog receives a reward from a trainer for completing a certain task and no reward for failing to do so.

Rewards can be positive, negative or 0.
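As a tiny illustration, here is how the shooter example’s rewards could be tallied in Python. The event names and numbers are made up to match the example above.

```python
# Hypothetical reward scheme mirroring the shooter example above (numbers are illustrative).
REWARDS = {"hit_enemy": +1, "enemy_dead": +10, "got_shot": -1, "died": -10}

episode_events = ["hit_enemy", "hit_enemy", "enemy_dead", "got_shot", "died"]
total_reward = sum(REWARDS[event] for event in episode_events)
print(total_reward)  # 1 + 1 + 10 - 1 - 10 = 1
```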

Goals and actions: The goal of a reinforcement learning problem is to maximise the total reward by taking actions according to a policy. Rewards are time dependent: the goal could be to collect immediate rewards, or to sacrifice immediate rewards in order to reach a long-term goal.

For example: an agent playing a long-term strategy in chess, sacrificing some pieces in order to checkmate the opponent.

History: The history is the sequence of observations, actions and rewards seen so far. In a game, everything that has happened up to the present moment is the history.

H(t) = A(1), O(1), R(1),……, A(t), O(t), R(t)
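One simple way to represent such a history in code is a list of (action, observation, reward) tuples, mirroring the notation above. The values below are made up.

```python
# The history H(t) as a growing list of (action, observation, reward) tuples
# (the actions, observations and rewards here are invented for illustration).
history = []

def record(action, observation, reward):
    history.append((action, observation, reward))

record("right", 1, -1)
record("right", 2, -1)
record("right", 3, +10)
print(history)   # everything the agent has experienced up to time t
```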

State: The state is a summary of the information used to determine what happens next, that is, the current situation in the environment. For example, a frame of a Pac-Man game is a state: it describes the opponents’ locations, one’s own location, the locations of the rewards and so on.

Agent and environment state:

The agent’s state is internal to the agent. The information in the agent’s state is what RL algorithms use to pick actions.

The environment receives an action from the agent and emits an observation and a reward back to it. The environment’s state may or may not be visible to the agent, and even when it is, much of it may be useless to the agent.

Markov state:

A Markov state satisfies the Markov property: the future depends only on the present, not on the past. Only the current state is relevant for what happens next. For example, if you are running in a field, whether your foot hits a stone and you fall down, or you keep increasing your pace, depends on your current state (attention on the field, current running form etc.), not on the past.
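A toy illustration of the Markov property in Python: in the random walk below, the distribution of the next state depends only on the current state, never on the path taken to reach it. The walk itself is made up for this example.

```python
import random

def next_state(state):
    # The next state depends only on the current state: this is the Markov property.
    return state + random.choice([-1, +1])

state, path = 0, [0]
for _ in range(10):
    state = next_state(state)     # the earlier part of `path` is irrelevant here
    path.append(state)
print(path)
```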

(Image: a frame from Mario’s gameplay, the current state)

You can understand these terms using the gameplay of Mario:

  1. The whole world of the Mario game (pipes, bricks, boxes with points etc.) is the environment.
  2. A state is a part of the environment, which can be the frame above.
  3. The agent is Mario.
  4. The actions are jump, sit, left, right and shoot.
  5. Rewards can be the mushrooms that make Mario bigger and give him the power to shoot; coins are immediate rewards.
  6. The short-term goal is to reach the end of a world, and the long-term one is to save the princess from the dragon.

Fully Observable Environment: In a fully observable environment the agent has full information about the environment, that is, it directly observes the environment state. This setting is a Markov Decision Process (MDP).

           Agent state = Environment state = Markov state

Partially Observable Environment: Here the agent does not have full information about the environment; instead, it builds its own representation of the environment using the history. This setting is a Partially Observable Markov Decision Process (POMDP). For example, a robot with a camera and a GPS unit that is left out in an unknown place doesn’t know the environment or its location at first, but it builds up a memory of the environment by observing it (a simple sketch of this idea follows below).

                    Agent state ≠ Environment state
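A common and simple way for an agent to build its own state from the history is to keep the last few observations. The sketch below is purely illustrative; the class name and the choice of “last k observations” are assumptions, not from the article.

```python
from collections import deque

class AgentMemory:
    """Agent state built from the history: here, just the last k observations (illustrative)."""

    def __init__(self, k=4):
        self.recent = deque(maxlen=k)

    def update(self, observation):
        self.recent.append(observation)

    def state(self):
        return tuple(self.recent)     # agent state != environment state

memory = AgentMemory(k=3)
for obs in ["wall", "open", "open", "coin"]:
    memory.update(obs)
print(memory.state())                 # ('open', 'open', 'coin')
```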

Components of an RL Agent

The major components that build up an RL agent are:

  1. Policy:

The policy is the agent’s behaviour. It is the rule the agent follows to get the most reward: it maps a state to an action. It can be deterministic (one fixed action per state) or stochastic (for a given state, a probability distribution over actions).
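Here is a small sketch of both kinds of policy in Python. The state names, actions and probabilities are invented for illustration.

```python
import random

# Deterministic policy: each state maps to exactly one action (toy mapping, made up).
deterministic_policy = {"enemy_ahead": "shoot", "pit_ahead": "jump", "clear": "run"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy: each state maps to a probability distribution over actions (made up).
stochastic_policy = {
    "enemy_ahead": {"shoot": 0.8, "jump": 0.2},
    "clear":       {"run": 0.9, "jump": 0.1},
}

def act_stochastic(state):
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("enemy_ahead"))   # always "shoot"
print(act_stochastic("enemy_ahead"))      # "shoot" about 80% of the time
```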

2. Value function:

It is a prediction of the expected future reward. It tells us how much total reward we can expect if we keep acting according to a policy. Suppose we can take two actions, action 1 and action 2, which lead to state 1 and state 2: how do we choose the better one? The value function helps us pick the state with the higher expected total reward, or, more generally, tells us how much reward to expect from following a certain path in a game.

The value function can be written as:

        v𝛑(s) = E𝛑[ R(t+1) + 𝛄 R(t+2) + 𝛄² R(t+3) + … | S(t) = s ]

This equation gives the expected value, v, of a state, s, when following a policy, 𝛑. Here t is the time step of the current state, t+1 the next one, and so on.

Gamma, 𝛄, is the discount factor. It controls how much we care about current versus later rewards: with 𝛄 = 0 we care only about the immediate reward, with 𝛄 = 1 we treat every future reward as equally relevant, and with something like 𝛄 = 0.99 we treat nearby rewards as more relevant and faraway rewards as less relevant.
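A quick numeric sketch of how the discount factor changes the return; the reward sequence is made up.

```python
# Discounted return for a sequence of future rewards R(t+1), R(t+2), ...
# (the rewards below are invented for illustration).
def discounted_return(rewards, gamma):
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

rewards = [1, 1, 1, 10]
print(discounted_return(rewards, 0.0))    # 1.0    -> only the immediate reward counts
print(discounted_return(rewards, 1.0))    # 13.0   -> every reward counts equally
print(discounted_return(rewards, 0.99))   # ~12.67 -> distant rewards count slightly less
```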

3. Model:

It is the agent’s representation of the environment. It helps the agent learn how the environment works and then plan what to do next. A model has two parts:

a) Transition: It predicts the next state, e.g., given dynamics such as the velocity and position of an object, what the environment will do next.

        P(s' | s, a) = P[ S(t+1) = s' | S(t) = s, A(t) = a ]

This equation gives the probability of ending up in the next state, s', given the current state, s, and action, a.

b) Reward: It predicts the immediate reward, that is, how much reward the agent will get for taking an action in a given state.

        R(s, a) = E[ R(t+1) | S(t) = s, A(t) = a ]

This equation gives the expected immediate reward, R, given the current state, s, and action, a.
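A small tabular model in Python might look like the sketch below. The states, actions, probabilities and rewards are all invented for illustration.

```python
# For each (state, action) pair, the model stores predicted next-state probabilities
# and the expected immediate reward (all values here are made up).
transition_model = {
    ("s0", "right"): {"s1": 0.9, "s0": 0.1},   # P[s' | s, a]
    ("s1", "right"): {"s2": 1.0},
}
reward_model = {
    ("s0", "right"): -1.0,                      # E[R | s, a]
    ("s1", "right"): +10.0,
}

def predict(state, action):
    return transition_model[(state, action)], reward_model[(state, action)]

next_state_probs, expected_reward = predict("s0", "right")
print(next_state_probs, expected_reward)        # {'s1': 0.9, 's0': 0.1} -1.0
```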

Building a model of the environment is optional. There are effective RL algorithms that don’t use a model at all.

Types of RL agents

Source: David Silver’s UCL course on RL
  1. Value-based: No policy is stored explicitly; the agent keeps an explicit value function and picks the best action greedily with respect to it.
  2. Policy-based: The policy is stored explicitly, with no value function. The agent judges how well it is doing by the reward it collects from the actions it picks.
  3. Actor-critic: Both a policy and a value function are stored.
  4. Model-free: No model is used, that is, the agent doesn’t try to learn how the environment behaves; it relies on a value function and/or a policy.
  5. Model-based: A model of the environment is used along with a value function and/or a policy.
              build model → use policy/value function

Exploration Exploitation problem

Suppose that in a treasure-hunt game our goal is to reach the treasure as fast as possible, and we go out and try different paths. If we keep exploring many different paths to see whether the treasure is there, that alone won’t guarantee that we win the game. If we instead choose one path and keep following it to the end, that strategy won’t guarantee a win either. The best approach is a mixture of both: exploit what we know, and also explore a little bit.

Exploration means to get more information about the environment.

Exploitation means to make use of the information already found to maximise the reward.

This tradeoff between exploring and exploiting has to be balanced without losing too much reward; one standard recipe for doing so is sketched after the examples below.

Examples:

  1. Restaurant Selection —

Exploitation: Going to the same restaurant again and again.

Exploration: Trying out a new restaurant every time.

  2. Game Playing —

Exploitation: Playing the move you think is best.

Exploration: Experimenting with new moves.
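One standard recipe for balancing the two is epsilon-greedy action selection: with a small probability we explore a random option, otherwise we exploit the option with the best estimated value. The sketch below reuses the restaurant example; the value estimates are made up.

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(action_values))       # explore: try anything
    return max(action_values, key=action_values.get)    # exploit: best known choice

# Made-up estimates of how much we like each option.
action_values = {"restaurant_A": 4.2, "restaurant_B": 3.9, "new_place": 0.0}
choices = [epsilon_greedy(action_values, epsilon=0.2) for _ in range(1000)]
print(choices.count("restaurant_A"))   # picked most of the time, but not always
```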

We’ll see more Reinforcement Learning concepts in Part 2, available here.

Please click the 👏 button if you liked the post, and hold it to give more love.

If you wish to connect:

Twitter Instagram LinkedIn Github
