Learning Flappy Bird Agents With Reinforcement Learning

Aspram Grigoryan
9 min read · Apr 4, 2022

Reinforcement Learning is arguably one of the most interesting areas of Machine Learning: it builds on the idea of learning by generalizing from interaction and experience, which feels natural when we think about the essence of learning and knowledge. As Sutton & Barto put it in Reinforcement Learning: An Introduction (2018), the idea that we learn by interacting with our environment is probably the first to occur to us when we think about the nature of learning. When an infant plays, waves its arms, or looks about, it has no explicit teacher, but it does have a direct sensorimotor connection to its environment. Exercising this connection produces a wealth of information about cause and effect, about the consequences of actions, and about what to do in order to achieve goals. Throughout our lives, such interactions are undoubtedly a major source of knowledge about our environment and ourselves.

In environments where the agent has to act without access to examples of desired behavior, it can sometimes learn from its own experience. The well-known Flappy Bird game is an ideal case for showing how traditional Reinforcement Learning algorithms can come in handy. As a simpler version of the game, we use the Text Flappy Bird environment and train Q-Learning and SARSA agents.

Q-learning and SARSA are well-suited for this particular game since they do not require a model of the environment and do not use transition probability distributions. Although the two algorithms are quite similar, they differ in how they learn action values, or, put more simply, in how they estimate which action is more beneficial in a given state. SARSA is on-policy and learns action values relative to the policy it follows, while Q-Learning is off-policy and learns them relative to the greedy policy. The update equations used by Q-Learning and SARSA are given below.
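In standard notation, with learning rate η and discount factor γ, the two updates are:

Q-Learning: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \eta \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$

SARSA: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \eta \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$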

Both algorithms require defining a state space S, an action space A, a learning rate (η), a discount factor (γ), and an epsilon rate for epsilon-greedy action selection. In this implementation of the Flappy Bird environment, the state space is defined over all possible combinations of the distance of the player from the center of the closest upcoming pipe gap along the two axes (x, y). The action space, on the other hand, consists of only two actions: flapping and staying idle.

To proceed, we install the game environment and the necessary libraries.

pip install git+https://gitlab-research.centralesupelec.fr/stergios.christodoulidis/text-flappy-bird-gym.git
%matplotlib inline
import numpy as np
from numpy.random import randn
from scipy.stats import sem
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
from IPython.display import clear_output
import os, sys
import time
import pickle
from collections import defaultdict
from tqdm import tqdm
import gym
import text_flappy_bird_gym
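Once everything is imported, we can instantiate the environment and take a quick look at what it exposes. This is a minimal sketch; the height, width and pipe_gap values are the same ones used in the experiments later in the post.

env = gym.make('TextFlappyBird-v0', height=10, width=18, pipe_gap=4)

obs = env.reset()
print(env.action_space)  # two discrete actions: flap or stay idle
print(obs)               # (x, y) distance of the player to the next pipe gap

env.close()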

Next, we define an Agent base class whose act method implements the epsilon-greedy action selection policy.

class Agent:
    """
    Base class implemented by the other agent classes
    to avoid duplicating the 'act' method.
    """

    def act(self, state):  # epsilon-greedy policy
        # Take a random action if a random float is less than epsilon,
        # otherwise select the action with the highest Q-value for the given state.
        if np.random.uniform(0, 1) < self.eps:
            action = env.action_space.sample()
            return action
        else:
            action = np.argmax(q[state])
            return action

The Q-Learning and SARSA agents inherit from the Agent class, while the Random agent defines its own act method that always samples a random action. Each class has its own update method for the action values.

class Q_Agent(Agent):
    """
    The Q-Learning Agent Class
    """

    def __init__(self, eps, step_size, discount):
        self.eps = eps
        self.step_size = step_size
        self.discount = discount

    def update(self):
        # Off-policy update: bootstrap from the greedy (maximal) value of the next state.
        old_value = q[state][action]
        next_max = np.max(q[next_state])

        new_value = (1 - self.step_size) * old_value + self.step_size * (reward + self.discount * next_max)
        q[state][action] = new_value


class SarsaAgent(Agent):
    """
    The SARSA Agent Class
    """

    def __init__(self, eps, step_size, discount):
        self.eps = eps
        self.step_size = step_size
        self.discount = discount

    def update(self):
        # On-policy update: bootstrap from the action actually selected in the next state.
        q[state][action] += self.step_size * \
            (reward + self.discount * q[next_state][next_action] - q[state][action])


class RandomAgent:
    """
    The Random Agent Class
    """

    def __init__(self, eps, step_size, discount):
        # Parameters are kept only for interface compatibility;
        # the random agent neither explores with epsilon nor learns.
        self.eps = eps
        self.step_size = step_size
        self.discount = discount

    def act(self, state):
        action = env.action_space.sample()
        return action

    def update(self):
        pass

Training the agents

agents = {
    "Q-learning": Q_Agent,
    "Sarsa": SarsaAgent,
    "RandomAgent": RandomAgent}

all_the_q_tables = {}
all_reward_sums = {}  # Contains the sum of rewards during each episode

for algorithm in ["Q-learning", "Sarsa", "RandomAgent"]:

    # A fresh environment and action-value table for every agent
    # (the same environment parameters are used in the experiments below).
    env = gym.make('TextFlappyBird-v0', height=10, width=18, pipe_gap=4)
    q = defaultdict(lambda: np.zeros(2))  # The dict of action-value estimates.

    all_reward_sums[algorithm] = []
    current_agent = agents[algorithm](eps=0.2, step_size=0.7, discount=0.95)

    for i in range(100000):

        state = env.reset()
        done = False
        total_reward = 0

        while not done:
            action = current_agent.act(state)
            # Apply the action and observe the new state of the environment
            next_state, reward, done, info = env.step(action)

            # For SARSA, acquire the on-policy next action
            next_action = current_agent.act(next_state)
            if done:
                reward = -1

            # Update total reward
            total_reward += reward
            # Update the q-values table
            current_agent.update()

            state = next_state
            if done:
                break

        all_reward_sums[algorithm].append(total_reward)

    env.close()
    all_the_q_tables[algorithm] = q

In an initial training run over 100,000 episodes, the discount factor was set to 0.95, the learning rate to 0.7 and epsilon to 0.2. The results show that both the Q-Learning and SARSA agents perform much better than the random agent.

for algorithm in ["Q-learning", "Sarsa", "RandomAgent" ]:
plt.plot(all_reward_sums[algorithm], label=algorithm)
plt.xlabel("Episodes")
plt.ylabel("Sum of\n rewards\n during\n episode",rotation=0, labelpad=40)
plt.xlim(0,1000)
plt.ylim(0,400)
plt.legend()
plt.show()

The difference in performance between these algorithms becomes clearer when plotting the learned state values.

State Value Plots

q_policy = dict((k, np.max(v)) for k, v in all_the_q_tables['Q-learning'].items())

x_list = []
y_list = []
z_list = []
for (x, y), value in q_policy.items():
    x_list.append(x)
    y_list.append(y)
    z_list.append(value)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') was removed in newer Matplotlib
ax.plot_trisurf(x_list, y_list, z_list)
plt.show()

Q-learning seems to find little difference between states that are far away from the pipe gap and assigns high value to all the states around the gap. SARSA's surface is quite similar, but much more ragged: neighbouring states that intuitively should have similar values end up with noticeably different ones.

Under some common conditions, both algorithms are expected to converge to the true value function, although at different rates. The reward curves show that SARSA converges faster than Q-learning for this particular task, but it also ends up underperforming Q-learning.

To check sensitivity to the parameters, we then vary the learning rate and the discount factor in the interval [0, 1] and plot the recorded results. The learning rate η ∈ [0, 1] controls how much of the old Q-value is kept in the update. For example, if η = 0.8 then the new estimate uses 20% of the old value and 80% of the immediate reward plus the discounted future value. The discount factor γ ∈ [0, 1] controls how much of the next possible reward is taken into account.
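As a small worked example of the blended form used in Q_Agent.update above (the numbers are made up purely for illustration): with η = 0.8, γ = 0.95, an old estimate Q(s, a) = 2, a reward of 1 and a maximal next-state value of 10,

$Q(s, a) \leftarrow (1 - 0.8) \cdot 2 + 0.8 \cdot (1 + 0.95 \cdot 10) = 0.4 + 8.4 = 8.8$

so the new estimate is pulled strongly towards the fresh target.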

Experimenting with step size

agents = {
    "Q-learning": Q_Agent,
    "Sarsa": SarsaAgent
}

env = gym.make('TextFlappyBird-v0', height=10, width=18, pipe_gap=4)

step_sizes = np.linspace(0.1, 1.0, 10)
num_episodes = 1000
all_the_q_tables = {}
all_reward_sums = {}

for algorithm in ["Q-learning", "Sarsa"]:
    for step_size in step_sizes:
        # Reset the action-value table for every run so the comparisons start from scratch
        q = defaultdict(lambda: np.zeros(2))
        reward_for_step = 0
        current_agent = agents[algorithm](eps=0.2, step_size=step_size, discount=0.95)

        for episode in range(num_episodes):
            state = env.reset()
            done = False
            total_reward = 0

            while not done:
                action = current_agent.act(state)
                # Apply the action and observe the new state of the environment
                next_state, reward, done, info = env.step(action)
                # For SARSA, acquire the on-policy next action
                next_action = current_agent.act(next_state)
                if done:
                    reward = -1
                # Update total reward
                total_reward += reward
                # Update the q-values table
                current_agent.update()
                state = next_state
                if done:
                    break

            reward_for_step += total_reward

        all_reward_sums[(algorithm, step_size)] = []
        all_reward_sums[(algorithm, step_size)].append(reward_for_step / num_episodes)

env.close()
for algorithm in ["Q-learning", "Sarsa"]:
algorithm_means = np.array([np.mean(all_reward_sums[(algorithm, step_size)]) for step_size in step_sizes])
algorithm_stds = np.array([sem(all_reward_sums[(algorithm, step_size)]) for step_size in step_sizes])
plt.plot(step_sizes, algorithm_means, marker='o', linestyle='solid', label=algorithm)
plt.fill_between(step_sizes, algorithm_means + algorithm_stds, algorithm_means - algorithm_stds, alpha=0.2)
plt.legend()
plt.xlabel("Step-size")
plt.ylabel("Sum of\n rewards\n per episode",rotation=0, labelpad=50)
plt.xticks(step_sizes)
plt.show()

The plot for the learning-rate experiment shows that a low learning rate performs quite badly for Q-learning, whereas SARSA does better when only a small part of the old Q-value is kept in the update. In addition, Q-learning is not very sensitive to the choice of learning rate in the mid-to-upper range of [0.3, 0.9].

The plot for varying the discount factor shows that Q-learning is not as sensitive to this choice, while SARSA performs significantly better when the next possible value is fully taken into account without any discounting.
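The discount-factor experiment follows the same structure as the step-size loop above. A minimal sketch (reusing the agents dict and num_episodes from that experiment, with the step size fixed at 0.7 as in the initial training run) could look like this:

discounts = np.linspace(0.1, 1.0, 10)
all_reward_sums = {}

env = gym.make('TextFlappyBird-v0', height=10, width=18, pipe_gap=4)

for algorithm in ["Q-learning", "Sarsa"]:
    for discount in discounts:
        # Fresh action-value table for every (algorithm, discount) run
        q = defaultdict(lambda: np.zeros(2))
        reward_for_discount = 0
        current_agent = agents[algorithm](eps=0.2, step_size=0.7, discount=discount)

        for episode in range(num_episodes):
            state = env.reset()
            done = False
            total_reward = 0
            while not done:
                action = current_agent.act(state)
                next_state, reward, done, info = env.step(action)
                next_action = current_agent.act(next_state)
                if done:
                    reward = -1
                total_reward += reward
                current_agent.update()
                state = next_state
            reward_for_discount += total_reward

        # Store the average reward per episode for this discount value
        all_reward_sums[(algorithm, discount)] = [reward_for_discount / num_episodes]

env.close()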

Apart from the TextFlappyBird-v0 environment, a second version is also available. The two versions of the TFB environment differ only in the observations they yield. TextFlappyBird-screen-v0 returns an array that represents the current state of the game screen encoded as integers, while TextFlappyBird-v0 returns the horizontal and vertical distance of the player to the closest upcoming pipe gap. The main limitation of TextFlappyBird-screen-v0 is the size of its state space: whereas TextFlappyBird-v0 has only a small set of possible combinations of the horizontal and vertical distances to the pipe gap, defining the state as the array representation of the whole screen makes the state space very large. This means that the current agents can run into computational limitations, and visiting each state often enough becomes practically impossible.
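To see the difference, we can create both environments and compare what a reset returns. This is a minimal sketch, assuming the screen version accepts the same height, width and pipe_gap parameters; only the shape of the observation matters here.

env_dist = gym.make('TextFlappyBird-v0', height=10, width=18, pipe_gap=4)
env_screen = gym.make('TextFlappyBird-screen-v0', height=10, width=18, pipe_gap=4)

print(env_dist.reset())              # (x, y) distance to the next pipe gap
print(np.shape(env_screen.reset()))  # the whole screen encoded as an integer array

env_dist.close()
env_screen.close()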

With an implementation of the original Flappy Bird game environment available, the current SARSA and Q-learning agents can be reused for the FlappyBird-v0 environment. For the RGB-array (image) observations, however, Deep Q-Learning could replace the regular Q-table with a neural network: rather than mapping a state-action pair to a Q-value, the network would map an input state to a Q-value for each action.
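As an illustration of that idea only (not part of the experiments above), a minimal sketch of such a network in PyTorch could look like the following; the torch dependency, layer sizes and the observation_size name are assumptions, and a full DQN would still need experience replay and a target network.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a flattened observation to one Q-value per action."""

    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),    # hidden sizes are illustrative assumptions
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),  # one output per action, e.g. flap / idle
        )

    def forward(self, obs):
        return self.net(obs)

# Greedy action selection from the predicted Q-values (observation_size is hypothetical):
# q_net = QNetwork(obs_dim=observation_size, n_actions=2)
# action = q_net(torch.as_tensor(obs, dtype=torch.float32)).argmax().item()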

To sum up, both the SARSA and Q-learning agents performed significantly better than a random agent in solving the Text Flappy Bird environment. Although it converged more slowly, Q-learning proved to be the better approach for this environment, as its recorded rewards were on average much higher than those attained by SARSA. In addition, Q-learning was found to be less sensitive to the choice of learning rate and discount factor.

Resources:

Implementation of two OpenAI Gym learning environments of a simple, unit-element-player version of Flappy Bird: https://gitlab-research.centralesupelec.fr/stergios.christodoulidis/text-flappy-bird-gym

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press.
