A Tutorial on Multi-Armed Bandits: To Explore or Exploit?


The Multi-Armed Bandit problem is a classical Reinforcement Learning problem that illustrates the exploration-exploitation trade-off. In this tutorial, we will look at the problem of advertorial design and answer the following question: which advertorial will lead to the highest gains?

Suppose we have an advertorial with a few parameters:

  • The color of the button (green, red or blue)
  • The font size (small or large)
  • The font style (bold, underlined or normal)

What design will work best? Notice that it is not efficient to choose design values at random. In that case, we observe a lot of information (explore), but we never capitalise on the best-performing designs (exploit). On the other extreme, we could serve only the best-known design, but then we would never test new designs and the performance of our advertorial might drop. What is the best trade-off between exploring and exploiting?
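To make the trade-off concrete, here is a minimal sketch of epsilon-greedy, a classic baseline that is not used in the rest of this tutorial: with probability epsilon we explore a random option, otherwise we exploit the option with the best observed click rate. The click probabilities below are made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical click probabilities per button color (unknown to the algorithm)
true_ctr = {'red': 0.8, 'green': 0.1, 'blue': 0.0}

counts = {color: 0 for color in true_ctr}  # how often each color was shown
clicks = {color: 0 for color in true_ctr}  # how often each color was clicked
epsilon = 0.1                              # fraction of rounds spent exploring

for _ in range(1000):
    if rng.random() < epsilon:
        # Explore: show a random color
        color = rng.choice(list(true_ctr))
    else:
        # Exploit: show the color with the best observed click rate so far
        # (untried colors get an optimistic estimate of 1.0 so each is tried at least once)
        color = max(true_ctr, key=lambda c: clicks[c] / counts[c] if counts[c] else 1.0)
    counts[color] += 1
    clicks[color] += int(rng.random() < true_ctr[color])

# The best color ends up being shown the vast majority of the time
print(max(counts, key=counts.get))
```

Epsilon-greedy explores at a fixed rate forever; the algorithm we will use below instead lets the amount of exploration shrink naturally as it grows more certain.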

Now, we will create an agent which will click the advertorial in some cases but won’t click the advertorial in other cases. The agent has a preference for certain advertorials:

import numpy as np
import json

# All the options we can choose from
options = {
    'button': ['red', 'green', 'blue'],
    'size': ['small', 'large'],
    'style': ['bold', 'underline', 'normal']
}

class SimpleAgent:
    """This agent has a preference for large, red buttons with either a bold or normal style."""
    def __init__(self):
        self.parameters = {
            'button': {
                'red': 0.8,       # The agent will click on red buttons with a probability of 80%
                'green': 0.1,     # The agent will click on green buttons with a probability of 10%
                'blue': 0.0       # The agent will never click on blue buttons
            },
            'size': {
                'small': 0.1,     # The agent will click when the advertorial has a small font with a probability of 10%
                'large': 0.9      # The agent will click when the advertorial has a large font with a probability of 90%
            },
            'style': {
                'bold': 0.8,      # The agent will click when there is a bold font with a probability of 80%
                'underline': 0.1, # The agent will click when there is an underlined font with a probability of 10%
                'normal': 0.9     # The agent will click when there is a normal font with a probability of 90%
            }
        }

    def does_click(self, design):
        """This function returns True if the agent does click on the advertorial, otherwise False."""
        for param in design:
            random_number = np.random.random()
            design_value = self.parameters[param][design[param]]
            # Determine whether the agent has clicked for this particular parameter
            # The agent does not click when the random number falls below 1 minus the click probability for this parameter
            if random_number < 1. - design_value:
                return False
        # The agent will click when this statement is reached!
        return True
def compute_ctr(agent, design, n=100):
    """Compute the Click Through Rate (CTR) for a given agent and a given design and for a number of runs."""
    clicks = 0
    for _ in range(n):
        clicks += int(agent.does_click(design))
    return float(clicks) / float(n)

# Setup the agent
agent = SimpleAgent()

# Create a design
design = {
    'button': 'red',
    'size': 'large',
    'style': 'bold'
}

# Now we will test the design
# We know that our agent prefers large, red buttons with a bold style, so this design should score well
# Compute the CTR (Click Through Rate)
print(f'The design got a CTR of {compute_ctr(agent, design) * 100:.0f}% for design {json.dumps(design)}')

# Create a second design with a small font - such that the agent will click less often
design_small_font = {
    'button': 'red',
    'size': 'small',
    'style': 'bold'
}

# Now we will test this design
print(f'The design got a CTR of {compute_ctr(agent, design_small_font) * 100:.0f}% for design {json.dumps(design_small_font)}')

This code has the following output (and it will slightly differ at every run since it is probabilistic):

The design got a CTR of 58% for design {"button": "red", "size": "large", "style": "bold"}
The design got a CTR of 10% for design {"button": "red", "size": "small", "style": "bold"}

You can see that the agent prefers the first design over the second, since only the font size differs. Can our algorithm also spot this pattern?
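We can sanity-check these numbers. Since does_click draws an independent random number for every parameter and only returns True when each one passes, the theoretical CTR of a design is simply the product of its per-parameter click probabilities. A quick sketch (the probabilities are copied from SimpleAgent so the example is self-contained):

```python
# Click probabilities of SimpleAgent, repeated here for a self-contained example
agent_prefs = {
    'button': {'red': 0.8, 'green': 0.1, 'blue': 0.0},
    'size': {'small': 0.1, 'large': 0.9},
    'style': {'bold': 0.8, 'underline': 0.1, 'normal': 0.9},
}

def expected_ctr(design):
    """Multiply the per-parameter click probabilities."""
    ctr = 1.0
    for param, value in design.items():
        ctr *= agent_prefs[param][value]
    return ctr

print(expected_ctr({'button': 'red', 'size': 'large', 'style': 'bold'}))  # ~0.576
print(expected_ctr({'button': 'red', 'size': 'small', 'style': 'bold'}))  # ~0.064
```

The simulated 58% is close to the theoretical 57.6%; the 10% measured for the small-font design is a noisy estimate of the true 6.4%, which is expected with only 100 runs.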

The algorithm: Thompson Sampling

Now we will take a look at Thompson Sampling: an algorithm that solves the Multi-Armed Bandit problem. The algorithm starts with a flat (Beta) prior distribution under which it selects each design option with the same probability: a red button will appear as often as a green or a blue button. Then the algorithm observes clicks and updates the probability distribution accordingly. If more clicks arrive while a red button is shown, the red button will be shown more often. The following code implements the sampling method:
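To get a feeling for the Beta distribution before diving into the code, here is a small sketch (with made-up counts) showing how the flat Beta(1, 1) prior concentrates once clicks and misses are observed:

```python
import numpy as np

rng = np.random.default_rng(42)

# Flat prior: 1 pseudo-click and 1 pseudo-miss -- every click rate is equally likely
prior_samples = rng.beta(1, 1, size=10_000)

# Posterior after observing 40 clicks and 10 misses: mass concentrates around 0.8
posterior_samples = rng.beta(1 + 40, 1 + 10, size=10_000)

print(round(float(prior_samples.mean()), 2))      # close to 0.50
print(round(float(posterior_samples.mean()), 2))  # close to 41 / 52, i.e. about 0.79
```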

class ThompsonSampler:
    def __init__(self, options):
        self.params = {
            option: {
                value: {
                    False: 1.,
                    True: 1.
                }
                for value in options[option]
            }
            for option in options
        }

    def observe(self, design, has_clicked):
        for option in design:
            self.params[option][design[option]][has_clicked] += 1
    def __str__(self):
        return json.dumps(self.params, indent=4)

sampler = ThompsonSampler(options)
design = {
    'button': 'red',
    'size': 'large',
    'style': 'bold'
}
sampler.observe(design=design, has_clicked=True)
print(sampler)

This code produces the following output:

{
    "button": {
        "red": {
            "false": 1.0,
            "true": 2.0
        },
        "green": {
            "false": 1.0,
            "true": 1.0
        },
        "blue": {
            "false": 1.0,
            "true": 1.0
        }
    },
    "size": {
        "small": {
            "false": 1.0,
            "true": 1.0
        },
        "large": {
            "false": 1.0,
            "true": 2.0
        }
    },
    "style": {
        "bold": {
            "false": 1.0,
            "true": 2.0
        },
        "underline": {
            "false": 1.0,
            "true": 1.0
        },
        "normal": {
            "false": 1.0,
            "true": 1.0
        }
    }
}

It started with a prior distribution where all values were set to “1”. After observing a click for the specified design, the counts were adjusted accordingly.
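These counts translate directly into a click-rate estimate: with a the True count and b the False count, the posterior mean of the corresponding Beta distribution is a / (a + b). A quick sketch using the red button's counts from the output above:

```python
import numpy as np

# The red button's counts after one observed click: True -> 2, False -> 1
a, b = 2.0, 1.0

# Posterior mean of Beta(a, b)
print(round(a / (a + b), 2))  # 0.67

# Sampling from the same Beta distribution averages out to the same value
samples = np.random.default_rng(0).beta(a, b, size=10_000)
print(round(float(samples.mean()), 2))  # close to 0.67
```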

Design generator

Now, we will create a design generator that uses the sampler as input and generates a design based on the output of the sampler:

class DesignGenerator:
    def __init__(self, sampler):
        self.sampler = sampler
    def get_scores(self, option):
        scores = []
        for value in self.sampler.params[option]:
            a = self.sampler.params[option][value][True]
            b = self.sampler.params[option][value][False]
            scores.append(np.random.beta(a, b))
        scores = np.array(scores)
        scores /= scores.sum()
        return scores
    def generate(self):
        design = {}
        for option in self.sampler.params:
            items = list(self.sampler.params[option].keys())
            scores = self.get_scores(option)
            design[option] = np.random.choice(items, p=scores)
        return design

sampler = ThompsonSampler(options)
design_generator = DesignGenerator(sampler)
print(design_generator.generate())

As you can see, it generates a random design for us at first, but the designs become more tailored as more clicks are observed. The following design was generated after running the code:

{'button': 'green', 'size': 'small', 'style': 'normal'}

The results

Will Thompson sampling be able to figure out the preferences of the agent? With the following code, we can observe the statistics for the button colour:

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
sampler = ThompsonSampler(options)
advertorial_maker = DesignGenerator(sampler)
button_scores = []
for i in range(300):
    button_scores.append(dict(zip(options['button'], advertorial_maker.get_scores('button').tolist())))
    design = advertorial_maker.generate()
    does_click = agent.does_click(design)
    sampler.observe(design, does_click)

df_button = pd.DataFrame.from_records(button_scores)
df_button.plot(style='.', color=['red', 'green', 'blue'], ms=20, alpha=0.5)
plt.title('Button color')
plt.xlabel('Number of observations')
plt.ylabel('Sampling probability')

This code observes the agent 300 times. And yes, the algorithm correctly learns that a red button should be shown more often than a green or a blue button. Also notice that the algorithm still explores: it can still produce a green button in some cases:

The statistics for the button color.

The font size statistics show a similar pattern:

Font size statistics.

Notice that the algorithm has found a preference for large font sizes after 300 observations. And lastly, we can also see the statistics for the font style:

Statistics for the font style.

This indeed matches the preference of our agent! The agent liked the normal style and the bold style, but disliked underlined fonts.
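If you eventually want a single winning design rather than a stream of sampled ones, you can read it off the trained sampler by picking, per option, the value with the highest posterior mean. The helper and stand-in class below are not part of the tutorial code, just a sketch that assumes the params layout of ThompsonSampler:

```python
class _Sampler:
    """Stand-in with the same `params` layout as ThompsonSampler."""
    def __init__(self, params):
        self.params = params

def best_design(sampler):
    """Pick, per option, the value with the highest posterior mean click rate."""
    design = {}
    for option, values in sampler.params.items():
        design[option] = max(
            values,
            key=lambda v: values[v][True] / (values[v][True] + values[v][False]),
        )
    return design

# Made-up posterior counts after a training run
trained = _Sampler({
    'button': {
        'red':   {True: 120., False: 40.},
        'green': {True: 15.,  False: 60.},
        'blue':  {True: 5.,   False: 70.},
    },
    'size': {
        'small': {True: 20.,  False: 90.},
        'large': {True: 130., False: 30.},
    },
})

print(best_design(trained))  # {'button': 'red', 'size': 'large'}
```

Note that serving only this single design stops all exploration, so it is something you would do once the posteriors have stabilised.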


Thank you for following this tutorial! We have seen a solution to the Multi-Armed Bandit problem and applied it to advertorial optimisation. If you liked this blog post, please share it or leave a comment below.