In this article, we will try to win a (fictive) lottery called Win-Win. In fact, you will learn how to do a MAP (maximum a posteriori) estimation on a Bernoulli distribution. Don’t worry if you don’t know all these words, everything will be explained. If you already know some of the terms, then you can skip these parts.

## Basic probability theory

The first thing we need to know is how to calculate with uncertainty. Now a variable is assigned an extra property, namely its uncertainty. Suppose $latex X$ is such a random variable and lets say we are for $latex \frac{8}{10}$ (read: $latex 80\%$) certain that it equals one. Then we write $latex P(X=1) = \frac{8}{10}$. There are certain axioms (rules) that are always true. Note that the maximum certainty is $latex 100\%$ and the minimum certainty is $latex 0\%$. In other words: $latex 0 \le P(X) \le 1$ (this is sloppy notation, but it explains the main concept).

Suppose there are two random variables $latex X$ and $latex Y$. Then we can talk about them both being true: $latex P(X, Y)$. This is called the joint probability. Furthermore, there is a conditional probability: $latex P(X|Y)$ (read: the probability of $latex X$ given $latex Y$). For example, you can ask what is the probability of rain ($latex r$) given that there are clouds ($latex c$): $latex P(r|c)$. It can be proved that it holds that $latex P(X,Y) = P(X|Y)P(Y)$. Note that since $latex P(X,Y) = P(Y,X)$ we also find that $latex P(X,Y) = P(Y,X) = P(Y|X)P(X)$. By combining the formulas, we get $latex P(X|Y) = \frac{P(X,Y)}{P(Y)} = \frac{P(Y|X)P(X)}{P(Y)}$. This is also called Bayes’ rule and it can be viewed as a kind of “inverse” of the original probability.

## Bernoulli distribution

The Bernoulli distribution is one of the easiest probability distribution that exist. In the Bernoulli world, a variables is either “on” or “off”. And being “on” happens with a probability, say $latex p$. An example is tossing a coin. Say we have a fair coin. This mean that the probability of landing on a head is $latex p=\frac{1}{2}$.

Let $latex X$ be a random variable which follows the Bernoulli distribution with probability $latex p$. In shorthand notation, we write $latex X \sim \mbox{Bernoulli}(p)$. Its probability can be calculated as follows: $latex P(X=x)=\theta^x (1 – \theta)^{1-x}$.

We will use this distribution later, but lets first move on to Bayes’ rule. What might be interesting, is this article on spam detection which in fact is an implementation of Bayes’ rule in Python.

## Maximum a Posteriori

Now, inside Bayes’ rule, there are other probability distributions. Recall Bayes’ rule: $latex P(X|Y) = \frac{P(Y|X)P(X)}{P(Y)}$. $latex P(Y)$ is called the evidence, $latex P(X)$ is called the prior. This is information you know on forehand. And $latex P(Y|X)$ is called the likelihood. This will be clear if we replace the letters. Let $latex D$ be the data and let $latex M$ be a model. Then $latex P(M|D)$ means how well the data explains your model. This is called the posterior distribution. Furthermore, we have that $latex P(M|D) = \frac{P(D|M)P(M)}{P(D)}$. Where $latex P(M)$ is the prior. It tells us how probable our model is and is the information we know on forehand. We can even say we have a uniform prior, which means that all models are equally likely. Furthermore, $latex P(D|M)$ is the likelihood and it tells us how likely the data is given our model.

A maximum a posteriori (MAP) estimate are model parameters ($latex M$) such that the posterior is maximized. In other words: $latex M_{MAP} = \arg\!\max\limits_{M}{P(M|D)}$.

Other models for parameter estimation are neural networks. More information about these models is found here.

## Win the lottery

Now suppose we have inside information about the lottery and we know the results of tickets which contained the number “latex 8”. If a result resulted in winning the lottery, we write a $latex 1$ and if a result resulted in not winning the lottery we write a $latex 0$. These are the last $latex 10$ results: $latex \textbf{s} = (0, 1, 1, 0, 0, 1, 0, 1, 1, 1)$. What does this say about the underlying distribution? First notice that the underlying distribution is a Bernoulli distribution (with a parameter $latex \theta$). This is because the results are either “on” ($latex 1$) or “off” ($latex 0$). We want to find the most likely $latex \theta$ given our data $latex \textbf{s}$. Lets assume that each result is independent of any other result.

The first thing to notice is that it holds that $latex P(\textbf{s}|\theta)=\prod\limits_{i=1}^{n} P(s_i | \theta) = \prod\limits_{i=1}^n \theta^{s_i} (1 – \theta)^{1 – s_i}$ by the independence assumption and by noting that each $latex s_i$ is identically distributed.

There are some theoretical results which are not explained in this article, but will be used here. The Beta distribution is the conjugate distribution of the Bernoulli distribution. So that means that the prior and the posterior have the same form. If the prior is Beta, the likelihood is Bernoulli, then the posterior distribution will also be a Beta distribution. The Beta distribution has the following probability density function: $latex P(x|\alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha – 1}(1 – x)^{\beta – 1}$ with $latex B(\alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}$.

Back to the problem. Our prior is $latex P(\theta)$. Now we will make the problem first more complex by saying that we have a Beta prior on $latex \theta$. So $latex \theta$ follows a Beta distribution with parameters $latex \alpha$ and $latex \beta$. This means that $latex P(\theta|\alpha, \beta) \sim Beta(\alpha, \beta)$. Now we can write out our posterior:

$latex P(\theta| \textbf{s}, \alpha, \beta) =\frac{P(\textbf{s}|\theta, \alpha, \beta) P(\theta | \alpha, \beta)}{P(\alpha, \beta)}$

$latex \propto P( \theta | \alpha , \beta )P(\textbf{s} | \alpha , \beta )$

$latex = \frac{1}{B( \alpha , \beta )}\theta^{\alpha – 1}(1 – \theta)^{\beta – 1}$

$latex \prod\limits_{i=1}^n \theta^{s_i} (1 – \theta)^{1 – s_i}$

$latex \propto \theta^{\alpha – 1}(1 – \theta)^{\beta – 1}\prod\limits_{i=1}^n \theta^{s_i} (1 – \theta)^{1 – s_i}$

$latex = \theta^{\alpha – 1 + \sum\limits_{i=1}^n s_i} (1 – \theta)^{\beta – 1 + n – \sum\limits_{i=1}^n s_i}$

Notice that the prior is a Beta distribution, the likelihood is a Bernoulli distribution, so the posterior must also be a Beta distribution with parameters $latex \alpha_0 = \alpha + \sum\limits_{i=1}^n s_i$ and $latex \beta_0 = \beta + n – \sum\limits_{i=1}^n s_i$. Now we know that $latex P(\theta | \textbf{s}, \alpha, \beta) = \frac{1}{B(\alpha_0, \beta_0)} \theta^{\alpha_0 – 1} (1 – \theta)^{\beta_0 – 1}$. Finding $latex \theta_{MAP}$ is finding $latex \theta$ such that $latex P(\theta | \textbf{s}, \alpha, \beta )$ is maximized. This is equal to finding $latex \theta$ for which it holds that the derivative of the posterior is $latex 0$. This is equal to finding $latex \theta$ for which it holds that the derivative of the log of the posterior is $latex 0$. In other words: $latex \frac{\partial \log(P(\theta_{MAP} | \textbf{s}, \alpha, \beta))}{\partial \theta_{MAP}} = 0 \Rightarrow$

$latex \frac{\partial}{\partial \theta_{MAP}} (\log(1) – \log(B(\alpha_0, \beta_0) + (\alpha_0 – 1) \log(\theta_{MAP}) + (\beta_0 – 1) \log(1 – \theta_{MAP})) = 0$

Notice that we can trow away constants, since they do not contribute to the value of $latex \theta$. So the last formula implies that:

$latex \frac{\alpha_0 – 1}{\theta_{MAP}} – \frac{\beta_0 – 1}{1 – \theta_{MAP}} = 0$

$latex \Rightarrow \theta_{MAP} = \frac{\alpha_0 – 1}{\alpha_0 + \beta_0 – 2}$

Let say that we have no prior knowledge and we set $latex \alpha=\beta=1$. Now we get $latex \sum\limits_{i=1}^n s_i = 6$ and $latex n = 10$. Therefore we have that $latex \alpha_0 = 1 + 6 = 7$ and $latex \beta_0 = 1 + 10 – 6 = 5$. Now we can calculate $latex \theta_{MAP} = \frac{6}{10} = \frac{3}{5}$. What can we do now? We are interested in winning the lottery. In other words, if $latex \theta_{MAP} > \frac{1}{2}$, the odds are in our favor! We can approximately calculate the cumulative distribution function for $latex \theta_{MAP}>\frac{1}{2}$. We get $latex P(\theta_{MAP} > \frac{1}{2} | \textbf{s}, \alpha, \beta) \approx 64\%$. So, with even this small set of data, we are already more certain than uncertain that we will win the lottery more than half of the time! So we will buy as many as lottery tickets as possible.

## Exercise

Calculate how likely it is that you will win the lottery more than half of the times given the following data: $latex \textbf{s} = (0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1)$. Use a uniform prior (choose $latex \alpha = \beta = 1$).

Hints:

- Calculate $latex n$.
- Calculate $latex \sum\limits_{i=1}^n s_i$.
- Calculate the probability likelihood function.
- Calculate the posterior distribution.
- Approximate the cumulative density function for $latex \theta_{MAP} > \frac{1}{2}$.