In this article, we will try to win a (fictive) lottery called Win-Win. In fact, you will learn how to do a MAP (maximum a posteriori) estimation on a Bernoulli distribution. Don’t worry if you don’t know all these words, everything will be explained. If you already know some of the terms, then you can skip these parts.
Basic probability theory
The first thing we need to know is how to calculate with uncertainty. Now a variable is assigned an extra property, namely its uncertainty. Suppose is such a random variable and lets say we are for (read: ) certain that it equals one. Then we write . There are certain axioms (rules) that are always true. Note that the maximum certainty is and the minimum certainty is . In other words: (this is sloppy notation, but it explains the main concept).
Suppose there are two random variables and . Then we can talk about them both being true: . This is called the joint probability. Furthermore, there is a conditional probability: (read: the probability of given ). For example, you can ask what is the probability of rain () given that there are clouds (): . It can be proved that it holds that . Note that since we also find that . By combining the formulas, we get . This is also called Bayes’ rule and it can be viewed as a kind of “inverse” of the original probability.
The Bernoulli distribution is one of the easiest probability distribution that exist. In the Bernoulli world, a variables is either “on” or “off”. And being “on” happens with a probability, say . An example is tossing a coin. Say we have a fair coin. This mean that the probability of landing on a head is .
Let be a random variable which follows the Bernoulli distribution with probability . In shorthand notation, we write . Its probability can be calculated as follows: .
We will use this distribution later, but lets first move on to Bayes’ rule. What might be interesting, is this article on spam detection which in fact is an implementation of Bayes’ rule in Python.
Maximum a Posteriori
Now, inside Bayes’ rule, there are other probability distributions. Recall Bayes’ rule: . is called the evidence, is called the prior. This is information you know on forehand. And is called the likelihood. This will be clear if we replace the letters. Let be the data and let be a model. Then means how well the data explains your model. This is called the posterior distribution. Furthermore, we have that . Where is the prior. It tells us how probable our model is and is the information we know on forehand. We can even say we have a uniform prior, which means that all models are equally likely. Furthermore, is the likelihood and it tells us how likely the data is given our model.
A maximum a posteriori (MAP) estimate are model parameters () such that the posterior is maximized. In other words: .
Other models for parameter estimation are neural networks. More information about these models is found here.
Win the lottery
Now suppose we have inside information about the lottery and we know the results of tickets which contained the number “latex 8”. If a result resulted in winning the lottery, we write a and if a result resulted in not winning the lottery we write a . These are the last results: . What does this say about the underlying distribution? First notice that the underlying distribution is a Bernoulli distribution (with a parameter ). This is because the results are either “on” () or “off” (). We want to find the most likely given our data . Lets assume that each result is independent of any other result.
The first thing to notice is that it holds that by the independence assumption and by noting that each is identically distributed.
There are some theoretical results which are not explained in this article, but will be used here. The Beta distribution is the conjugate distribution of the Bernoulli distribution. So that means that the prior and the posterior have the same form. If the prior is Beta, the likelihood is Bernoulli, then the posterior distribution will also be a Beta distribution. The Beta distribution has the following probability density function: with .
Back to the problem. Our prior is . Now we will make the problem first more complex by saying that we have a Beta prior on . So follows a Beta distribution with parameters and . This means that . Now we can write out our posterior:
Notice that the prior is a Beta distribution, the likelihood is a Bernoulli distribution, so the posterior must also be a Beta distribution with parameters and . Now we know that . Finding is finding such that is maximized. This is equal to finding for which it holds that the derivative of the posterior is . This is equal to finding for which it holds that the derivative of the log of the posterior is . In other words:
Notice that we can trow away constants, since they do not contribute to the value of . So the last formula implies that:
Let say that we have no prior knowledge and we set . Now we get and . Therefore we have that and . Now we can calculate . What can we do now? We are interested in winning the lottery. In other words, if , the odds are in our favor! We can approximately calculate the cumulative distribution function for . We get . So, with even this small set of data, we are already more certain than uncertain that we will win the lottery more than half of the time! So we will buy as many as lottery tickets as possible.
Calculate how likely it is that you will win the lottery more than half of the times given the following data: . Use a uniform prior (choose ).
- Calculate .
- Calculate .
- Calculate the probability likelihood function.
- Calculate the posterior distribution.
- Approximate the cumulative density function for .