## Friday, August 21, 2015

### Probability

How likely are you to read this through? If your answer is a numerical value between 0 and 1, you may skip this post. You already know the material.

Why am I writing about probability? First of all, because I really LOVE probability. If you don't, I hope that by the end of this post you would like it a little bit more. Second, we use probability in everyday life: when we plan an outdoor activity, we estimate the probability of rain. When we make life decisions, we think of the probable consequences, since we can't tell the future... Unfortunately, most people use it wrong, without basic understanding of probability.
And last, I was about to write a post about Machine Translation, and realized that I can't explain anything without first introducing probability. Probability is widely used in NLP, as in many other computer science fields.

Probability is a numerical value between 0 and 1 measuring the likeliness that an event will occur. 0 means that the event will not occur, and 1 means that the event will certainly occur. An intuitive example is tossing a coin; there are two possible outcomes: "heads" and "tails". If the coin is fair, the probability (chance) of each outcome is ½ (50%): P(heads) = P(tails) = ½. A fair coin and the two outcomes of its tossing.

In general, when conducting an experiment, such as tossing a coin, there are several possible outcomes (e.g. { heads, tails }). Every outcome's event (e.g. "the coin landed on heads") has a probability between 0 and 1. Since every experiment must have an outcome, the probability that any of the possible outcomes occurred is 1. In this example, P("heads or tails") = 1. This event represents the entire "probability space".

As you can see from the example above, an event can be composed of several outcomes. Think about another experiment: rolling a die. The possible outcomes are: {1, 2, 3, 4, 5, 6}. The event "the outcome is an odd number" is composed of the outcomes 1, 3, and 5. We can write it as A={1,3,5}. A die with 3 out of 6 possible outcomes showing.

If two events have no common outcomes, they are called disjoint, and the probability that any one of them will occur is the sum of probabilities that each of them will occur. For example, A={1} and B={2}. P(A or B), denoted P(A ∪ B), is the probability that either A or B occurred. We know that a die can only show one number, so A and B can't both occur, and:
P(A ∪ B) = P(A) + P(B).

We already know that the probability of the entire probability space (all possible outcomes) is 1, so: P({1, 2, 3, 4, 5, 6}) = 1. We also know that the events are disjoint, therefore P({1, 2, 3, 4, 5, 6}) equals to the sum of probabilities of all events. If the die is fair (it is not biased towards a certain outcome), then the probability of all outcomes is equal. Therefore, P(1) = ... = P(6) = ⅙. This is called uniform distribution. In most real-world examples, this is not the case, otherwise, probability would have been boring (and probability is fascinating! Really!).

Every event has a complement. For example, the event A="the die shows an odd number"={1, 3, 5} has a complement Ā="the die shows an even number"={2, 4, 6}. The event B={1} has a complement B̄={2, 3, 4, 5, 6}. The complement of an event is all the other possible outcomes. Now you must notice that by definition, "A or Ā" is the entire probability space, and A and Ā are disjoint. Using the two properties we've just discussed:
1) the probability of the entire probability space is 1.
2) the probability that any of disjoint events occurred is the sum of their probabilities.
We can tell that P(A) + P(Ā) = 1. So if you know the probability that it would rain tomorrow P(R), you also know the probability that it won't rain tomorrow: 1 - P(R).

Joint & Conditional Probabilities
We can also discuss the joint probability of events A and B: this is the probability that both events will occur. For example, what is the probability of rolling an even number which is bigger than 2? Let's define two events. A is the event of even outcomes: A = {2, 4, 6}, and B is the event of outcomes larger than 2: B = {3, 4, 5, 6}. Then C is the intersection of A and B, denoted A ∩ B: it contains all the outcomes that are both even (in A) and larger than 2 (in B): C = {4, 6}. Since {4} and {6} are disjoint, and the probability of each outcome is ⅙, then P(C), which is also denoted as the joint probability of A and B, P(A, B) = P({4}) + P({6}) = ⅙ + ⅙ = ⅓.

Say that you know event A occurred, for example, you know that it rains today. Does it change the probability of another event B to occur, for example, that you will be late to work today? The probability of event B to occur, given that event A occurred, is the conditional probability P(B|A) (B given A). If A and B are dependent, this probability is different from the prior probability of B: P(B) (the probability of B, without having knowledge about A). The conditional probability of B given A is the ratio of how likely it is that A and B occur together, given that A has occurred:

(1) P(B|A) = P(A,B) / P(A)

For example, when rolling a die, let A={1,3,5} (odd outcome) and B={4,5,6} (outcome greater than 3). Then P(A,B) = P(A ∩ B) = P({5}) = ⅙. P(A) = ⅙ + ⅙ + ⅙ = ½.
Therefore, P(B|A) = P(A,B) / P(A) =  / ½ = ⅓ < P(B) = P({4, 5, 6}) = ⅙ + ⅙ + ⅙ = ½. So If you know that outcome was odd, the probability that it was greater than 3 has reduced.

On the other hand, some events may be independent. For example, if B is the event of outcomes larger than 2: B = {3, 4, 5, 6}, and A remains the same, then P(A,B) = P(A ∩ B) = P({3,5}) = ⅙ + ⅙ = . P(A) remains ½, and P(B|A) = P(A,B) / P(A) =  / ½ = ⅔ = P(B) = P({3, 4, 5, 6}) = ⅙ + ⅙ + ⅙ + ⅙ = . So knowing that the outcome was odd doesn't affect the chances the the outcome is greater than 2, and A and B are independent.

If two events A and B are independent, then P(B|A) = P(B), P(A|B) = P(A), and using equation (1) we get that P(A,B) = P(A)P(B). So if you know that two events are independent, and you want to know the probability that both of them will occur, you need to multiply the probabilities that each of them will occur. For example, what is the probability that a die will have an odd outcome (A) and a coin will show heads (B)? Intuitively, these two experiments are independent, so P(A) = ½, P(B) = ½, and P(A,B) = ½ * ½ = ¼. But don't trust your intuition, and always make sure that these events are really independent. Sometimes two events seem independent, while they are actually not (as in the butterfly effect). If this butterfly flapped his wings yesterday, what are the chances I will be late to work next week?

Bayes Rule
Using equation (1) we get that P(A,B) = P(A) * P(B|A), and this gives us Bayes Rule:

(2) P(A|B) = P(A) * P(B|A) / P(B)

This can be useful in some cases when you know the conditional probability in one direction and would like to compute the other. For example, let's say that there is a clinical test that should diagnose a specific illness. This test is not a 100% accurate: if a person is ill, it will come out positive in 98% of the times. If the person is healthy, it will come out positive in 1% of the times. The ratio of ill people in the population is 2%. Say that someone took this test, and it came out positive. Does it necessarily mean that he is sick? No. It has some probability, that we can compute.

A is the event that a person has this illness. B is the event that the test came out positive. We know that P(B|A)=0.98 (the probability that the test came out positive for an ill person). We also know that P(A)=0.02 (the probability to suffer from this illness). We would like to compute the probability P(A|B). We can use equation (2), but we need to know P(B) - the probability that the test came out positive.

We can use the law of total probability according to which:

(3) P(B) = P(A,B) + P(Ā,B) = P(B|A) P(A) + P(B|) P()

In this example, what is the probability that the test came out positive? There are two cases, one in which the person is ill, and another in which he is healthy. These events are disjoint (because a person is either ill or healthy but never both).

So we get that P(B) P(B|A) P(A) + P(B|) P() = 0.98*0.02 + 0.01*(1-0.02) = 0.0294, and using Bayes rule, P(A|B) = P(A) * P(B|A) / P(B) = 0.02*0.98 / 0.0294  ⅔. Since the test is not very accurate, and the illness is so rare, if someone is tested positive for this illness, there is a probability of ⅓ that he is actually healthy!

The Chain Rule
We've seen that P(A,B) = P(A) * P(B|A). Sometimes it is useful in this direction. This is called the chain rule, and it can be extended to more than two events. In some cases, we would like to compute the probability of multiple events, rather than just one or two. For example, say that you know the probabilities of private names in a certain country. When a child is born, he has a certain probability to be named John (P(John)) and other probabilities for other names. If you know the names of his older siblings, this may affect the probability of his name; if his older sibling is called John, it reduces the probability that he will also be named John. And if his sister's name is Ablah, then he is more likely to be named Mohammad than David. If you want to compute the probability of the names of all children in the family, for example P(John, Jane, David), you can use the chain rule. You will need to know the prior probability of the name John (what is the probability that a kid is called John, if you don't have any knowledge about his siblings, or if he is the first child). Then, you will need to know the probability of a girl being called Jane, given that her brother's name is John. Last, you will need to know the probability of a boy being called David, given that he has two siblings named John and Jane. In general, the probability that events A1,A2,...,Aoccurred is the multiplication of the probability of each event, given that the previous events occurred:

(4) P(A1,A2,...,An) = P(A1) P(A2|A1) ... P(An|A1A2,...,An-1)

This, again, is called the chain rule.

In some cases, we can make a Markov assumption that the probability of an event depends only on the preceding k events (for some fixed number k). For example, if a family has 5 children, the probability of their names will be:
P(A1,A2,A3,A4,A5) = P(A1) P(A2|A1P(A3|A1A2) P(A4|A1A2A3) P(A5|A1A2A3A4)
But if we assume that a child's name only depends on his two immediate older siblings' names, then we get:
P(A1,A2,A3,A4,A5) = P(A1) P(A2|A1P(A3|A1A2) P(A4|A2A3) P(A5|A3A4)
which is easier to compute.

Approximation
Let's return to the names example. What if you don't know the probability function of names, but you do have access to the list of all given names in a certain time and country? You can estimate (approximate) the probability. One simple way of doing this is by counting. This method is called Maximum Likelihood Estimation (MLE). If you want to know what's the probability of a child being named John, check what's the ratio of people called John in the entire population, so that: P*(John) = #John/N, where N is the number of people in the list you have. Since this is not a real probability but an approximation, it is denoted by P* and not by P.

If you want to know the probability of someone being called Jane given that her immediate older sibling's name is John, count all the pairs of John followed by Jane and divide by all pairs of John followed by any name:  P*(Jane|John) = #(John,Jane)/#(John,*).

There are more complex methods to approximate a probability function, but I think that's enough for one post.

So, given that you are reading this sentence now, what is the probability that you read the entire post?