Thursday 13 March 2014

Genetic analysis

XKCD's take on genetic analysis.
Genetic analysis may sound complicated, but it relies on very simple principles of heredity, statistics and probability. Sometimes the genotypes are not known at all, and we need to look at pedigrees to determine modes of inheritance. This post discusses some basic principles of genetic analysis. A later post on bioinformatics will take a deeper look at how to analyze known genes and genomes.

Calculating probabilities

Inheritance, or which genes each parent passes on to the offspring, is always partially random. This is why probabilities are important. For example, if we know the absolute frequency, i.e. the number of copies, of a certain allele in a sample, we can deduce the probability that one randomly picked chromosome carries that particular allele. Set the absolute frequency of the allele a to 118, so there are 118 a alleles in our sample of 500 chromosomes. The probability that a randomly picked chromosome carries a is simply the number of a alleles divided by the number of all alleles in the sample: 118 / 500 = 0.236. This is also the relative frequency of a.
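The same arithmetic as a minimal Python sketch. The variable names are only illustrative; the numbers are the ones used above.

```python
# Relative frequency of allele a: copies of a divided by all sampled chromosomes.
count_a = 118            # absolute frequency of allele a (from the text)
total_chromosomes = 500  # size of the sample (from the text)

freq_a = count_a / total_chromosomes
freq_not_a = 1 - freq_a  # complement: the chromosome carries some other allele

print(f"f(a) = {freq_a:.3f}, f(not a) = {freq_not_a:.3f}")
```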

So, the probability of A, whatever A is, is P(A) = the number of favorable results / the number of all possibilities. The probability of A's complement, of A not happening, is 1 - P(A). For two independent events, P(A and B) = P(A) * P(B). For two mutually exclusive events, P(A or B) = P(A) + P(B). An example of mutually exclusive events: at a locus with two alleles, if a chromosome has the allele a, it cannot have the allele A. It has either one or the other (barring some rare genetic mutations, which are not considered here), so the two events are also each other's complements.

The probability of a combination of two outcomes depends on whether the outcomes are related or not. A union, P(A ∪ B), means that A or B or both happen. In general the union is P(A) + P(B) - P(A ∩ B), where the intersection P(A ∩ B) is the probability that both happen; subtracting it prevents counting that case twice. For mutually exclusive events the intersection is zero: if A has happened, B cannot happen, and vice versa.

To ease your stress, here is a cute animal.
The conditional probability, the probability of A happening when we know that B has already happened, is P(A and B) / P(B). This is written as P(A|B). If A and B are independent, P(A|B) is simply P(A).
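Here is a small illustration of conditional probability. The example is my own, not from the lecture material: two parents are both heterozygous (Ss) for a gene where S is dominant, and we ask for the probability that a child is heterozygous given that it shows the dominant trait. The sketch assumes simple Mendelian inheritance and just enumerates the four equally likely allele combinations; the helper names are made up for this post.

```python
from fractions import Fraction
from itertools import product

# Each Ss parent passes on S or s with equal probability,
# so the four (mother, father) allele pairs are equally likely.
offspring = list(product("Ss", repeat=2))

def prob(event):
    """Probability of an event over the equally likely offspring."""
    return Fraction(sum(1 for child in offspring if event(child)), len(offspring))

def shows_dominant(child):      # B: at least one dominant S allele
    return "S" in child

def is_heterozygous(child):     # A: one S and one s
    return set(child) == {"S", "s"}

p_b = prob(shows_dominant)
p_a_and_b = prob(lambda c: is_heterozygous(c) and shows_dominant(c))
p_a_given_b = p_a_and_b / p_b   # P(A|B) = P(A and B) / P(B)

print(p_b, p_a_and_b, p_a_given_b)   # 3/4 1/2 2/3
```

Knowing that the child shows the dominant trait rules out the ss genotype, which is why the probability of being heterozygous rises from 1/2 to 2/3.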

Permutations and combinations count the ways of arranging or selecting items. Permutations are used when the order of the events is important: n items can be ordered in n! ways. Combinations are used when the order does not matter: r items can be chosen from n in n! / (r! (n-r)!) ways.
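Python's standard library can do these counts directly. A minimal sketch, assuming Python 3.8 or newer (where `math.perm` and `math.comb` are available); the values of n and r are just illustrative.

```python
import math

n, r = 7, 2  # illustrative values: 7 items, pick 2

orderings = math.factorial(n)       # n! ways to order all n items
ordered_picks = math.perm(n, r)     # n! / (n-r)! ordered selections of r items
unordered_picks = math.comb(n, r)   # n! / (r! (n-r)!) selections ignoring order

print(orderings, ordered_picks, unordered_picks)   # 5040 42 21
```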

Binomial probability is a bit more complex. It is used when we want to determine the probability of getting exactly r favorable results, each with the probability p, out of n independent repeats: P(r) = n! / (r! (n-r)!) * p^r * (1-p)^(n-r). The magic word here is exactly - when that is used in an exercise, think of binomials.
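As a sketch, the binomial probability fits in a few lines of Python. The function name `binomial_probability` is my own, and the coin-flip check is only an illustration.

```python
from math import comb

def binomial_probability(r: int, n: int, p: float) -> float:
    """P(exactly r successes in n independent trials, each with success probability p)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Illustrative check: exactly 2 heads in 3 fair coin flips
print(binomial_probability(2, 3, 0.5))   # 0.375
```

Summing the function over r = 0 ... n always gives 1, which is a handy sanity check when doing exercises by hand.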

More on probabilities:
For the mathematically gifted: Wikipedia 
Cut the Knot (also covers Bayesian methods)

Statistics

Statistics have been covered before in a post named Variance Analysis (one of the most popular posts in this blog!), so I'll just remind you of the formulae we will need when doing genetic analysis.



There are thousands of helpful websites you can look up for more information on statistics. Here are just a few:

Statistics.com
Mendelian genetics by Phillip McLean

Examples

Right, let's get down to the real deal and do some analysis! The examples are from university lecture materials for a course in genetic material, but unfortunately I cannot share the entire material due to copyright restrictions and a language barrier - the materials are not in English :)

Example 1. Both parents are heterozygous for eye color: each has the allele s for blue eyes and the allele S for brown eyes. Calculate the probability that they will have 
a) a child with blue eyes  b) five children with blue eyes.

From the way the alleles are written we see that S is the dominant allele. The genotypes of the parents are Ss and Ss, so the possible genotypes of their children are SS, Ss, Ss and ss, each equally likely.

Now we can see that 3/4 of the offspring have at least one dominant allele, so only 1/4 have blue eyes (homozygote ss). Therefore P(a child has blue eyes) is 1/4 = 25 %. The probability is the same for each subsequent child, and the births are independent, so P(five of five children have blue eyes) is 0.25 * 0.25 * 0.25 * 0.25 * 0.25 = 0.000977.
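The cross can also be enumerated in code. This is a minimal sketch assuming simple Mendelian inheritance, where each parent passes on either of its two alleles with equal probability; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

mother, father = "Ss", "Ss"

# Every combination of one allele from each parent is equally likely.
# Sorting puts the uppercase (dominant) allele first, so "sS" becomes "Ss".
genotypes = Counter("".join(sorted(pair)) for pair in product(mother, father))
total = sum(genotypes.values())

p_blue = Fraction(genotypes["ss"], total)   # only ss gives blue eyes
p_five_blue = p_blue ** 5                   # five independent children

print(dict(genotypes))                      # {'SS': 1, 'Ss': 2, 'ss': 1}
print(p_blue, float(p_five_blue))           # 1/4 0.0009765625
```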


Example 2. 32 % of people infected with a rare illness have mutation A, and 16 % have mutation B. 10 % of those infected have both A and B. Calculate the probability that a randomly selected infected person has at least one of the mutations.

The selected person must now have either A or B or both. What is needed is the union of A and B: P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 0.32 + 0.16 - 0.1 = 0.38. 38 % have either A, B or both mutations. Note that the intersection P(A ∩ B) is NOT P(A) * P(B) in this case, because the mutations are not independent; it is given as 10 %.
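A quick check of this example in Python. The last line is only there to show what the overlap would have been if the two mutations were independent; the variable names are illustrative.

```python
p_a = 0.32      # P(A): has mutation A
p_b = 0.16      # P(B): has mutation B
p_both = 0.10   # P(A and B): given in the exercise

p_at_least_one = p_a + p_b - p_both   # subtract the overlap so it is counted once
p_if_independent = p_a * p_b          # what P(A and B) would be under independence

print(round(p_at_least_one, 2), round(p_if_independent, 4))   # 0.38 0.0512
```

The given overlap (10 %) is about twice what independence would predict (5.12 %), so the mutations tend to occur together.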

Example 3. There are 2 boys and 5 girls in a family. In how many different sequences could the children have been born? 

Because the boys are interchangeable with each other and so are the girls, only the positions of the boys in the birth order matter. We therefore use combinations: n over r, i.e. n! / (r! (n-r)!). The n now is 7, the number of all kids. The r can be either 2 or 5: both give the same result. If r = 2, then 
7! / (2! (7-2)!) = 7! / (2! 5!) = 5040 / 240 = 21.
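If the formula feels like magic, the sequences can also be counted by brute force. A minimal sketch that treats the boys as interchangeable with each other, and likewise the girls:

```python
from itertools import permutations
from math import comb

children = "BBGGGGG"   # 2 boys and 5 girls

# set() collapses orderings that differ only by swapping two boys or two girls
distinct_sequences = set(permutations(children))

print(len(distinct_sequences), comb(7, 2))   # 21 21
```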

Example 4. Both parents are heterozygous for a rare recessive illness. Calculate the probability that out of three children
a) all are healthy
b) two are ill
c) at least two are ill.

a) Two heterozygous parents produce recessive homozygotes with a probability of 25 % (see example 1). The probability for each child to be healthy is thus 1 - 0.25 = 0.75. P(all are healthy) = 0.75^3 = 0.422.

b) Exactly two out of three must be ill, so we need the binomial distribution. Now r = 2, n = 3 and p = 0.25. The binomial coefficient, n over r, gives 3! / (2! (3-2)!) = 3. Continuing from there we have 3 * 0.25^2 * (1-0.25)^(3-2) = 3 * 0.0625 * 0.75 = 0.141.
c) If at least two must be ill, then the probability is P(two are ill) + P(three are ill). The first part is calculated as in part b: 0.141. P(three are ill) is simply 0.25^3 = 0.015625. So P(at least two are ill) = 0.141 + 0.015625 = 0.156.
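All three parts can be checked with the binomial formula. This is a sketch with my own helper name `binom`; it assumes the children are independent and each is ill with probability 0.25.

```python
from math import comb

def binom(r, n, p):
    """Probability of exactly r affected children out of n."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

p_ill = 0.25   # both parents heterozygous, recessive illness
n = 3

p_all_healthy = binom(0, n, p_ill)                         # part a
p_two_ill = binom(2, n, p_ill)                             # part b
p_at_least_two = binom(2, n, p_ill) + binom(3, n, p_ill)   # part c

print(round(p_all_healthy, 3), round(p_two_ill, 3), round(p_at_least_two, 3))
# 0.422 0.141 0.156
```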

Example 5. The penetrance of a certain illness varies between genotypes. The penetrances are 0.01 for AA, 0.05 for Aa and 0.5 for aa. In a population the allele frequencies are f(a) = 0.05 and f(A) = 0.95. The population is in Hardy-Weinberg equilibrium. Calculate the prevalence of the illness in the whole population.

Whoa, lots of terms here! Penetrance is the probability of expressing a certain trait given a genotype. Here the trait is the illness. H-W equilibrium means that in an ideal population, where p and q are the relative frequencies of the alleles A and a, the genotype frequencies are p^2 for dominant homozygotes, 2pq for heterozygotes and q^2 for recessive homozygotes. What they are actually asking for is the prevalence, i.e. the probability that a random member of the population has the illness.

First we need to calculate the frequencies of the genotypes in the population. With the H-W equilibrium and the given frequencies we know that p = f(A) = 0.95 and q = f(a) = 0.05. Now the genotype frequencies are
AA = dominant homozygotes = p^2 = 0.95^2 = 0.9025
Aa = heterozygotes = 2pq = 2 * 0.95 * 0.05 = 0.095
aa = recessive homozygotes = q^2 = 0.05^2 = 0.0025

Now each genotype has its own probability of actually expressing the illness. To make it easier to understand we can build a table or a tree of probabilities:


So now the random person we select has a probability of 0.9025 (90.25 %) of having the genotype AA, and a person with that genotype has a probability of 0.01 (1 %) of being sick. The probability of having a certain genotype and expressing the illness is thus P(has the genotype) * P(is sick given the genotype). Note that the genotypes are mutually exclusive (a person can only have one genotype), so the probabilities for the three genotypes can be added together.

P(is sick) = P(AA and sick) + P(Aa and sick) + P(aa and sick) = 0.9025 * 0.01 + 0.095 * 0.05 + 0.0025 * 0.5 = 0.009025 + 0.00475 + 0.00125 = 0.015, i.e. about 1.5 % of the population.
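The whole calculation as a short Python sketch. The dictionary names are my own; the numbers are the ones given in the exercise.

```python
freq_A, freq_a = 0.95, 0.05   # allele frequencies from the exercise

# Hardy-Weinberg genotype frequencies
genotype_freq = {
    "AA": freq_A ** 2,
    "Aa": 2 * freq_A * freq_a,
    "aa": freq_a ** 2,
}
penetrance = {"AA": 0.01, "Aa": 0.05, "aa": 0.5}   # P(sick | genotype)

# Sum P(genotype) * P(sick | genotype) over the mutually exclusive genotypes
prevalence = sum(genotype_freq[g] * penetrance[g] for g in genotype_freq)

print(round(prevalence, 6))   # 0.015025
```

Even though aa carriers are fifty times more likely to fall ill than AA carriers, most patients are AA simply because that genotype is so common.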

Example 6. The observed absolute genotype frequencies in a population are f(AA) = 31, f(Aa) = 89 and f(aa) = 122. Is the population in Hardy-Weinberg equilibrium?

If we were mean about it, we'd say no because no actual population is ever in H-W equilibrium. However, in an exam we'd get 0 points for that, so let's calculate this. We need to use the χ² (chi-squared) test. We already have the observed frequencies. Now we need the expected frequencies, i.e. the frequencies if the population were in H-W equilibrium.

To get there we calculate the allele frequencies in the population. There are 31 individuals with two A alleles and 89 with one. In total we then have 2 * 31 + 89 = 151 A alleles. The recessive allele a is counted similarly: 2 * 122 + 89 = 333. In total there are 151 + 333 = 484 alleles, so the relative frequencies are 151 / 484 = 0.312 for A and 1 - 0.312 = 0.688 for a. So now p = 0.312 and q = 0.688. The expected absolute genotype frequencies are now
AA = p^2 * 242 (the size of the population) = 0.312^2 * 242 = 23.56
Aa = 2pq * 242 = 2 * 0.312 * 0.688 * 242 = 103.89
aa = q^2 * 242 = 0.688^2 * 242 = 114.55
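The same steps as a Python sketch, with illustrative variable names. The code does not round p to three decimals the way the hand calculation does, so the expected counts can differ from the ones above in the second decimal.

```python
observed = {"AA": 31, "Aa": 89, "aa": 122}   # observed genotype counts

n_individuals = sum(observed.values())        # 242
n_alleles = 2 * n_individuals                 # 484

# Each homozygote carries two copies of its allele, each heterozygote one
p = (2 * observed["AA"] + observed["Aa"]) / n_alleles   # frequency of A
q = 1 - p                                               # frequency of a

expected = {
    "AA": p ** 2 * n_individuals,
    "Aa": 2 * p * q * n_individuals,
    "aa": q ** 2 * n_individuals,
}

print(round(p, 3), {g: round(e, 2) for g, e in expected.items()})
```

The expected counts always add up to the population size, which is an easy way to catch arithmetic slips.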


We need to calculate the chi-squared test value using the formula χ² = Σ (O - E)^2 / E, where O is the observed and E the expected frequency of each genotype.

To make it easier, let's put our values in a table. Then we can use the formula and calculate the χ² test statistic.
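A minimal sketch of that calculation in Python, using the observed counts from the exercise and the rounded expected counts calculated above. Each genotype class contributes (observed - expected)^2 / expected to the test statistic.

```python
observed = [31, 89, 122]             # AA, Aa, aa counts from the exercise
expected = [23.56, 103.89, 114.55]   # expected counts calculated above

# Contribution of each genotype class: (O - E)^2 / E
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_squared = sum(terms)

print([round(t, 2) for t in terms], round(chi_squared, 2))
# [2.35, 2.13, 0.48] 4.97
```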


The value 4.97 is not the answer. Remember the question: is the population in H-W equilibrium? The answer hides in a χ² distribution table. To use that we need the degrees of freedom (df). In a chi-squared goodness-of-fit test df is normally the number of classes minus 1, but here we lose one more because the allele frequency p was estimated from the same data. We have three classes (three genotypes), so our df = 3 - 2 = 1. Now we look at a χ² distribution table such as this.

With df = 1 we find that our χ² value, 4.97, falls between 3.841 and 5.024. The corresponding P-values are 0.05 and 0.025, so our P is between 0.025 and 0.05. Without going too far into interpreting P-values we can just note that it is lower than 0.05, which means that we reject the null hypothesis (the population is in H-W equilibrium). P < 0.05 shows us that the population is NOT in H-W equilibrium.