
Saturday, May 25, 2013

DNA: Full of Surprises

DNA is full of surprises, one of them being the radically different ways in which it can be used to express information. We think of DNA as a four-letter language (A,T,G,C), but some organisms choose to "speak" mostly G and C. Others avoid G and C, preferring instead to "speak" A and T. The question is, if DNA is fundamentally a four-letter language, why would some organisms want to limit themselves to dialects that use mostly just two letters?

The DNA of Clostridium botulinum (the botulism bug; a common soil inhabitant) is extraordinarily deficient in G and C: over 70% of its DNA is A and T. The soil bacterium Anaeromyxobacter dehalogenans, on the other hand, has DNA that's 74% G and C. Think of the constraints this puts on a coding system. Imagine that you want to store data using a four-letter alphabet, but you are required to use two of the four letters 74% of the time! Suddenly a two-bit-per-symbol encoding scheme (a four-letter code) starts to look and feel a lot more like a one-bit-per-symbol (two-letter) scheme.

What kinds of information are actually stored in DNA? Several kinds, but bottom line, DNA is primarily a system for specifying sequences of amino acids. The information is stored as three-letter "words" (GCA, ATG, TCG, etc.) called codons. There are 64 possible length-3 words in a system that uses a 4-letter alphabet. Fortunately, there are only 20 amino acids. I say "fortunately," because imagine if there were 64 different amino acids (as there might be in extra-terrestrial life, say) and they had to occur in roughly equal amounts in all proteins. Every possible codon would have to be used (in roughly equal numbers) and there would be no possibility of an organism like C. botulinum developing a "preference" for A or T in its DNA. It is precisely because only 20 codons out of a possible 64 need be used that organisms like C. botulinum (with a huge imbalance of AT vs. GC in its DNA) can exist.
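To make the counting concrete, here's a minimal sketch in Python that enumerates every possible codon:

```python
from itertools import product

bases = "ATGC"
codons = ["".join(t) for t in product(bases, repeat=3)]
print(len(codons))   # 4**3 = 64: two bits per letter, six bits per codon
```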

As it happens, all organisms do tend to use all 64 possible codons, but they use them with vastly varying frequencies, giving rise to codon "dialects." (Note that the mapping of 64 codons onto 20 amino acids means some codons are necessarily synonymous. For example, there are four different codons for glycine and six for leucine.) You might expect that an organism like C. botulinum with mostly A and T in its DNA would "speak" in A- and T-rich codons. And you'd be right. Here's a chart showing which codons C. botulinum actually uses, and at what frequencies:

C. botulinum codon usage.
The green-highlighted codons are the ones C. botulinum uses preferentially (with the usage frequencies shown as percentages). As you can see, the most-often-used codons tend to contain a lot of A and/or T. Which is exactly what you'd expect, given that the organism's DNA is 72% A and T.

In theory, a 3-letter word in a 4-letter language can store six bits of information. But we know from information theory that the actual information content of a word depends on how often it's used. If I send you a 100-word e-mail that contains the question "Why?" repeated 100 times, you're not really receiving the same amount of information as would be in a 100-word e-mail that contains text in which no word appears twice.
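A toy sketch makes the difference concrete (the word lists below are placeholders, not real messages):

```python
from collections import Counter
from math import log2

def entropy_per_word(words):
    counts = Counter(words)
    n = len(words)
    # Equivalent to summing -F * log2(F) over each distinct word.
    return sum((c / n) * log2(n / c) for c in counts.values())

repeated = ["Why?"] * 100                    # one word, 100 times
varied = [f"word{i}" for i in range(100)]    # 100 distinct placeholder words

print(entropy_per_word(repeated))   # 0.0 bits per word: pure repetition
print(entropy_per_word(varied))     # log2(100), about 6.64 bits per word
```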

The average information content of a C. botulinum codon is easily calculated using the usage frequencies shown above. (All you do is calculate -F * log2(F) for each codon and add up the results.) If you do the math, you find that C. botulinum uses an average of 5.217 bits per codon, about 13% short of the theoretical six bits available.
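In code, the calculation looks something like this (a sketch with made-up toy frequencies; to reproduce the 5.217-bit figure you'd plug in the 64 real usage values from the chart):

```python
from math import log2

def codon_entropy(freqs):
    # Sum -F * log2(F) over every codon with nonzero usage frequency F.
    return -sum(F * log2(F) for F in freqs.values() if F > 0)

# Toy frequencies for illustration only; a real table has 64 entries
# summing to 1.0 (plug in the chart's values to get ~5.217 bits).
toy = {"GCA": 0.5, "ATG": 0.25, "TCG": 0.25}
print(codon_entropy(toy))   # 1.5 bits for this three-codon toy
```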

One might imagine that the more GC/AT-imbalanced an organism's DNA is, the more biased its codon preferences will be. This is exactly what we find if we plot codon entropy against genome G+C content for a range of organisms having DNA of various G+C contents.

Average codon entropy versus genome G+C content for 90 microorganisms.
In the above graph, you can see that when an organism's DNA is composed of equal amounts of the bases (G+C = 50%, A+T = 50%), the organism tends to use all codons more or less equally, and entropy approaches the theoretical limit of six bits per codon. But when an organism develops a particular "dialect" (of GC-rich DNA, or AT-rich DNA), it starts using a smaller and smaller codon vocabulary more and more intensively. This is what causes the curve to fall off sharply on either side of the graph.
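For the curious, here's a simplified model of that ceiling, assuming bases are drawn independently with p(G) = p(C) = gc/2 and p(A) = p(T) = (1 - gc)/2. Real genomes fall below this curve, because codon preferences add bias beyond raw base composition:

```python
from math import log2

def ideal_codon_entropy(gc):
    # Per-base probabilities under an independence assumption.
    probs = [gc / 2, gc / 2, (1 - gc) / 2, (1 - gc) / 2]
    base_entropy = -sum(p * log2(p) for p in probs if p > 0)
    return 3 * base_entropy   # three independent bases per codon

for gc in (0.26, 0.50, 0.74):
    print(f"G+C = {gc:.0%}: at most {ideal_codon_entropy(gc):.3f} bits/codon")
# G+C = 50% allows the full 6 bits; 26% or 74% caps codons near 5.48 bits.
```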

If you have an observant eye, you may have noticed that the two halves of the graph are not symmetrical, even though they look symmetrical at first glance. (For a given amount of GC/AT skew, organisms on the high-GC side have slightly lower per-codon entropy than organisms on the low-GC side.) If you're a biologist, you might want to think about why this is so. I'll return to the subject in a future post.

Friday, May 24, 2013

Decrypting DNA

In a previous post ("Information Theory in Three Minutes"), I hinted at the power of information theory to gauge redundancy in a language. A fundamental finding of information theory is that when a language uses some symbols more often than others (for example, when vowels turn up more often than consonants in English), it's a tipoff to redundancy.

DNA is a language with many hidden redundancies. It's a four-letter language, with symbol choices of A, G, C, and T (adenine, guanine, cytosine, and thymine), which means any given symbol should be able to convey two bits' worth of information, since log2(4) is two. But it turns out, different organisms speak different "dialects" of this language. Some organisms use G and C twice as often as A and T; if you do the math (with p(G) = p(C) = 1/3 and p(A) = p(T) = 1/6), each symbol is actually carrying a maximum of 1.918 bits (not 2 bits) of information.
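Here's that arithmetic as a quick sketch:

```python
from math import log2

# If G and C each appear twice as often as A and T:
freqs = [1 / 3, 1 / 3, 1 / 6, 1 / 6]   # p(G), p(C), p(A), p(T)
bits_per_base = -sum(p * log2(p) for p in freqs)
print(bits_per_base)   # ~1.918 bits per symbol, short of the full 2 bits
```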

Consider how an alien visitor to earth might be able to use information theory to figure out terrestrial molecular biology.

The first thing an alien visitor might notice is that there are four "symbols" in DNA (A, G, C, T).

By analyzing the frequencies of various naturally occurring combinations of these letters, the alien would quickly determine that the natural "word length" of DNA is three.
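How might such an analysis go? Here's one toy approach, using a made-up "genome" assembled from a restricted codon vocabulary rather than real sequence data: guess a block size, chop the sequence into non-overlapping blocks of that size, and look for a dip in entropy per letter.

```python
import random
from collections import Counter
from math import log2

random.seed(1)
bases = "ATGC"
all_codons = ["".join((a, b, c)) for a in bases for b in bases for c in bases]
vocab = random.sample(all_codons, 20)   # pretend only 20 codons ever occur
genome = "".join(random.choice(vocab) for _ in range(100_000))

def bits_per_letter(text, w):
    # Chop into non-overlapping w-letter blocks; entropy per letter of blocks.
    blocks = [text[i:i + w] for i in range(0, len(text) - w + 1, w)]
    n = len(blocks)
    H = -sum((c / n) * log2(c / n) for c in Counter(blocks).values())
    return H / w

for w in range(1, 7):
    print(w, round(bits_per_letter(genome, w), 3))
# The entropy per letter dips at w = 3 (and again at 6, a multiple of 3):
# the telltale signature of a language whose natural word length is three.
```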

There are 64 possible 3-letter words that can be spelled with a 4-letter alphabet. So in theory, a 3-letter "word" in DNA should convey 6 bits worth of information (since 2 to the 6th power is 64). But an alien would look at many samples of earthly DNA, from many creatures, and do a summation of -F * log2(F) for every 3-letter "word" used by a given creature's DNA (where F is simply the frequency of usage of the 3-letter combo). From this sort of analysis, the alien would find that even though 64 different codons (3-letter words) are, in fact, being used in earthly DNA, in actuality the entropy per codon in some cases is as little as 4.524 bits. (Or at least, it approaches that value asymptotically.)

Since 2 to the 4.524 power is 23, and since proteins (the predominant macromolecule in earthly biology) are made of amino acids, a canny alien would surmise that there must be around 23 different amino acids, and that earthly DNA is a language for mapping 3-letter words to those 23 amino acids.
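The inference amounts to computing an "effective vocabulary size": 2 raised to the observed entropy is the number of equally likely words that would produce that entropy.

```python
observed_entropy = 4.524          # bits per codon, from the analysis above
effective_words = 2 ** observed_entropy
print(round(effective_words))     # 23: about 20 amino acids plus 3 stops
```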

As it turns out, the genetic code does use 3-letter "words" (codons) to specify amino acids, but there are 20 amino acids (not 23), with 3 "stop codons" reserved for telling the cell's protein-making machinery "this is the end of this protein; stop here." Twenty amino acids plus three stops makes 23 distinct meanings, which is just what the alien's entropy estimate pointed to.

E. coli codon usage.
The above chart shows the actual codon usage pattern for E. coli. Note that all organisms use the same 3-letter codes for the same amino acids, and most organisms use all 64 possible codons, but the codons are used with vastly unequal frequencies. If you look in the upper right corner of the above chart, for example, you'll see that E. coli uses CTG (one of the six codons for Leucine) far more often than CTA (another codon for Leucine). One of the open questions in biology is why organisms favor certain synonymous codons over others (a phenomenon called codon usage bias).

While DNA's 6-bit codon bandwidth permits 64 different codons, and while organisms do generally make use of all 64 codons, the uneven usage pattern means fewer than 6 bits of information are used per codon. To get the actual codon entropy, all you have to do is take each codon's usage frequency F, calculate -F * log2(F), and sum over all 64 codons. If you do that for E. coli, you get 5.679 bits per codon. As it happens, E. coli actually does make use of almost all the available bandwidth (of 6 bits) in its codons. This turns out not to be true for all organisms, however.

Saturday, May 18, 2013

Information Theory in Three Minutes

Claude Shannon, the father of information theory, used to play an interesting game at cocktail parties. He'd grab a book, open it to a random page, and cover up all but the first letter on the page, then ask someone to guess the next letter. If the person couldn't guess, he'd uncover the letter, then ask the person to guess the next letter. (Suppose the first two letters are 'th'. A reasonable guess for the next letter might be 'e'.) Shannon would continue in this manner, keeping score, until a good deal of text had been guessed. The further along one goes in this game, the easier it becomes (of course) to guess downstream letters, because the upstream letters provide valuable context.

What Shannon consistently found from experiments of this sort is that well over half of English letters are redundant, because they can be guessed in advance. In fact, Shannon found that when all forms of redundancy are taken into account, English is more than 75% redundant, with the average information content of a letter being approximately one bit per symbol. (Yes, one bit. See Shannon's "Prediction and Entropy of Printed English.")

Claude Shannon
Shannon became intrigued by questions involving the efficiency of information transfer. What is the nature of redundancy in an information stream? Are some encodings more redundant than others? How can you quantify the redundancy? Eventually, Shannon elaborated a mathematical theory around the encoding and decoding of messages. That theory has since become extremely important for understanding questions of encryption, compression, detection of faint signals in the presence of noise, recovery of damaged signals, and so on.

A central concept in Shannon's theory is that of entropy. "Shannon entropy" is very widely misunderstood and/or misinterpreted, so it's important to be clear on what it's not. It's not disorder: Entropy, in information theory, is not the same as entropy in thermodynamics, even though the mathematics are similar. Shannon liked to consider entropy a statistical parameter reflecting the amount of information (or resolved uncertainty) encoded, on average, by a symbol. We think of the English alphabet as having 26 symbols. Since 26 values can be encoded in log2(26) == 4.7 bits, we say that the channel bandwidth for 26-letter English is 4.7 bits per symbol, but this is not the entropy. Shannon found that the entropy (the actual bits used per symbol) was closer to 1.0 than to 4.7. How can this be? The answer has to do with the fact that some symbols are used far more often than others; and also (as noted), some symbols are redundant by virtue of context.

Entropy gets to the actual (rather than ideal) information content of a message by taking into account actual frequencies of usage of symbols. If English text used all letters of the alphabet equally (and unpredictably), then the entropy of text would be exactly 4.7 bits per symbol: each symbol would contribute (1/26) * -log2(1/26) to the total. But when some letters are used more often than others, each letter's contribution becomes -p * log2(p), where p is its frequency of use, and the total comes out to less than 4.7. (Uneven usage always lowers the sum; the uniform case is the maximum.)
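A quick sketch shows the effect (the skewed distribution below is invented for illustration; it is not real English letter data):

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

uniform = [1 / 26] * 26
print(entropy(uniform))   # log2(26), about 4.70 bits per letter

# An invented skew: one letter used half the time, the rest sharing equally.
skewed = [0.5] + [0.5 / 25] * 25
print(entropy(skewed))    # about 3.32 bits: below 4.7, as promised
```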

It's easy to visualize this with a simple example involving coin-tossing. Suppose, for the sake of example, that a series of coin tosses comprises a message. As a medium of communication, the coin toss is capable of expressing only two states: heads or tails. These could be represented in binary form as 1 and 0. If half of all tosses are heads and half are tails, then the total entropy in the message is 0.5 * log2(0.5) for heads plus 0.5 * log2(0.5) for tails, or one bit per symbol. (Note: if you actually do the math, you'll come up with -1. Hence, in entropy calculations, the result is usually multiplied by -1 so it can be expressed as a positive number.)

Consider now the situation of a two-headed coin. In this case, there is no "tails" term and the heads term is 1.0 * log2(1.0), or zero. This means the tossing of a two-headed coin resolves no uncertainty and carries no information.

Continuing the example, consider the case of a weighted penny that falls heads-up two-thirds of the time. Intuitively, we know that this kind of coin toss can't possibly convey as much information as a "fair" coin toss. And indeed, if we calculate 2/3 * log2(2/3) for heads plus 1/3 * log2(1/3) for tails, we get an entropy value of 0.9183 bits per symbol, which means that each toss is (on average) 1.0 - 0.9183 == 0.0817, or 8.17%, redundant. If one were to take a large number of coin tosses involving the weighted penny and convert those tosses into symbols ('h' for heads and 't' for tails, say), the resulting data stream would be compressible to 91.83% of its fully expanded size, but no further, because that's the entropy limit.
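All three coin examples can be checked in a few lines:

```python
from math import log2

def entropy(probs):
    # Shannon entropy: sum of -p * log2(p), written here as p * log2(1/p).
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit per toss
print(entropy([1.0]))        # two-headed coin: 0.0 bits (no information)
H = entropy([2 / 3, 1 / 3])  # the weighted penny
print(H)                     # ~0.9183 bits per toss
print(1 - H)                 # ~0.0817, i.e., about 8.17% redundant
```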

Actually, that compressibility claim needs to be qualified. We're assuming, throughout this example, that the result of any given coin toss does not depend on the outcome of the preceding toss. If that rule is violated, then the true entropy of the "message" could be much lower than 0.9183 bits per symbol. For example, suppose the result of 12 successive coin-tosses was: h-h-t-h-h-t-h-h-t-h-h-t. There's a recurring pattern, and the pattern makes the stream predictable. Predictability reduces entropy; remember Shannon's cocktail-party experiment. (You might ask yourself what a message with all possible redundancy removed would look like, and in what way or ways, if any, it would differ from apparent randomness.)

Technically speaking, when symbols represent independent choices (not depending on what came before), the entropy can be calculated as before, and it's called the order-zero entropy. But if any given symbol depends on the value of the immediately preceding symbol, we have to distinguish between order-zero and order-one entropy. There are also order-two, order-three, and higher-order entropies, representing contexts of contexts.
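Here's a sketch of the distinction, using the repeating h-h-t stream from above. Order-one entropy is estimated as H(pair) - H(single), i.e., the average uncertainty of a symbol given the symbol before it:

```python
from collections import Counter
from math import log2

stream = "hht" * 100   # the repeating h-h-t pattern from the text

def order0(s):
    # Entropy from single-symbol frequencies alone.
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())

def order1(s):
    # H(next | previous) = H(adjacent pair) - H(single symbol).
    pairs = [s[i:i + 2] for i in range(len(s) - 1)]
    n = len(pairs)
    h_pair = -sum((c / n) * log2(c / n) for c in Counter(pairs).values())
    return h_pair - order0(s[:-1])

print(order0(stream))  # ~0.918 bits: looks like a mere weighted coin
print(order1(stream))  # ~0.667 bits: one symbol of context exposes structure
# Higher-order entropies would fall further; for a perfectly periodic
# stream like this one, the true entropy rate is zero.
```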

Suppose now I tell you that an organism's DNA can contain only two types of base-pairs: GC and AT. (You should be thinking "coin toss.") Suppose, further, I tell you that a particular organism's DNA is 70% GC. Disregarding higher-order entropy, does the DNA contain redundancy? If so, how much? Answer: 0.7 * log2(0.7) for GC plus 0.3 * log2(0.3) for AT equals 0.8813 bits per symbol, meaning the redundancy is about 12%. Could the actual redundancy be higher? Yes. It depends on what kinds of recurring patterns exist in the actual sequence of A, G, C, and T values. There might be recurring motifs of many kinds. Each would push entropy lower.

Further Reading
Shannon's best-known paper, "A Mathematical Theory of Communication," Bell System Technical Journal, July and October 1948
"A Symbolical Analysis of Relay and Switching Circuits," Shannon's unpublished master's thesis
Claude Shannon's contribution to computer chess
Shannon-Fano coding
Nyquist-Shannon Sampling Theorem