Saturday, May 25, 2013

DNA: Full of Surprises

DNA is full of surprises, one of them being the radically different ways in which it can be used to express information. We think of DNA as a four-letter language (A,T,G,C), but some organisms choose to "speak" mostly G and C. Others avoid G and C, preferring instead to "speak" A and T. The question is, if DNA is fundamentally a four-letter language, why would some organisms want to limit themselves to dialects that use mostly just two letters?

The DNA of Clostridium botulinum (the botulism bug; a common soil inhabitant) is extraordinarily deficient in G and C: over 70% of its DNA is A and T. The soil bacterium Anaeromyxobacter dehalogenans, on the other hand, has DNA that's 74% G and C. Think of the constraints this puts on a coding system. Imagine that you want to store data using a four-letter alphabet, but you are required to use two of the four letters 74% of the time! Suddenly a two-bit-per-symbol encoding scheme (a four-letter code) starts to look and feel a lot more like a one-bit-per-symbol (two-letter) scheme.

What kinds of information are actually stored in DNA? Several kinds, but bottom line, DNA is primarily a system for specifying sequences of amino acids. The information is stored as three-letter "words" (GCA, ATG, TCG, etc.) called codons. There are 64 possible length-3 words in a system that uses a 4-letter alphabet. Fortunately, there are only 20 amino acids. I say "fortunately," because imagine if there were 64 different amino acids (as there might be in extra-terrestrial life, say) and they had to occur in roughly equal amounts in all proteins. Every possible codon would have to be used (in roughly equal numbers) and there would be no possibility of an organism like C. botulinum developing a "preference" for A or T in its DNA. It is precisely because only 20 codons out of a possible 64 need be used that organisms like C. botulinum (with a huge imbalance of AT vs. GC in its DNA) can exist.

As it happens, all organisms do tend to use all 64 possible codons, but they use them with vastly varying frequencies, giving rise to codon "dialects." (Note that the mapping of 64 codons onto 20 amino acids means some codons are necessarily synonymous. For example, there are four different codons for glycine and six for leucine.) You might expect that an organism like C. botulinum with mostly A and T in its DNA would "speak" in A- and T-rich codons. And you'd be right. Here's a chart showing which codons C. botulinum actually uses, and at what frequencies:


The green-highlighted codons are the ones C. botulinum uses preferentially (with the usage frequencies shown as precentages). As you can see, the most-often-used codons tend to contain a lot of A and/or T. Which is exactly what you'd expect, given that the organism's DNA is 72% A and T.

In theory, a 3-letter word in a 4-letter language can store six bits of information. But we know from information theory that the actual information content of a word depends on how often it's used. If I send you a 100-word e-mail that contains the question "Why?" repeated 100 times, you're not really receiving the same amount of information as would be in a 100-word e-mail that contains text in which no word appears twice.

The average information content of a C. botulinum codon is easily calculated using the usage-frequencies shown above. (All you do is calculate -F * log2(F) for each codon and add up the results.) If you do the math, you find that C. botulinum uses an average of 5.217 bits per codon, about 13% short of the theoretical six bits available.

One might imagine that the more GC/AT-imbalanced an organism's DNA is, the more biased its codon preferences will be. This is exactly what we find if we plot codon entropy against genome G+C content for a range of organisms having DNA of various G+C contents.

Average codon entropy versus genome G+C content for 90 microorganisms.
In the above graph, you can see that when an organism's DNA is composed of equal amounts of the bases (G+C = 50%, A+T = 50%), the organism tends to use all codons more or less equally, and entropy approaches the theoretical limit of six bits per codon. But when an organism develops a particular "dialect" (of GC-rich DNA, or AT-rich DNA), it starts using a smaller and smaller codon vocabulary more and more intensively. This is what causes the curve to fall off sharply on either side of the graph.

If you have an observant eye, you may have noticed that the two halves of the graph are not symmetrical, even though they look symmetrical at first glance. (Organisms on the high-GC side are using slightly less entropy per codon than low-GC organisms, for a given amount of genome GC/AT skew.) If you're a biologist, you might want to think about why this is so. I'll return to the subject in a future post.