Saturday, April 05, 2014

The first base in a codon is usually a purine

In my earlier post on the leprosy bacterium (Mycobacterium leprae), I showed some colorful graphs of DNA base composition, and I showed how the compositional statistics vary non-randomly according to codon base position. If you're not a biogeek, what you need to know here is the following (Bioinformatics in Sixty Seconds):
  • Genes encode information in a linear sequence of the four DNA bases A, G, C, and T (adenine, guanine, cytosine, and thymine). A and G are purines. C and T are pyrimidines.
  • The sequence on one strand of DNA is complementary to the sequence on the opposite strand of DNA. (G pairs with C and A pairs with T, so that if one strand contains the message GGTCA, the opposite strand will contain CCAGT.) Erwin Chargaff showed, by the early 1950s, that DNA contains equal amounts of G and C and equal amounts of A and T. Watson and Crick deduced the base pairing, and the double helix structure of DNA, shortly thereafter.
  • Genes are said to have a message strand and a complementary "transcribed strand." The transcribed strand is used as a template to create RNA with the same sequence as the message strand, except thymine (T) is replaced by uracil (U) in RNA.
  • If a gene encodes a protein (as most genes do), the associated RNA will ultimately be parsed three bases at a time to determine the order of amino acids in the protein. The three-base "words" in DNA/RNA are called codons.
  • There are 64 possible codons. Sixty-one of them map to the 20 amino acids; the remaining three are stop signals. Some amino acids have as many as six synonymous codons; others have just one.
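
For the code-minded, the sixty-second summary above can be sketched in a few lines of Python. This is only an illustration, not the script used for this post; the function names (complement, transcribe, codons) are made up for the example.

```python
# Illustrative helpers; the names are assumptions, not taken from the post's scripts.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(strand):
    """Base-by-base complement (G<->C, A<->T), written in the same left-to-right order."""
    return strand.upper().translate(COMPLEMENT)

def transcribe(message_strand):
    """RNA carries the message-strand sequence with T replaced by U."""
    return message_strand.upper().replace("T", "U")

def codons(seq):
    """Split a sequence into consecutive three-base words, dropping any incomplete tail."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

print(complement("GGTCA"))               # the example from the list above
rna = transcribe("ATGGCTTAA")            # a made-up nine-base "gene"
print(rna)
print(codons(rna))
```

The complement is written position by position here, to match the GGTCA/CCAGT example; the two strands of real DNA run in opposite directions, so the opposite strand is conventionally read in reverse.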
In my leprosy post, I happened to mention (and show data for) the fact that most codons begin with a purine (A or G). That's true not just for the leprosy bacterium but for almost all genes in all organisms. This fact isn't widely acknowledged in the literature (although Geoffrey Zubay mentions it in passing on page 495 of Origins of Life). The first base in a codon tends to be A or G about 60% of the time. This is true whether the genome is rich in G+C or poor in G+C. (The G+C content of genomes varies widely; it's often used as a taxonomic criterion.)

An example of an organism with very high G+C content in its genome is the common soil bacterium Streptomyces griseus, with 72% G+C. In bacteria, genome G+C content correlates loosely (r=0.46) with genome size, and sure enough, the genome of S. griseus is on the portly side, with 7265 genes encoded in 8.7 million base-pairs of DNA. Just for fun, I created a graph (below) that shows how DNA base content varies according to codon base position. Relative purine content (A+G) is plotted on the y-axis; G+C content is on the x-axis. Each dot represents one gene's statistics. (For every gene, I tallied the average purine usage in all the gene's codons, and the average G+C content, at each codon base position.) Red dots are for the first base in a codon. Gold dots are for base two. Blue dots are for base three.
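
Here is a rough Python sketch of the per-gene tally described above. It is not the script that produced the graph. The function name position_stats and the toy sequence are invented for illustration, and it assumes each gene arrives as a plain in-frame coding sequence (whether stop codons were counted in the original analysis is not stated, so this sketch simply counts every complete codon it is given).

```python
PURINES = set("AG")
GC = set("GC")

def position_stats(cds):
    """Return [(gc_fraction, purine_fraction), ...] for codon positions 1, 2 and 3."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3      # ignore any incomplete trailing codon
    if usable == 0:
        raise ValueError("sequence is shorter than one codon")
    stats = []
    for pos in range(3):
        bases = cds[pos:usable:3]         # every third base, starting at this position
        n = len(bases)
        gc = sum(b in GC for b in bases) / n
        purine = sum(b in PURINES for b in bases) / n
        stats.append((gc, purine))
    return stats

# Toy five-codon sequence (hypothetical, not from either genome).
for position, (gc, purine) in enumerate(position_stats("ATGGCCGAGACGTAA"), start=1):
    print(f"base {position}: G+C = {gc:.2f}, A+G = {purine:.2f}")
```

Each gene contributes three (x, y) points to the graph, one per codon position: x is the G+C fraction and y is the purine fraction.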

Base composition by codon position for 7265 genes in S. griseus strain XylebKG-1. Each point represents statistics for one gene. Notice the extremely high G+C content in base 3. See text for discussion.

Notice how the blue dots (representing base three) slam up against the right edge of the graph. The data points look as if they're cut off or something, but they're not! The "jammed up" appearance is the result of base three having an extremely high G+C content in Streptomyces, coming very close to the theoretical maximum of 100% in many genes. This is nothing unusual. In organisms with high-GC genomes, the third codon position is usually very, very high in G+C, while in organisms with low-GC genomes, the "wobble base" (as it's sometimes called) is usually extraordinarily rich in A+T. The third codon position is, to a large extent, informationally redundant (due to codon degeneracy) and so the choice of base here usually corresponds to G or C if the genome is characteristically GC-rich, or A or T if the genome is AT-rich.
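
To put a number on that redundancy, here is a small Python sketch (my own illustration, not part of the original analysis) that uses the standard genetic code to ask: if you change a single base at a given codon position, how often does the amino acid stay the same? The function name synonymous_fraction is made up, and the sketch assumes the standard code, considers only the 61 sense codons, and treats a change to a stop codon as non-synonymous.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_fraction(position):
    """Fraction of single-base changes at `position` (0, 1 or 2) that keep the amino acid."""
    same = total = 0
    for codon, aa in CODE.items():
        if aa == "*":                     # skip stop codons as starting points
            continue
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODE[mutant] == aa    # True counts as 1
    return same / total

for pos in range(3):
    print(f"position {pos + 1}: {synonymous_fraction(pos):.3f}")
```

Under those assumptions, about 69% of third-position changes are silent, versus about 4% at position one and none at position two. That is why base three is free to drift toward whatever the genome as a whole favors.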

Unlike position three, the first base in a codon is highly significant, informationally, and is subject to strong selection pressure. Notice how the red dots cluster fairly high on the graph (median: y=0.5957, plus or minus 0.0497 SD). It means most codons begin with a purine (A or G). Don't ask me how or why nature made this choice. The choice is pretty clear, though. Sixty percent of the time, you'll find A or G in position one. In some genes, it's more like 75%; very few are under 50%.
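
Summary numbers like these are simple to compute once you have one purine fraction per gene. A minimal Python sketch follows, with a made-up list of values standing in for the real per-gene data. The function name summarize is hypothetical, and the use of the sample standard deviation is an assumption, since the post doesn't say which flavor of SD was reported.

```python
from statistics import median, stdev

def summarize(fractions):
    """Median, sample SD, and share of genes whose purine fraction is below 0.5."""
    return {
        "median": median(fractions),
        "sd": stdev(fractions),           # assumption: sample SD (n - 1 denominator)
        "share_below_half": sum(f < 0.5 for f in fractions) / len(fractions),
    }

# Made-up first-position purine fractions for six imaginary genes.
toy = [0.62, 0.58, 0.55, 0.71, 0.49, 0.60]
print(summarize(toy))
```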

Base two (gold dots) shows a relaxation of G+C preference (median: x=0.5139) and a reduced tendency for purine usage (median: y=0.4475), but you'll notice, if you look closely, that there's a secondary cloud of gold points under the main cloud. This secondary cloud of points has a special significance, which I'll talk about in my next post.

To complement the above graph, I thought it might be fun to do the same sort of analysis on a genome containing conspicuously low G+C content. For this, I chose Clostridium botulinum (Hall strain), which has a genomic G+C content of just 28.2%. The botulinum genome encodes 3404 genes in 3.7 million base-pairs of DNA. Like S. griseus, C. botulinum can be found in soil, but unlike Streptomyces (which is aerobic), members of the Clostridiales are famously anaerobic and find oxygen downright intolerable (not that it matters for this analysis, however).

When we look at base compositions in various codon positions for the 3404 coding regions of C. botulinum DNA, we get the following graph:

Here, notice that the blue dots are way over to the left, reflecting the organism's preference for a very low overall genomic G+C content. Base two (gold points) is not quite as GC-extreme. But notice where the red dots are clustering: They're quite high on the graph, at a median y-value of 0.6956, with very few data points falling below 0.6. The preference for a purine in codon position one is very strong in C. botulinum.

If you look carefully, you can see (once again) a tiny "breakaway group" of gold dots underneath the main cloud of gold (base two) data points. This has a special significance; I'll explain it in my next post.

In sum: Two features of codon base composition are highly general and apply across organisms (and across domains of life). First: The third ("wobble") base has the most extreme G+C content, biased toward high G+C (if the organism has characteristically high genomic GC content) or toward low G+C (if the organism has a low-GC genome). Second: The base in codon position one tends to be a purine in most codons, in most proteins, in most organisms, most of the time. The first rule is easy to explain: low information content and low selection pressure allow an organism to put whatever base is most convenient (whatever's lying around handy, you might say) in the "wobble" position. The second rule is harder to explain. Nature simply prefers a purine in the first position of a codon.

Genomes (from the National Center for Biotechnology Information) were downloaded from on April 1, 2014. Data processing was done with custom scripts (written in JavaScript). Graphs (and statistics) were generated using the excellent service at