Saturday, April 05, 2014

The first base in a codon is usually a purine

In my earlier post on the leprosy bacterium (Mycobacterium leprae), I showed some colorful graphs of DNA base composition, and I showed how the compositional statistics vary non-randomly according to codon base position. If you're not a biogeek, what you need to know here is the following (Bioinformatics in Sixty Seconds):
  • Genes encode information in a linear sequence of the four DNA bases A, G, C, and T (adenine, guanine, cytosine, and thymine). A and G are purines. C and T are pyrimidines.
  • The sequence on one strand of DNA is complementary to the sequence on the opposite strand of DNA. (G pairs with C and A pairs with T, so that if one strand contains the message GGTCA, the opposite strand will contain CCAGT.) The groundwork for this pairing rule was laid by Erwin Chargaff, who found around 1950 that DNA contains equal amounts of A and T, and equal amounts of G and C. Watson and Crick deduced the double helix structure of DNA shortly thereafter.
  • Genes are said to have a message strand and a complementary "transcribed strand." The transcribed strand is used as a template to create RNA with the same sequence as the message strand, except thymine (T) is replaced by uracil (U) in RNA.
  • If a gene encodes a protein (as most genes do), the associated RNA will ultimately be parsed three bases at a time to determine how to encode the protein. The three-base "words" in DNA/RNA are called codons.
  • There are 64 possible codons. Sixty-one of them map to the 20 amino acids; the other three are stop signals. Some amino acids have as many as six synonymous codons; others have just one.
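
For readers who'd rather see the sixty-second primer as code, here is a minimal Python sketch of the pairing, transcription, and codon-parsing steps. The function names (complement, transcribe, codons) are my own illustrative choices, not part of any library, and the sketch assumes uppercase sequences containing only A, C, G, and T.

```python
# Pairing rule: A<->T and G<->C.
PAIRING = str.maketrans("ACGT", "TGCA")

def complement(strand):
    """Return the base-by-base complement of a DNA strand.

    The result is written in the same left-to-right order as the input,
    matching the GGTCA / CCAGT example in the text. Read in its own
    5'-to-3' direction, the opposite strand would be this string reversed.
    """
    return strand.translate(PAIRING)

def transcribe(message_strand):
    """Return the RNA version of the message strand (T becomes U)."""
    return message_strand.replace("T", "U")

def codons(sequence):
    """Split a sequence into three-base codons, dropping any incomplete tail."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]

if __name__ == "__main__":
    print(complement("GGTCA"))       # CCAGT
    print(transcribe("ATGGCT"))      # AUGGCU
    print(codons("ATGGCTTAAG"))      # ['ATG', 'GCT', 'TAA']
```

The trailing G in the last example is dropped because it doesn't complete a codon.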
In my leprosy post, I happened to mention (and show data for) the fact that most codons begin with a purine (A or G). That's true not just for the leprosy bacterium but for almost all genes in all organisms. This fact isn't widely acknowledged in the literature (although Geoffrey Zubay mentions it in passing on page 495 of Origins of Life). The first base in a codon tends to be A or G about 60% of the time. This is true whether the genome is rich in G+C or poor in G+C. (The G+C content of genomes varies widely; it's often used as a taxonomic criterion.)

An example of an organism with very high G+C content in its genome is the common soil bacterium Streptomyces griseus, with 72% G+C. In bacteria, genome G+C content correlates loosely (r=0.46) with genome size, and sure enough, the genome of S. griseus is on the portly side, with 7265 genes encoded in 8.7 million base-pairs of DNA. Just for fun, I created a graph (below) that shows how DNA base content varies according to codon base position. Relative purine content (A+G) is plotted on the y-axis; G+C content is on the x-axis. Each dot represents one gene's statistics. (For every gene, I tallied the average purine usage in all the gene's codons, and the average G+C content, at each codon base position.) Red dots are for the first base in a codon. Gold dots are for base two. Blue dots are for base three.
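
The per-gene tally described above can be sketched in a few lines. My own scripts were written in JavaScript; what follows is a Python approximation rather than the original code, and the function name position_stats and the toy sequence are invented for illustration. It assumes the input is a coding sequence that starts in frame.

```python
PURINES = set("AG")
GC = set("GC")

def position_stats(cds):
    """Return [(gc_fraction, purine_fraction)] for codon positions 1, 2, 3.

    Assumes `cds` is an in-frame coding sequence; any incomplete
    trailing codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    if usable == 0:
        raise ValueError("sequence is shorter than one codon")
    stats = []
    for pos in range(3):
        # Every third base, starting at this codon position.
        bases = cds[pos:usable:3]
        gc = sum(b in GC for b in bases) / len(bases)
        purine = sum(b in PURINES for b in bases) / len(bases)
        stats.append((gc, purine))
    return stats

if __name__ == "__main__":
    # Toy five-codon gene (hypothetical): ATG GCT GAA AAG TAA
    for pos, (gc, purine) in enumerate(position_stats("ATGGCTGAAAAGTAA"), 1):
        print(f"base {pos}: G+C = {gc:.2f}, A+G = {purine:.2f}")
```

Each gene contributes three (x, y) points to the graph: one per codon position, with G+C as x and A+G as y.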

Base composition by codon position for 7265 genes in S. griseus strain XylebKG-1. Each point represents statistics for one gene. Notice the extremely high G+C content in base 3. See text for discussion.

Notice how the blue dots (representing base three) slam up against the right edge of the graph. The data points look as if they're cut off or something, but they're not! The "jammed up" appearance is the result of base three having an extremely high G+C content in Streptomyces, coming very close to the theoretical maximum of 100% in many genes. This is nothing unusual. In organisms with high-GC genomes, the third codon position is usually very, very high in G+C, while in organisms with low-GC genomes, the "wobble base" (as it's sometimes called) is usually extraordinarily rich in A+T. The third codon position is, to a large extent, informationally redundant (due to codon degeneracy) and so the choice of base here usually corresponds to G or C if the genome is characteristically GC-rich, or A or T if the genome is AT-rich.
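
To put a number on "informationally redundant," here is a rough Python sketch that walks the standard genetic code and counts, for each codon position, how many single-base changes leave the amino acid unchanged. The table is built from the usual compact 64-letter listing (bases in T, C, A, G order). The counting convention is my own assumption: stop codons are skipped as starting points, and a change that produces a stop codon counts as non-synonymous.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...; '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of synonymous codons per amino acid (stops excluded).
DEGENERACY = Counter(aa for aa in CODE.values() if aa != "*")

def synonymous_counts():
    """Return (same, total): per-position counts of single-base changes
    that keep the amino acid, and of all single-base changes tried."""
    same, total = [0, 0, 0], [0, 0, 0]
    for codon, aa in CODE.items():
        if aa == "*":
            continue  # only sense codons are used as starting points
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total[pos] += 1
                same[pos] += CODE[mutant] == aa
    return same, total

if __name__ == "__main__":
    same, total = synonymous_counts()
    for pos in range(3):
        print(f"base {pos + 1}: {same[pos]} of {total[pos]} changes are synonymous")
```

Under these assumptions, roughly 69% of third-position changes are silent (126 of 183), versus under 5% at position one (8 of 183) and none at position two. That asymmetry is why base three is free to drift toward whatever G+C content the genome favors.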

Unlike position three, the first base in a codon is highly significant, informationally, and is subject to strong selection pressure. Notice how the red dots cluster fairly high on the graph (median: y=0.5957, plus or minus 0.0497 SD). It means most codons begin with a purine (A or G). Don't ask me how or why nature made this choice. The choice is pretty clear, though. Sixty percent of the time, you'll find A or G in position one. In some genes, it's more like 75%; very few are under 50%.
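
For anyone who wants to reproduce summary numbers like the ones quoted above, this is roughly how the per-gene values roll up into a median and standard deviation. The sketch uses Python's standard statistics module; the input values are made up purely for illustration (they are not S. griseus data), and the use of the sample standard deviation is an assumption on my part.

```python
import statistics

def summarize(purine_fractions):
    """Summarize per-gene first-position purine fractions."""
    return {
        "median": statistics.median(purine_fractions),
        # Sample standard deviation; pstdev() would give the population form.
        "sd": statistics.stdev(purine_fractions),
        # Share of genes where at least half of all codons start with A or G.
        "share_at_or_above_half": sum(v >= 0.5 for v in purine_fractions)
        / len(purine_fractions),
    }

if __name__ == "__main__":
    # Hypothetical per-gene values, one number per gene.
    demo = [0.48, 0.55, 0.60, 0.62, 0.75]
    print(summarize(demo))
```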

Base two (gold dots) shows a relaxation of G+C preference (median: x=0.5139) and a reduced tendency for purine usage (median: y=0.4475), but you'll notice, if you look closely, that there's a secondary cloud of gold points under the main cloud. This secondary cloud of points has a special significance, which I'll talk about in my next post.

To complement the above graph, I thought it might be fun to do the same sort of analysis on a genome containing conspicuously low G+C content. For this, I chose Clostridium botulinum (Hall strain), which has a genomic G+C content of just 28.2%. The botulinum genome encodes 3404 genes in 3.7 million base-pairs of DNA. Like S. griseus, C. botulinum can be found in soil, but unlike Streptomyces (which is aerobic), members of the Clostridiales are famously anaerobic and find oxygen downright intolerable (not that it matters for this analysis, however).

When we look at base compositions in various codon positions for the 3404 coding regions of C. botulinum DNA, we get the following graph:

Here, notice that the blue dots are way over to the left, reflecting the organism's preference for a very low overall genomic G+C content. Base two (gold points) is not quite as GC-extreme. But notice where the red dots are clustering: They're quite high on the graph, at a median y-value of 0.6956, with very few data points falling below 0.6. The preference for a purine in codon position one is very strong in C. botulinum.

If you look carefully, you can see (once again) a tiny "breakaway group" of gold dots underneath the main cloud of gold (base two) data points. This has a special significance; I'll explain it in my next post.

In sum: Two features of codon base composition are highly general and apply across organisms (and across domains of life). First: The third ("wobble") base has the most extreme G+C content, biased toward high G+C (if the organism has characteristically high genomic GC content) or toward low G+C (if the organism has a low-GC genome). Second: The base in codon position one tends to be a purine in most codons, in most proteins, in most organisms, most of the time. The first rule is easy to explain: low information content and low selection pressure allow an organism to put whatever base is most convenient (whatever's lying around handy, you might say) in the "wobble" position. The second rule is harder to explain. Nature simply prefers a purine in the first position of a codon.

Genomes (from the National Center for Biotechnology Information) were downloaded from on April 1, 2014. Data processing was done with custom scripts (written in JavaScript). Graphs (and statistics) were generated using the excellent service at

