blogorrhea: purine ratio

A universal feature of protein-coding genes is that they tend to use a lot of codons that begin with a purine (A or G). In fact, it's typical for a given gene's codons to use a purine in position one 60% or more of the time. But it's fair to ask: How universal is this trend, exactly? Does the rule apply for organisms with extremely high (or low) genomic G+C content? Does it apply for endosymbionts with greatly reduced genomes? Is it just a "sometimes" rule? Are there important exceptions?

I decided to collect codon statistics for 109 different bacterial species, representing members of all major taxonomic groups, with a wide range of genome sizes and GC percentages. For each organism, I determined the average percent A+G content in codon base one (AG1) across all CDS genes. Then I plotted AG1 against the genomic A+T content for each organism. (A+T is of course just one minus the G+C content.) Here's the graph of AG1 content for all the organisms:

Codon base-one purine content (average for all CDS genes) versus genomic A+T content for N=109 bacterial species. Dot size corresponds to genome size.

The fun thing about this graph is that each data point is sized according to the genome size of the organism in question (in other words, the area of the dot is proportional to genome size). As you can see, bacteria at the high end of the A+T scale (low G+C) tend to have smaller genomes. But the more important thing to notice is that AG1 is 58% or more for all 109 genomes. This means that the phenomenon of high average purine content in codon base one appears to be universal, at least for the sample group. (Organism names are listed in a table below.)
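For anyone who wants to reproduce the AG1 statistic described above, here's a minimal sketch in Python. It is not the script used to make the graph. The function names (`ag1`, `mean_ag1`) are made up for illustration, the toy gene is invented, and two choices are assumptions on my part: the stop codon is counted like any other codon, and the genome-wide figure is taken as an unweighted mean of per-gene values.

```python
PURINES = frozenset("AG")

def ag1(cds):
    """Fraction of codons in one CDS whose first base is a purine (A or G)."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3      # ignore a trailing partial codon
    first_bases = seq[0:usable:3]         # every third base, starting at base one
    if not first_bases:
        raise ValueError("sequence is shorter than one codon")
    return sum(base in PURINES for base in first_bases) / len(first_bases)

def mean_ag1(cds_list):
    """Unweighted mean of per-gene AG1 (an assumed averaging scheme)."""
    values = [ag1(cds) for cds in cds_list]
    return sum(values) / len(values)

# Invented toy gene: ATG GCT AAA GAA CTG TAA
# First bases are A, G, A, G, C, T, so four of six codons start with a purine.
toy_gene = "ATGGCTAAAGAACTGTAA"
print(ag1(toy_gene))
```

Pooling all codons in the genome before dividing would weight long genes more heavily and could give a slightly different number than the per-gene mean.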

Of course, within a given genome, genes vary somewhat in terms of the per-gene average AG1, but it's still quite rare to find a protein gene that has average AG1 under 50%. For example, below is a histogram plot of AG1 content for all protein-coding genes of Sorangium cellulosum, a bacterium with genomic GC content of 72% (A+T = 28%).

Per-gene AG1 usage (codon base-one purine content) for all CDS genes of Sorangium cellulosum.

As you can see, very few genes lie to the left of x = 0.5. (Of Sorangium's 10,400 protein genes, only 321 have an average AG1 under 50%. Those could easily be mis-annotated genes or gene fragments.) Most organisms show much the same distribution of average AG1 values across CDS genes.

Gene annotation programs could probably benefit from using a check of AG1 to verify that a putative gene is in the correct reading frame. GC3 content is often used in this way, but AG1 is actually a much more discriminating test, especially with low-GC genomes (where the "wobble base" GC percentage is not particularly helpful).
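The reading-frame check suggested above could look something like the sketch below. This is only an illustration of the idea, not how any existing annotation program works. The function names (`ag1_by_frame`, `best_frame`) are hypothetical, and the toy sequence is deliberately idealized (purine in every first position, pyrimidines everywhere else) so that the frames separate cleanly. Real genes will give much less extreme scores.

```python
PURINES = frozenset("AG")

def ag1_by_frame(seq):
    """Return [AG1 in frame 0, frame 1, frame 2] for a nucleotide sequence."""
    seq = seq.upper()
    scores = []
    for offset in range(3):
        n_codons = (len(seq) - offset) // 3
        # First base of each complete codon in this frame.
        first_bases = seq[offset:offset + 3 * n_codons:3]
        if n_codons == 0:
            scores.append(0.0)
        else:
            scores.append(sum(b in PURINES for b in first_bases) / n_codons)
    return scores

def best_frame(seq):
    """Index (0, 1 or 2) of the frame with the highest AG1."""
    scores = ag1_by_frame(seq)
    return scores.index(max(scores))

# Idealized toy ORF: ATC GCT ACC GTT ATT GCC
toy_orf = "ATCGCTACCGTTATTGCC"
print(ag1_by_frame(toy_orf), best_frame(toy_orf))

# Prepending one base shifts the true frame from 0 to 1.
shifted = "T" + toy_orf
print(ag1_by_frame(shifted), best_frame(shifted))
```

A putative gene whose annotated frame scores under 0.5 while another frame scores well above it would be a candidate for a second look. The 0.5 cutoff is an assumption drawn from the histogram discussion, not a calibrated threshold.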

Listed below are the 109 organisms (and their taxonomic categorizations) used in this investigation.

Organism	Taxon
Acidaminococcus fermentans strain DSM 20731	Firmicutes:Clostridia
Acidovorax avenae subsp. citrulli strain AAC00-1	Proteobacteria:Betaproteobacteria
Aerococcus urinae strain ACS-120-V-Col10a	Firmicutes:Lactobacillales
Aeromonas hydrophila strain ML09-119	Proteobacteria:Gammaproteobacteria
Aggregatibacter actinomycetemcomitans D11S-1	Proteobacteria:Gammaproteobacteria
Agrobacterium radiobacter strain K84	Proteobacteria:Alphaproteobacteria
Anaerobaculum mobile strain DSM 13181	Synergistetes:Synergistia
Anaerocellum thermophilum strain DSM 6725	Firmicutes:Clostridia
Anaerolinea thermophila strain UNI-1	Chloroflexi:Anaerolineae
Anaplasma marginale strain Florida	Proteobacteria:Alphaproteobacteria
Arcobacter butzleri ED-1	Proteobacteria:Epsilonproteobacteria
Atopobium vaginae strain DSM 15829	Actinobacteria:Coriobacteridae
Azospirillum brasilense strain Sp245	Proteobacteria:Alphaproteobacteria
Bacillus amyloliquefaciens strain Y2	Firmicutes:Bacilli
Bacillus anthracis strain CDC 684	Firmicutes:Bacillales
Bacillus subtilis BEST7613 strain PCC 6803	Firmicutes:Bacilli
Bacteroides dorei strain 5_1_36/D4	Bacteroidetes:Bacteroidia
Bartonella quintana strain RM-11	Proteobacteria:Alphaproteobacteria
Blastococcus saxobsidens strain DD2	Actinobacteria:Actinobacteridae
Borrelia miyamotoi strain LB-2001	Spirochaetes:Spirochaetales
Brachybacterium faecium strain DSM 4810	Actinobacteria:Actinobacteridae
Brucella ovis strain ATCC 25840	Proteobacteria:Alphaproteobacteria
Buchnera aphidicola (Acyrthosiphon pisum) strain 5A	Proteobacteria:Gammaproteobacteria
Burkholderia pseudomallei strain 1710b	Proteobacteria:Betaproteobacteria
Caldicellulosiruptor lactoaceticus strain 6A	Firmicutes:Clostridia
Calditerrivibrio nitroreducens strain DSM 19672	Deferribacteres:Deferribacterales
Campylobacter concisus strain 13826	Proteobacteria:Epsilonproteobacteria
Candidatus Cloacamonas acidaminovorans	candidate division WWE1:Candidatus Cloacamonas
Candidatus Methylomirabilis oxyfera	candidate division NC10:Candidatus Methylomirabilis
Candidatus Pelagibacter ubique strain HTCC1062	Proteobacteria:Alphaproteobacteria
Carboxydothermus hydrogenoformans strain Z-2901	Firmicutes:Clostridia
Chlamydia trachomatis strain L2/434/Bu	Chlamydiae:Chlamydiales
Clostridium botulinum A strain Hall	Firmicutes:Clostridia
Coprobacillus sp. strain 8_2_54BFAA	Firmicutes:Erysipelotrichia
Coprococcus catus strain GD/7	Firmicutes:Clostridia
Cycloclasticus zancles strain 7-ME	Proteobacteria:Gammaproteobacteria
Deinococcus radiodurans strain R1	Deinococcus-Thermus:Deinococci
Desulfococcus oleovorans strain Hxd3	Proteobacteria:Deltaproteobacteria
Ehrlichia canis strain Jake	Proteobacteria:Alphaproteobacteria
Enterobacter cloacae strain SCF1	Proteobacteria:Gammaproteobacteria
Erwinia amylovora strain ATCC 49946	Proteobacteria:Gammaproteobacteria
Escherichia coli B strain REL606	Proteobacteria:Gammaproteobacteria
Geobacillus kaustophilus strain HTA426	Firmicutes:Bacillales
Geobacillus thermoleovorans strain CCB_US3_UF5	Firmicutes:Bacillales
Geobacter metallireducens strain GS-15	Proteobacteria:Deltaproteobacteria
Geobacter sulfurreducens strain KN400	Proteobacteria:Deltaproteobacteria
Geobacter sulfurreducens strain PCA	Proteobacteria:Deltaproteobacteria
Geobacter uraniireducens strain Rf4	Proteobacteria:Deltaproteobacteria
Geodermatophilus obscurus strain DSM 43160	Actinobacteria:Actinobacteridae
Gordonia bronchialis strain DSM 43247	Actinobacteria:Actinobacteridae
Haemophilus ducreyi strain 35000HP	Proteobacteria:Gammaproteobacteria
Halogeometricum borinquense DSM 11551	Euryarchaeota:Halobacteria
Helicobacter pylori (Helicobacter pylori SAfr7) strain SouthAfrica7	Proteobacteria:Epsilonproteobacteria
Klebsiella oxytoca strain 10-5243	Proteobacteria:Gammaproteobacteria
Kribbella flavida strain DSM 17836	Actinobacteria:Actinobacteridae
Ktedonobacter racemifer DSM 44963	Chloroflexi:Ktedonobacteria
Lactobacillus acidophilus strain 30SC	Firmicutes:Lactobacillales
Lactobacillus reuteri strain MM4-1A	Firmicutes:Lactobacillales
Lactococcus lactis subsp. cremoris strain A76	Firmicutes:Bacilli
Leptolyngbya sp. PCC 7376	Cyanobacteria:Oscillatoriophycideae
Leptonema illini strain DSM 21528	Spirochaetes:Spirochaetales
Leptospira biflexa serovar Patoc strain Ames; Patoc 1	Spirochaetes:Spirochaetales
Leuconostoc gasicomitatum LMG 18811 strain type LMG 18811	Firmicutes:Lactobacillales
Mesorhizobium australicum strain WSM2073	Proteobacteria:Alphaproteobacteria
Mesorhizobium ciceri biovar biserrulae strain WSM1271	Proteobacteria:Alphaproteobacteria
Methylobacillus flagellatus strain KT	Proteobacteria:Betaproteobacteria
Methylophaga sp. strain JAM7	Proteobacteria:Gammaproteobacteria
Mycobacterium tuberculosis = ATCC 35801 strain ATCC35801; Erdman	Actinobacteria:Actinobacteridae
Mycoplasma gallisepticum strain F	Tenericutes:Mollicutes
Neisseria gonorrhoeae strain NCCP11945	Proteobacteria:Betaproteobacteria
Nocardia brasiliensis ATCC 700358 strain HUJEG-1	Actinobacteria:Actinobacteridae
Nocardia cyriacigeorgica strain GUH-2	Actinobacteria:Actinobacteridae
Nostoc sp. PCC 7120 (Anabaena sp. PCC 7120) strain PCC7120	Cyanobacteria:Nostocales
Novosphingobium aromaticivorans strain DSM 12444	Proteobacteria:Alphaproteobacteria
Oceanobacillus kimchii strain X50	Firmicutes:Bacilli
Orientia tsutsugamushi strain Ikeda	Proteobacteria:Alphaproteobacteria
Paenibacillus polymyxa strain M1	Firmicutes:Bacilli
Polynucleobacter necessarius strain STIR1	Proteobacteria:Betaproteobacteria
Propionibacterium acnes TypeIA2 strain P.acn33	Actinobacteria:Actinobacteridae
Proteus mirabilis strain HI4320	Proteobacteria:Gammaproteobacteria
Pseudomonas fluorescens strain Pf0-1	Proteobacteria:Gammaproteobacteria
Pseudonocardia dioxanivorans strain CB1190	Actinobacteria:Actinobacteridae
Ralstonia eutropha strain H16	Proteobacteria:Betaproteobacteria
Rhizobium tropici strain CIAT 899	Proteobacteria:Alphaproteobacteria
Rhodobacter sphaeroides ATCC 17029	Proteobacteria:Alphaproteobacteria
Shigella boydii strain Sb227	Proteobacteria:Gammaproteobacteria
Slackia heliotrinireducens strain DSM 20476	Actinobacteria:Coriobacteridae
Sorangium cellulosum strain So0157-2	Proteobacteria:Deltaproteobacteria
Staphylococcus aureus strain 04-02981	Firmicutes:Bacillales
Streptococcus agalactiae strain 2603V/R	Firmicutes:Lactobacillales
Streptomyces cf. griseus strain XylebKG-1	Actinobacteria:Actinobacteridae
Streptosporangium roseum strain DSM 43021	Actinobacteria:Actinobacteridae
Sulfurimonas denitrificans DSM 1251 strain ATCC 33889	Proteobacteria:Epsilonproteobacteria
Thioalkalivibrio nitratireducens strain DSM 14787	Proteobacteria:Gammaproteobacteria
Thiobacillus denitrificans strain ATCC 25259	Proteobacteria:Betaproteobacteria
Treponema azotonutricium strain ZAS-9	Spirochaetes:Spirochaetales
Treponema pedis strain T A4	Spirochaetes:Spirochaetales
Turneriella parva strain DSM 21527	Spirochaetes:Spirochaetales
Vibrio cholerae strain BX 330286	Proteobacteria:Gammaproteobacteria
Wolbachia endosymbiont strain TRS of Brugia malayi	Proteobacteria:Alphaproteobacteria
Yersinia pestis D106004	Proteobacteria:Gammaproteobacteria
Bacillus thuringiensis serovar andalousiensis strain BGSC 4AW1	Firmicutes:Bacillales
Ureaplasma urealyticum serovar 5 strain ATCC 27817	Tenericutes:Mollicutes
Bordetella pertussis strain 18323	Proteobacteria:Betaproteobacteria
Comamonas testosteroni strain KF-1	Proteobacteria:Betaproteobacteria
Eikenella corrodens strain ATCC 23834	Proteobacteria:Betaproteobacteria
Janthinobacterium sp. strain Marseille	Proteobacteria:Betaproteobacteria
Rhodopirellula baltica SH strain 1	Planctomycetes:Planctomycetacia
Blastopirellula marina strain DSM 3645	Planctomycetes:Planctomycetacia

The surprises just keep coming. When you start doing comparative genomics on the desktop (which is so easy with all the great tools at genomevolution.org and elsewhere), it's amazing how quickly you run into things that make you slap yourself on the side of the head and go "Whaaaa????"

If you know anything about DNA (or even if you don't), this one will set you back.

I've written before about Chargaff's second parity rule, which (peculiarly) states that A = T and G = C not just for double-stranded DNA (that's the first parity rule) but for bases in a single strand of DNA. The first parity rule is basic: It's what allows one strand of DNA to be complementary to another. The second parity rule is not so intuitive. Why should the amount of adenine have to equal the amount of thymine (or guanine equal cytosine) in a single strand of DNA? The conventional argument is that nature doesn't play favorites with purines and pyrimidines. There's no reason (in theory) why a single strand of DNA should have an excess of purines over pyrimidines or vice versa, all things being equal.

But it turns out that strand asymmetry vis-a-vis purines and pyrimidines is not just common, it's the rule. (Some call it Szybalski's rule, in fact.) You can prove it to yourself very easily. If you obtain a codon usage chart for a particular organism, then add up the frequencies of occurrence of each base across all the codons, you get the relative abundances of the four bases (A, G, T, C) for the coding regions on which the codon chart was based. Let's take a simple example that requires no calculation: Clostridium botulinum. Just by eyeballing the chart below, you can quickly see that (for C. botulinum) codons rich in the purines A and G are used far more often than codons rich in the pyrimidines T and C. (Note the green-highlighted codons.)

If you do the math, you'll find that in C. botulinum, G and A (combined) outnumber T and C by a factor of 1.41. That's a pretty extreme purine:pyrimidine ratio. (Remember that we're dealing with a single strand of DNA here. Codon frequencies are derived from the so-called "message strand" of DNA in coding regions.)
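Here's a minimal sketch of "doing the math" from a codon usage chart. The function names (`base_counts`, `purine_pyrimidine_ratio`) are hypothetical, and the three-codon table is invented purely to make the arithmetic easy to follow. It is not C. botulinum data. I'm also assuming the chart may be written with U instead of T, so U is converted to T.

```python
from collections import Counter

def base_counts(codon_usage):
    """Sum per-base abundance over a {codon: frequency} table."""
    counts = Counter()
    for codon, freq in codon_usage.items():
        # Accept RNA-style charts by treating U as T.
        for base in codon.upper().replace("U", "T"):
            counts[base] += freq
    return counts

def purine_pyrimidine_ratio(counts):
    """(A + G) / (T + C) for the single strand the codons were read from."""
    return (counts["A"] + counts["G"]) / (counts["T"] + counts["C"])

# Invented toy chart: AAA used 3 times, GAT twice, TTC once.
toy_usage = {"AAA": 3, "GAT": 2, "TTC": 1}
counts = base_counts(toy_usage)
print(dict(counts))                      # A=11, G=2, T=4, C=1
print(purine_pyrimidine_ratio(counts))   # 13 purines over 5 pyrimidines
```

The frequencies can be raw counts or per-thousand values. The ratio comes out the same either way, as long as every codon in the chart uses the same units.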

I've done this calculation for 1,373 different bacterial species (don't worry, it's all automated), and the bottom line is, the greater the DNA's A+T content (or, equivalently, the less its G+C content), the greater the purine imbalance. (See this post for a nice graph.)

If you inspect enough codon charts you'll quickly realize that Chargaff's second parity rule never holds true (except now and then by chance). It's a bogus rule, at least in coding regions (DNA that actually gets transcribed in vivo). It may have applicability to pseudogenes or "junk DNA" (but then again, I haven't checked; it may well not apply there either).

If Chargaff's second rule were true, we would expect to find that G = C (and A = T), because that's what the rule says. I went through the codon frequency data for 1,373 different bacterial species and then plotted the ratio of G to C (which Chargaff says should equal 1.0) for each species against the A+T content (which is a kind of phylogenetic signature) for each species. I was shocked by what I found:

Using base abundances derived from codon frequency data, I calculated G/C for 1,373 bacterial species and plotted it against total A+T content. (Each dot represents a genome for a particular organism.) Chargaff's second parity rule predicts a horizontal line at y=1.0. Clearly, that rule doesn't hold.

I wasn't so much shocked by the fact that Chargaff's rule doesn't hold; I already knew that. What's shocking is that the ratio of G to C goes up as A+T increases, which means G/C is going up even as G+C is going down. (By definition, G+C goes down as A+T goes up.)

Chargaff says G/C should always equal 1.0. In reality, it never does except by chance. What we find is, the less G (or C) the DNA has, the greater the ratio of G to C. To put it differently: At the high-AT end of the phylogenetic scale, cytosine is decreasing faster (much faster) than guanine, as overall G+C content goes down.
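To make the two axes of that plot concrete, here's a small sketch of how one genome's point could be computed from its base abundances. The function name `gc_ratio_and_at` is hypothetical, and both sets of counts are invented to illustrate the trend. They are not taken from any of the 1,373 genomes.

```python
def gc_ratio_and_at(counts):
    """Return (G/C ratio, A+T fraction) from a dict of base abundances."""
    total = sum(counts[base] for base in "ACGT")
    g_over_c = counts["G"] / counts["C"]
    at_fraction = (counts["A"] + counts["T"]) / total
    return g_over_c, at_fraction

# Two invented genomes (counts per 100 coding bases).
high_gc = {"A": 16, "T": 14, "G": 37, "C": 33}   # A+T = 0.30
low_gc = {"A": 40, "T": 30, "G": 18, "C": 12}    # A+T = 0.70

for name, counts in (("high-GC", high_gc), ("low-GC", low_gc)):
    ratio, at = gc_ratio_and_at(counts)
    print(name, round(at, 2), round(ratio, 3))
```

In this made-up pair, G drops from 37 to 18 while C drops from 33 to 12, so G/C rises even though G+C falls. That's the pattern described above: cytosine falling faster than guanine.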

When I first plotted this graph, I used a linear regression to get the line that minimizes the sum of squared errors. That line turned out to be given by G/C = 0.638 + [A+T]. Then I saw that the data looked curved, not linear. So I refitted the data with a power curve (the red curve shown above) given by

G/C = 1.0 + 0.587*[A+T] + 1.618*[A+T]²

which fit the data even better (minimum summed error 0.1119 instead of 0.1197). What struck me as strange is that the Golden Ratio (1.618) shows up in the power-curve formula (above), but also, the linear form of the regression has G/C equaling 1.638 when [A+T] goes to 1.0. Which is almost the Golden Ratio.
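For readers unfamiliar with least-squares fitting, here's a minimal sketch of the linear case in plain Python. The function name `fit_line` is hypothetical. The points are synthetic, generated directly from the reported line G/C = 0.638 + [A+T], so the fit simply recovers that line. They are not the actual per-genome data. The last few lines compare the line's value at [A+T] = 1.0 with the Golden Ratio.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic points lying exactly on the reported regression line.
at_values = [0.3, 0.4, 0.5, 0.6, 0.7]
gc_ratios = [0.638 + at for at in at_values]

intercept, slope = fit_line(at_values, gc_ratios)
print(round(intercept, 3), round(slope, 3))

# Value of the fitted line at [A+T] = 1.0 versus the Golden Ratio.
golden = (1 + 5 ** 0.5) / 2
at_unity = intercept + slope * 1.0
print(round(at_unity, 3), round(golden, 3), round(at_unity - golden, 3))
```

With real data the points scatter around the line, and the summed squared error quoted above is what's left over after the best fit.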

In a previous post, I mentioned finding that the ratio A/T tends to approximate the Golden Ratio as A+T approaches 1.0. If this were to hold true, it could mean that A/T and G/C both approach the Golden Ratio as A+T approaches 1.0, which would be weird indeed.

For now, I'm not going to make the claim that the Golden Ratio figures into any of this, because it reeks too much of numerology and Intelligent Design (and I'm a fan of neither). I do think it's mildly interesting that A/T and G/C both approach a similar number as A+T approaches unity.

Comments, as usual, are welcome.


Tuesday, April 29, 2014

Genes, Codons, and Purines

Sunday, July 14, 2013

DNA Strand Asymmetry: More Surprises
