Showing posts with label purine ratio. Show all posts

Tuesday, April 29, 2014

Genes, Codons, and Purines

A universal feature of protein-coding genes is that they tend to use a lot of codons that begin with a purine (A or G). In fact, it's typical for a given gene's codons to use a purine in position one 60% or more of the time. But it's fair to ask: How universal is this trend, exactly? Does the rule apply for organisms with extremely high (or low) genomic G+C content? Does it apply for endosymbionts with greatly reduced genomes? Is it just a "sometimes" rule? Are there important exceptions?

I decided to collect codon statistics for 109 different bacterial species, representing members of all major taxonomic groups, with a wide range of genome sizes and GC percentages. For each organism, I determined the average percent A+G content in codon base one (AG1) across all CDS genes. Then I plotted AG1 against the genomic A+T content for each organism. (A+T is of course just one minus the G+C content.) Here's the graph of AG1 content for all the organisms:

Codon base-one purine content (average for all CDS genes) versus genomic A+T content for N=109 bacterial species. Dot size corresponds to genome size.

The fun thing about this graph is that each data point is sized according to the genome size of the organism in question (in other words, the area of the dot is proportional to genome size). As you can see, bacteria at the high end of the A+T scale (low G+C) tend to have smaller genomes. But the more important thing to notice is that AG1 is 58% or more for all 109 genomes. This means that the phenomenon of high average purine content in codon base one appears to be universal, at least for the sample group. (Organism names are listed in a table below.)
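For readers who want to reproduce the AG1 statistic, here is a minimal Python sketch. It assumes each CDS is supplied as a plain DNA string already in its annotated reading frame, and it takes the genome-level figure to be the unweighted mean of the per-gene values. The post doesn't say whether genes were averaged or pooled, so that choice, like the function names `ag1` and `mean_ag1`, is an assumption.

```python
PURINES = frozenset("AG")


def ag1(cds: str) -> float:
    """Fraction of codons in a coding sequence whose first base is A or G."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3      # ignore any trailing partial codon
    first_bases = seq[0:usable:3]         # bases at positions 0, 3, 6, ...
    if not first_bases:
        raise ValueError("sequence contains no complete codon")
    return sum(base in PURINES for base in first_bases) / len(first_bases)


def mean_ag1(cds_list):
    """Unweighted average of per-gene AG1 (assumed averaging scheme)."""
    values = [ag1(cds) for cds in cds_list]
    return sum(values) / len(values)


# Toy example: codons ATG, GCT, AAA, TGA start with A, G, A, T.
print(ag1("ATGGCTAAATGA"))  # 0.75
```

Plotting `mean_ag1` for each genome against its A+T fraction would give one dot per organism, as in the graph above.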

Of course, within a given genome, genes vary somewhat in terms of the per-gene average AG1, but it's still quite rare to find a protein gene that has average AG1 under 50%. For example, below is a histogram plot of AG1 content for all protein-coding genes of Sorangium cellulosum, a bacterium with genomic GC content of 72% (A+T = 28%).

Per-gene AG1 usage (codon base-one purine content) for all CDS genes of Sorangium cellulosum.

As you can see, very few genes lie to the left of x = 0.5. (Of Sorangium's 10,400 protein genes, only 321 have an average AG1 under 50%. Those could easily be mis-annotated genes or gene fragments.) Most organisms show much the same distribution of average AG1 values across CDS genes.
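For scale, 321 out of 10,400 genes is roughly 3%. A rough sketch of how such a tally and histogram could be produced from a list of per-gene AG1 values follows. The helper names `fraction_below` and `histogram` are hypothetical, and the values are assumed to lie between 0 and 1.

```python
def fraction_below(values, threshold=0.5):
    """Share of per-gene AG1 values that fall under the threshold."""
    return sum(v < threshold for v in values) / len(values)


def histogram(values, bins=20):
    """Count values in `bins` equal-width bins covering the range 0 to 1."""
    counts = [0] * bins
    for v in values:
        index = min(int(v * bins), bins - 1)  # a value of exactly 1.0 goes in the last bin
        counts[index] += 1
    return counts


# Made-up per-gene values, for illustration only.
sample = [0.45, 0.58, 0.61, 0.63, 0.66, 0.70]
print(histogram(sample, bins=10))
print(fraction_below(sample))
```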

Gene annotation programs could probably benefit from using a check of AG1 to verify that a putative gene is in the correct reading frame. GC3 content is often used in this way, but AG1 is actually a much more discriminating test, especially with low-GC genomes (where the "wobble base" GC percentage is not particularly helpful).
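As a sketch of what such a frame check might look like (this is not how any existing annotation program works, as far as I know), one can score all three forward frames of a candidate sequence by first-position purine content and see which frame wins. The function names are hypothetical, only the forward strand is considered, and on real genes the result is a hint rather than proof.

```python
def ag1_by_frame(seq):
    """First-position purine fraction for each of the three forward frames."""
    seq = seq.upper()
    scores = []
    for offset in range(3):
        # Start positions of complete codons in this frame.
        starts = range(offset, len(seq) - 2, 3)
        first_bases = [seq[i] for i in starts]
        purines = sum(base in "AG" for base in first_bases)
        scores.append(purines / len(first_bases) if first_bases else 0.0)
    return scores


def best_frame(seq):
    """Offset (0, 1 or 2) of the frame with the highest AG1; ties go to the lowest offset."""
    scores = ag1_by_frame(seq)
    return max(range(3), key=scores.__getitem__)


# Synthetic example: a run of GCT codons shifted right by one base.
print(best_frame("T" + "GCT" * 10))  # 1
```

If the annotated frame of a putative gene is not the frame that `best_frame` picks, the annotation may deserve a second look.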

Listed below are the 109 organisms (and their taxonomic categorizations) used in this investigation.

Organism Taxon
Acidaminococcus fermentans strain DSM 20731 Firmicutes:Clostridia
Acidovorax avenae subsp. citrulli strain AAC00-1 Proteobacteria:Betaproteobacteria
Aerococcus urinae strain ACS-120-V-Col10a Firmicutes:Lactobacillales
Aeromonas hydrophila strain ML09-119 Proteobacteria:Gammaproteobacteria
Aggregatibacter actinomycetemcomitans D11S-1 Proteobacteria:Gammaproteobacteria
Agrobacterium radiobacter strain K84 Proteobacteria:Alphaproteobacteria
Anaerobaculum mobile strain DSM 13181 Synergistetes:Synergistia
Anaerocellum thermophilum strain DSM 6725 Firmicutes:Clostridia
Anaerolinea thermophila strain UNI-1 Chloroflexi:Anaerolineae
Anaplasma marginale strain Florida Proteobacteria:Alphaproteobacteria
Arcobacter butzleri ED-1 Proteobacteria:Epsilonproteobacteria
Atopobium vaginae strain DSM 15829 Actinobacteria:Coriobacteridae
Azospirillum brasilense strain Sp245 Proteobacteria:Alphaproteobacteria
Bacillus amyloliquefaciens strain Y2 Firmicutes:Bacilli
Bacillus anthracis strain CDC 684 Firmicutes:Bacillales
Bacillus subtilis BEST7613 strain PCC 6803 Firmicutes:Bacilli
Bacteroides dorei strain 5_1_36/D4 Bacteroidetes:Bacteroidia
Bartonella quintana strain RM-11 Proteobacteria:Alphaproteobacteria
Blastococcus saxobsidens strain DD2 Actinobacteria:Actinobacteridae
Borrelia miyamotoi strain LB-2001 Spirochaetes:Spirochaetales
Brachybacterium faecium strain DSM 4810 Actinobacteria:Actinobacteridae
Brucella ovis strain ATCC 25840 Proteobacteria:Alphaproteobacteria
Buchnera aphidicola (Acyrthosiphon pisum) strain 5A Proteobacteria:Gammaproteobacteria
Burkholderia pseudomallei strain 1710b Proteobacteria:Betaproteobacteria
Caldicellulosiruptor lactoaceticus strain 6A Firmicutes:Clostridia
Calditerrivibrio nitroreducens strain DSM 19672 Deferribacteres:Deferribacterales
Campylobacter concisus strain 13826 Proteobacteria:Epsilonproteobacteria
Candidatus Cloacamonas acidaminovorans candidate division WWE1:Candidatus Cloacamonas
Candidatus Methylomirabilis oxyfera candidate division NC10:Candidatus Methylomirabilis
Candidatus Pelagibacter ubique strain HTCC1062 Proteobacteria:Alphaproteobacteria
Carboxydothermus hydrogenoformans strain Z-2901 Firmicutes:Clostridia
Chlamydia trachomatis strain L2/434/Bu Chlamydiae:Chlamydiales
Clostridium botulinum A strain Hall Firmicutes:Clostridia
Coprobacillus sp. strain 8_2_54BFAA Firmicutes:Erysipelotrichia
Coprococcus catus strain GD/7 Firmicutes:Clostridia
Cycloclasticus zancles strain 7-ME Proteobacteria:Gammaproteobacteria
Deinococcus radiodurans strain R1 Deinococcus-Thermus:Deinococci
Desulfococcus oleovorans strain Hxd3 Proteobacteria:Deltaproteobacteria
Ehrlichia canis strain Jake Proteobacteria:Alphaproteobacteria
Enterobacter cloacae strain SCF1 Proteobacteria:Gammaproteobacteria
Erwinia amylovora strain ATCC 49946 Proteobacteria:Gammaproteobacteria
Escherichia coli B strain REL606 Proteobacteria:Gammaproteobacteria
Geobacillus kaustophilus strain HTA426 Firmicutes:Bacillales
Geobacillus thermoleovorans strain CCB_US3_UF5 Firmicutes:Bacillales
Geobacter metallireducens strain GS-15 Proteobacteria:Deltaproteobacteria
Geobacter sulfurreducens strain KN400 Proteobacteria:Deltaproteobacteria
Geobacter sulfurreducens strain PCA Proteobacteria:Deltaproteobacteria
Geobacter uraniireducens strain Rf4 Proteobacteria:Deltaproteobacteria
Geodermatophilus obscurus strain DSM 43160 Actinobacteria:Actinobacteridae
Gordonia bronchialis strain DSM 43247 Actinobacteria:Actinobacteridae
Haemophilus ducreyi strain 35000HP Proteobacteria:Gammaproteobacteria
Halogeometricum borinquense DSM 11551 Euryarchaeota:Halobacteria
Helicobacter pylori (Helicobacter pylori SAfr7) strain SouthAfrica7 Proteobacteria:Epsilonproteobacteria
Klebsiella oxytoca strain 10-5243 Proteobacteria:Gammaproteobacteria
Kribbella flavida strain DSM 17836 Actinobacteria:Actinobacteridae
Ktedonobacter racemifer DSM 44963 Chloroflexi:Ktedonobacteria
Lactobacillus acidophilus strain 30SC Firmicutes:Lactobacillales
Lactobacillus reuteri strain MM4-1A Firmicutes:Lactobacillales
Lactococcus lactis subsp. cremoris strain A76 Firmicutes:Bacilli
Leptolyngbya sp. PCC 7376 Cyanobacteria:Oscillatoriophycideae
Leptonema illini strain DSM 21528 Spirochaetes:Spirochaetales
Leptospira biflexa serovar Patoc strain Ames; Patoc 1 Spirochaetes:Spirochaetales
Leuconostoc gasicomitatum LMG 18811 strain type LMG 18811 Firmicutes:Lactobacillales
Mesorhizobium australicum strain WSM2073 Proteobacteria:Alphaproteobacteria
Mesorhizobium ciceri biovar biserrulae strain WSM1271 Proteobacteria:Alphaproteobacteria
Methylobacillus flagellatus strain KT Proteobacteria:Betaproteobacteria
Methylophaga sp. strain JAM7 Proteobacteria:Gammaproteobacteria
Mycobacterium tuberculosis = ATCC 35801 strain ATCC35801; Erdman Actinobacteria:Actinobacteridae
Mycoplasma gallisepticum strain F Tenericutes:Mollicutes
Neisseria gonorrhoeae strain NCCP11945 Proteobacteria:Betaproteobacteria
Nocardia brasiliensis ATCC 700358 strain HUJEG-1 Actinobacteria:Actinobacteridae
Nocardia cyriacigeorgica strain GUH-2 Actinobacteria:Actinobacteridae
Nostoc sp. PCC 7120 (Anabaena sp. PCC 7120) strain PCC7120 Cyanobacteria:Nostocales
Novosphingobium aromaticivorans strain DSM 12444 Proteobacteria:Alphaproteobacteria
Oceanobacillus kimchii strain X50 Firmicutes:Bacilli
Orientia tsutsugamushi strain Ikeda Proteobacteria:Alphaproteobacteria
Paenibacillus polymyxa strain M1 Firmicutes:Bacilli
Polynucleobacter necessarius strain STIR1 Proteobacteria:Betaproteobacteria
Propionibacterium acnes TypeIA2 strain P.acn33 Actinobacteria:Actinobacteridae
Proteus mirabilis strain HI4320 Proteobacteria:Gammaproteobacteria
Pseudomonas fluorescens strain Pf0-1 Proteobacteria:Gammaproteobacteria
Pseudonocardia dioxanivorans strain CB1190 Actinobacteria:Actinobacteridae
Ralstonia eutropha strain H16 Proteobacteria:Betaproteobacteria
Rhizobium tropici strain CIAT 899 Proteobacteria:Alphaproteobacteria
Rhodobacter sphaeroides ATCC 17029 Proteobacteria:Alphaproteobacteria
Shigella boydii strain Sb227 Proteobacteria:Gammaproteobacteria
Slackia heliotrinireducens strain DSM 20476 Actinobacteria:Coriobacteridae
Sorangium cellulosum strain So0157-2 Proteobacteria:Deltaproteobacteria
Staphylococcus aureus strain 04-02981 Firmicutes:Bacillales
Streptococcus agalactiae strain 2603V/R Firmicutes:Lactobacillales
Streptomyces cf. griseus strain XylebKG-1 Actinobacteria:Actinobacteridae
Streptosporangium roseum strain DSM 43021 Actinobacteria:Actinobacteridae
Sulfurimonas denitrificans DSM 1251 strain ATCC 33889 Proteobacteria:Epsilonproteobacteria
Thioalkalivibrio nitratireducens strain DSM 14787 Proteobacteria:Gammaproteobacteria
Thiobacillus denitrificans strain ATCC 25259 Proteobacteria:Betaproteobacteria
Treponema azotonutricium strain ZAS-9 Spirochaetes:Spirochaetales
Treponema pedis strain T A4 Spirochaetes:Spirochaetales
Turneriella parva strain DSM 21527 Spirochaetes:Spirochaetales
Vibrio cholerae strain BX 330286 Proteobacteria:Gammaproteobacteria
Wolbachia endosymbiont strain TRS of Brugia malayi Proteobacteria:Alphaproteobacteria
Yersinia pestis D106004 Proteobacteria:Gammaproteobacteria
Bacillus thuringiensis serovar andalousiensis strain BGSC 4AW1 Firmicutes:Bacillales
Ureaplasma urealyticum serovar 5 strain ATCC 27817 Tenericutes:Mollicutes
Bordetella pertussis strain 18323 Proteobacteria:Betaproteobacteria
Comamonas testosteroni strain KF-1 Proteobacteria:Betaproteobacteria
Eikenella corrodens strain ATCC 23834 Proteobacteria:Betaproteobacteria
Janthinobacterium sp. strain Marseille Proteobacteria:Betaproteobacteria
Rhodopirellula baltica SH strain 1 Planctomycetes:Planctomycetacia
Blastopirellula marina strain DSM 3645 Planctomycetes:Planctomycetacia

Sunday, July 14, 2013

DNA Strand Asymmetry: More Surprises

The surprises just keep coming. When you start doing comparative genomics on the desktop (which is so easy with all the great tools at genomevolution.org and elsewhere), it's amazing how quickly you run into things that make you slap yourself on the side of the head and go "Whaaaa????"

If you know anything about DNA (or even if you don't), this one will set you back.

I've written before about Chargaff's second parity rule, which (peculiarly) states that A = T and G = C not just for double-stranded DNA (that's the first parity rule) but for bases in a single strand of DNA. The first parity rule is basic: It's what allows one strand of DNA to be complementary to another. The second parity rule is not so intuitive. Why should the amount of adenine have to equal the amount of thymine (or guanine equal cytosine) in a single strand of DNA? The conventional argument is that nature doesn't play favorites with purines and pyrimidines. There's no reason (in theory) why a single strand of DNA should have an excess of purines over pyrimidines or vice versa, all things being equal.

But it turns out that strand asymmetry vis-a-vis purines and pyrimidines, far from being uncommon, is the rule. (Some call it Szybalski's rule, in fact.) You can prove it to yourself very easily. If you obtain a codon usage chart for a particular organism, then add up the frequencies of occurrence of each base across all the codons, you get the relative abundances of the four bases (A, G, T, C) for the coding regions on which the codon chart was based. Let's take a simple example that requires no calculation: Clostridium botulinum. Just by eyeballing the chart below, you can quickly see that (for C. botulinum) codons rich in the purines A and G are used far more often than codons rich in the pyrimidines T and C. (Note the green-highlighted codons.)


If you do the math, you'll find that in C. botulinum, G and A (combined) outnumber T and C by a factor of 1.41. That's a pretty extreme purine:pyrimidine ratio. (Remember that we're dealing with a single strand of DNA here. Codon frequencies are derived from the so-called "message strand" of DNA in coding regions.)
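The arithmetic behind that ratio can be sketched in a few lines of Python. The codon table is assumed to be a dictionary mapping each codon to its frequency (or raw count). The three-codon table in the example is a toy, not real C. botulinum data, and the function names are hypothetical.

```python
from collections import Counter


def base_counts(codon_freqs):
    """Sum per-base abundances over a codon usage table (codon -> frequency)."""
    totals = Counter()
    for codon, freq in codon_freqs.items():
        # Accept RNA-style tables too by treating U as T.
        for base in codon.upper().replace("U", "T"):
            totals[base] += freq
    return totals


def purine_pyrimidine_ratio(codon_freqs):
    """(A + G) / (C + T) for the message strand implied by the codon table."""
    t = base_counts(codon_freqs)
    return (t["A"] + t["G"]) / (t["C"] + t["T"])


# Toy table: A = 30, G = 10, T = 15, C = 5.
toy = {"AAA": 10, "GGT": 5, "TTC": 5}
print(purine_pyrimidine_ratio(toy))  # 2.0
```

Fed a full 64-codon usage table for an organism, the same function yields that organism's purine:pyrimidine ratio for its coding regions.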

I've done this calculation for 1,373 different bacterial species (don't worry, it's all automated), and the bottom line is, the greater the DNA's A+T content (or, equivalently, the less its G+C content), the greater the purine imbalance. (See this post for a nice graph.)

If you inspect enough codon charts you'll quickly realize that Chargaff's second parity rule never holds true (except now and then by chance). It's a bogus rule, at least in coding regions (DNA that actually gets transcribed in vivo). It may have applicability to pseudogenes or "junk DNA" (but then again, I haven't checked; it may well not apply there either).

If Chargaff's second rule were true, we would expect to find that G = C (and A = T), because that's what the rule says. I went through the codon frequency data for 1,373 different bacterial species and then plotted the ratio of G to C (which Chargaff says should equal 1.0) for each species against the A+T content (which is a kind of phylogenetic signature) for each species. I was shocked by what I found:

Using base abundances derived from codon frequency data, I calculated G/C for 1,373 bacterial species and plotted it against total A+T content. (Each dot represents a genome for a particular organism.) Chargaff's second parity rule predicts a horizontal line at y=1.0. Clearly, that rule doesn't hold. 

I wasn't so much shocked by the fact that Chargaff's rule doesn't hold; I already knew that. What's shocking is that the ratio of G to C goes up as A+T increases, which means G/C is going up even as G+C is going down. (By definition, G+C goes down as A+T goes up.)

Chargaff says G/C should always equal 1.0. In reality, it never does except by chance. What we find is, the less G (or C) the DNA has, the greater the ratio of G to C. To put it differently: At the high-AT end of the phylogenetic scale, cytosine is decreasing faster (much faster) than guanine, as overall G+C content goes down.
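Each dot in the graph boils down to two numbers per genome: the A+T fraction and the G/C ratio, both computed from message-strand base abundances. A minimal sketch, assuming the abundances are already available as a dictionary keyed by base (the function name `at_and_g_over_c` is hypothetical, and the counts in the example are invented):

```python
def at_and_g_over_c(counts):
    """Return (A+T fraction, G/C ratio) from message-strand base abundances."""
    total = sum(counts[base] for base in "ACGT")
    at_fraction = (counts["A"] + counts["T"]) / total
    g_over_c = counts["G"] / counts["C"]
    return at_fraction, g_over_c


# Invented abundances for a high-AT genome; Chargaff's second rule would predict G/C = 1.0.
x, y = at_and_g_over_c({"A": 30, "C": 5, "G": 10, "T": 15})
print(x, y)  # 0.75 2.0
```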

When I first plotted this graph, I used a linear regression to get the line that minimizes the sum of squared errors. That line turned out to be given by G/C = 0.638 + [A+T]. Then I saw that the data looked exponential, not linear. So I refitted the data with a power curve (the red curve shown above) given by

G/C = 1.0 + 0.587*[A+T] + 1.618*[A+T]^2

which fit the data even better (minimum summed error 0.1119 instead of 0.1197). What struck me as strange is that the Golden Ratio (1.618) shows up in the power-curve formula (above), but also, the linear form of the regression has G/C equaling 1.638 when [A+T] goes to 1.0, which is almost the Golden Ratio.
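For anyone who wants to try the same kind of fit on their own data, here is a minimal pure-Python sketch of an ordinary least-squares line plus a summed-squared-error score that can be used to compare candidate curves. The data points in the example are synthetic (generated from the reported line itself), not the 1,373-genome data set, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_error(xs, ys, predict):
    """Sum of squared residuals for any candidate curve `predict`."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))


# Synthetic points lying on G/C = 0.638 + [A+T], for illustration only.
at = [0.3, 0.5, 0.7]
g_over_c = [0.638 + x for x in at]
intercept, slope = fit_line(at, g_over_c)
print(intercept, slope)
print(sum_squared_error(at, g_over_c, lambda x: intercept + slope * x))
```

Passing a different `predict` function (a curved one, say) to `sum_squared_error` gives a like-for-like comparison of how well each candidate tracks the same points.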

In a previous post, I mentioned finding that the ratio A/T tends to approximate the Golden Ratio as A+T approaches 1.0. If this were to hold true, it could mean that A/T and G/C both approach the Golden Ratio as A+T approaches 1.0, which would be weird indeed.

For now, I'm not going to make the claim that the Golden Ratio figures into any of this, because it reeks too much of numerology and Intelligent Design (and I'm a fan of neither). I do think it's mildly interesting that A/T and G/C both approach a similar number as A+T approaches unity.

Comments, as usual, are welcome.