One reason genomes are so poorly annotated is that annotation software (of the Glimmer variety) gets easily confused by high-GC-content genome data. When a genome is high in guanine and cytosine, relatively few stop codons are present in alternate reading frames. (Recall that DNA is read in triplets of letters, called codons: AAA, AAC, AAT, AGT, etc. There are 6 possible reading frames for any given segment of DNA, representing 3 forward reading frames and 3 backward frames.) Stop codons (TGA, TAG, TAA) are mostly composed of A and T, not G and C. But also, it so happens that most protein-coding genes follow a certain pattern of codon construction. The first base in a 3-letter codon is usually A or G (about 60% of the time, in most genes, in most organisms). The second base is highly variable in all respects. The third base is usually reflective of overall genome composition: If the genome is high in G and C, the third base of each codon will tend to be G or C. (This happens almost all the time in Streptomyces, for example, where the third base G+C content is 97%.) If the genome is high in A and T, the third codon base will be high in A and T.
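To make the reading-frame point concrete, here is a minimal Python sketch that counts stop codons in all six frames of a DNA string. The function names and the frame labels `+1` through `-3` are illustrative choices for this sketch, not part of Glimmer or any annotation package, and the code assumes the sequence contains only A, C, G, and T.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def codons(seq, offset=0):
    """Split seq into complete triplets, starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def stop_counts_six_frames(seq):
    """Count stop codons in the 3 forward and 3 reverse-complement frames."""
    seq = seq.upper()
    counts = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            label = f"{strand}{offset + 1}"
            counts[label] = sum(c in STOP_CODONS for c in codons(strand_seq, offset))
    return counts


if __name__ == "__main__":
    print(stop_counts_six_frames("ATGAAATAG"))
    print(stop_counts_six_frames("GCCGCGGGCCCG"))
```

A stretch made only of G and C, such as `GCCGCGGGCCCG`, returns zero for every frame, which is the situation that leaves a gene finder with few stop codons to rule frames out.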
What's perhaps unexpected is that the same compositional pattern can sometimes work in reverse, on the complementary DNA strand. When you look at a protein's codons and see that 60% use a purine in the first base but only 40% use a purine in the third base, this means that the reverse complement of the codon also has 60% purine content in base one and 40% purine content in base three. Example: the codon GCT (alanine) has a purine (G) followed by a pyrimidine (C) followed by a pyrimidine (T). The reverse complement codon, AGC (serine), also begins with a purine (A) and ends with a pyrimidine (C). This type of symmetry tends to be a confounding factor for programs like Glimmer that try to distinguish sense from antisense strands, coding from non-coding regions, and normal reading frames from nonsense frames.
Perhaps some real-world data will make this clearer. Below is a plot of AG1 (purine content at base one) versus GC1 (guanine plus cytosine, base one) for all codons of all genes of the soil bacterium Pseudomonas fluorescens Pf0-1. Each point represents one gene's worth of data. For each data point, I simply went through all of that gene's codons and tallied up the A, G, C, and T at each base position, then found the average AG1 and GC1 for the gene in question. I did this for all 5,722 protein-coding genes in the genome. (Don't worry. Scripts do the whole thing in the blink of an eye. It takes less than 10 milliseconds to process one gene's worth of data.) Notice how the points cluster at y=0.6, meaning most genes have an average AG1 (first-base purine) content of around 60%.
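The per-gene tally described above can be sketched in a few lines of Python. This is not the script used for the plots; it is a minimal illustration with hypothetical function names, and it assumes each gene arrives as a coding sequence (CDS) string that starts in frame. Any trailing bases that do not form a full codon are ignored.

```python
def position_fraction(cds, bases, position):
    """Fraction of complete codons whose base at `position` (0-2) is in `bases`."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    if n_codons == 0:
        raise ValueError("sequence is shorter than one codon")
    hits = sum(cds[3 * i + position] in bases for i in range(n_codons))
    return hits / n_codons


def gene_point(cds):
    """Return (GC1, AG1) for one gene: one point on the scatter plot."""
    gc1 = position_fraction(cds, "GC", 0)
    ag1 = position_fraction(cds, "AG", 0)
    return gc1, ag1


if __name__ == "__main__":
    # Toy sequence for illustration: codons ATG GCT AAA CGT TAA.
    print(gene_point("ATGGCTAAACGTTAA"))
```

Calling `gene_point` once per CDS and collecting the pairs gives the x (GC1) and y (AG1) values for a plot of this kind. Passing `position=2` to `position_fraction` gives the base-three statistics instead.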
Now have a look at the graph below, which is the same kind of plot except we're looking at data for the third codon base. Here, the median y-value is only 0.453, meaning that purine content averages around 45% in the third base. That means on the opposite strand, in the same position, there's a purine ~55% of the time.
I ran some numbers and found that in P. fluorescens, 74% of genes have codons that are purine-heavy in base one (AG1 greater than 55%) in the normal reading direction, but most genes (53%) are also purine-heavy in base one when read in the reverse-complement sense. In fact, for every three protein genes that have AG1 greater than 55% and GC3 above 60% in the normal reading frame, there are two genes that meet the same criteria when translated in the reverse-complement frame. This means that for quite a few genes in Pseudomonas, the normal reading frame has similar compositional statistics to the reverse reading frame. (Note that codon base two tends to average 48% purines in the forward direction and 52% in the reverse direction.) To a program like Glimmer, many protein genes look surprisingly similar whether read from the sense strand of DNA or the antisense strand. Distinguishing sense from antisense is not trivial, in other words, although in an upcoming post I'll talk about a way to do it via codon bias.
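A forward-versus-reverse tally of this kind might look like the following Python sketch. The thresholds (AG1 above 55%, GC3 above 60%) come from the text; the function names and the three toy sequences are assumptions made for illustration, and the sketch assumes every CDS has a length that is a multiple of three, so that its reverse complement stays in frame.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def position_fraction(cds, bases, position):
    """Fraction of complete codons whose base at `position` (0-2) is in `bases`."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    hits = sum(cds[3 * i + position] in bases for i in range(n_codons))
    return hits / n_codons


def meets_criteria(cds, ag1_min=0.55, gc3_min=0.60):
    """True if AG1 and GC3 both exceed the thresholds used in the text."""
    return (position_fraction(cds, "AG", 0) > ag1_min
            and position_fraction(cds, "GC", 2) > gc3_min)


def tally_forward_and_reverse(genes):
    """Count genes that pass as annotated, and genes that pass when flipped."""
    forward = sum(meets_criteria(g) for g in genes)
    reverse = sum(meets_criteria(reverse_complement(g)) for g in genes)
    return forward, reverse


if __name__ == "__main__":
    toy_genes = ["GCGACCGGC", "ATGAAAGAT", "GCGAAGGAG"]  # not real genes
    print(tally_forward_and_reverse(toy_genes))
```

A gene whose third codon base is mostly C or T passes in both directions, because those pyrimidines become purines at base one of the flipped codons. A gene whose third base is mostly G passes only in the forward direction.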
To get a better idea of how universal this bidirectional codon symmetry might be, I obtained the codon statistics for 109 organisms and calculated the average base-3 composition stats, and came up with the following graph that plots AG3 (purines, base three) against overall genome A+T content:
The size of each circle is proportional to the genome size of the organism in question. (As you can see, organisms with high genomic A+T content often tend to have smaller genomes.) Notice that base three tends to be a pyrimidine (AG3 less than 50%), in the normal reading frame, for almost all organisms. (That means the complementary base, which is base one of the codon in the reverse reading frame, tends to be a purine.) A few circles appear above y=0.5, but not many, and not by much.
Bottom line: Distinguishing sense from antisense DNA is not a straightforward matter, since many codons have similar composition statistics in forward and backward frames. Breaking the deadlock might require finding a Shine-Dalgarno sequence for one frame but not the other, or it could mean running a homology check (via BLAST) against similar genes in a database, but in that case you have to hope the strand assignment was correct in the database genes (which it often is not).
Incidentally, I know of no a priori reason why the wobble base should accumulate pyrimidines preferentially (even though that's what the data say). In theory, base three is (for most codons) a degenerate position and should be neutral (free to accumulate bases of any type). We know that base one tends to be a purine 60% of the time. The fact that base three is a pyrimidine 54.7% of the time is suspicious (arguably) and tends to imply that some genes are annotated backwards. As we'll see in a future post, the backwards-annotation problem isn't a huge issue in most organisms, but it's not negligible, either.
Sunday, May 11, 2014
Tuesday, April 29, 2014
Genes, Codons, and Purines
A universal feature of protein-coding genes is that they tend to use a lot of codons that begin with a purine (A or G). In fact, it's typical for a given gene's codons to use a purine in position one 60% or more of the time. But it's fair to ask: How universal is this trend, exactly? Does the rule apply for organisms with extremely high (or low) genomic G+C content? Does it apply for endosymbionts with greatly reduced genomes? Is it just a "sometimes" rule? Are there important exceptions?
I decided to collect codon statistics for 109 different bacterial species, representing members of all major taxonomic groups, with a wide range of genome sizes and GC percentages. For each organism, I determined the average percent A+G content in codon base one (AG1) across all CDS genes. Then I plotted AG1 against the genomic A+T content for each organism. (A+T is of course just one minus the G+C content.) Here's the graph of AG1 content for all the organisms:
Figure: Codon base-one purine content (average for all CDS genes) versus genomic A+T content for N=109 bacterial species. Dot size corresponds to genome size.
The fun thing about this graph is that each data point is sized according to the genome size of the organism in question (in other words, the area of the dot is proportional to genome size). As you can see, bacteria at the high end of the A+T scale (low G+C) tend to have smaller genomes. But the more important thing to notice is that AG1 is 58% or more for all 109 genomes. This means that the phenomenon of high average purine content in codon base one appears to be universal, at least for the sample group. (Organism names are listed in a table below.)
Of course, within a given genome, genes vary somewhat in terms of the per-gene average AG1, but it's still quite rare to find a protein gene that has average AG1 under 50%. For example, below is a histogram plot of AG1 content for all protein-coding genes of Sorangium cellulosum, a bacterium with genomic GC content of 72% (A+T = 28%).
Figure: Per-gene AG1 usage (codon base-one purine content) for all CDS genes of Sorangium cellulosum.
As you can see, very few genes lie to the left of x = 0.5. (Of Sorangium's 10,400 protein genes, only 321 have an average AG1 under 50%. Those could easily be mis-annotated genes or gene fragments.) Most organisms show much the same distribution of average AG1 values across CDS genes.
Gene annotation programs could probably benefit from using a check of AG1 to verify that a putative gene is in the correct reading frame. GC3 content is often used in this way, but AG1 is actually a much more discriminating test, especially with low-GC genomes (where the "wobble base" GC percentage is not particularly helpful).
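As a rough idea of what such a check could look like, here is a minimal Python sketch. It is not taken from any annotation program: the function names, the dictionary input format, and the 50% cutoff (borrowed from the Sorangium histogram discussion above) are all assumptions. It reports the first-position purine fraction for each of the three frame offsets and flags genes whose annotated frame has AG1 under the cutoff.

```python
def ag1_by_offset(seq):
    """Purine fraction at codon base one for frame offsets 0, 1, and 2.

    Only complete codons are counted in each frame.
    """
    seq = seq.upper()
    fractions = []
    for offset in range(3):
        first_bases = [seq[i] for i in range(offset, len(seq) - 2, 3)]
        purines = sum(base in "AG" for base in first_bases)
        fractions.append(purines / len(first_bases) if first_bases else 0.0)
    return fractions


def flag_low_ag1(genes, threshold=0.50):
    """Return sorted names of genes whose annotated frame has AG1 below threshold.

    `genes` maps a gene name to its CDS string, assumed to start in frame.
    """
    return sorted(name for name, cds in genes.items()
                  if ag1_by_offset(cds)[0] < threshold)


if __name__ == "__main__":
    toy = {"geneA": "ATGGCTAAAGGT", "geneB": "TTTCCCTGTCAT"}  # not real genes
    print(ag1_by_offset(toy["geneA"]))
    print(flag_low_ag1(toy))
```

A flagged gene is only a candidate for a second look. A low AG1 does not by itself show that the frame or strand is wrong, so a real pipeline would want to combine this with other evidence.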
Listed below are the 109 organisms (and their taxonomic categorizations) used in this investigation.
Organism | Taxon |
Acidaminococcus fermentans strain DSM 20731 | Firmicutes:Clostridia |
Acidovorax avenae subsp. citrulli strain AAC00-1 | Proteobacteria:Betaproteobacteria |
Aerococcus urinae strain ACS-120-V-Col10a | Firmicutes:Lactobacillales |
Aeromonas hydrophila strain ML09-119 | Proteobacteria:Gammaproteobacteria |
Aggregatibacter actinomycetemcomitans D11S-1 | Proteobacteria:Gammaproteobacteria |
Agrobacterium radiobacter strain K84 | Proteobacteria:Alphaproteobacteria |
Anaerobaculum mobile strain DSM 13181 | Synergistetes:Synergistia |
Anaerocellum thermophilum strain DSM 6725 | Firmicutes:Clostridia |
Anaerolinea thermophila strain UNI-1 | Chloroflexi:Anaerolineae |
Anaplasma marginale strain Florida | Proteobacteria:Alphaproteobacteria |
Arcobacter butzleri ED-1 | Proteobacteria:Epsilonproteobacteria |
Atopobium vaginae strain DSM 15829 | Actinobacteria:Coriobacteridae |
Azospirillum brasilense strain Sp245 | Proteobacteria:Alphaproteobacteria |
Bacillus amyloliquefaciens strain Y2 | Firmicutes:Bacilli |
Bacillus anthracis strain CDC 684 | Firmicutes:Bacillales |
Bacillus subtilis BEST7613 strain PCC 6803 | Firmicutes:Bacilli |
Bacteroides dorei strain 5_1_36/D4 | Bacteroidetes:Bacteroidia |
Bartonella quintana strain RM-11 | Proteobacteria:Alphaproteobacteria |
Blastococcus saxobsidens strain DD2 | Actinobacteria:Actinobacteridae |
Borrelia miyamotoi strain LB-2001 | Spirochaetes:Spirochaetales |
Brachybacterium faecium strain DSM 4810 | Actinobacteria:Actinobacteridae |
Brucella ovis strain ATCC 25840 | Proteobacteria:Alphaproteobacteria |
Buchnera aphidicola (Acyrthosiphon pisum) strain 5A | Proteobacteria:Gammaproteobacteria |
Burkholderia pseudomallei strain 1710b | Proteobacteria:Betaproteobacteria |
Caldicellulosiruptor lactoaceticus strain 6A | Firmicutes:Clostridia |
Calditerrivibrio nitroreducens strain DSM 19672 | Deferribacteres:Deferribacterales |
Campylobacter concisus strain 13826 | Proteobacteria:Epsilonproteobacteria |
Candidatus Cloacamonas acidaminovorans | candidate division WWE1:Candidatus Cloacamonas |
Candidatus Methylomirabilis oxyfera | candidate division NC10:Candidatus Methylomirabilis |
Candidatus Pelagibacter ubique strain HTCC1062 | Proteobacteria:Alphaproteobacteria |
Carboxydothermus hydrogenoformans strain Z-2901 | Firmicutes:Clostridia |
Chlamydia trachomatis strain L2/434/Bu | Chlamydiae:Chlamydiales |
Clostridium botulinum A strain Hall | Firmicutes:Clostridia |
Coprobacillus sp. strain 8_2_54BFAA | Firmicutes:Erysipelotrichia |
Coprococcus catus strain GD/7 | Firmicutes:Clostridia |
Cycloclasticus zancles strain 7-ME | Proteobacteria:Gammaproteobacteria |
Deinococcus radiodurans strain R1 | Deinococcus-Thermus:Deinococci |
Desulfococcus oleovorans strain Hxd3 | Proteobacteria:Deltaproteobacteria |
Ehrlichia canis strain Jake | Proteobacteria:Alphaproteobacteria |
Enterobacter cloacae strain SCF1 | Proteobacteria:Gammaproteobacteria |
Erwinia amylovora strain ATCC 49946 | Proteobacteria:Gammaproteobacteria |
Escherichia coli B strain REL606 | Proteobacteria:Gammaproteobacteria |
Geobacillus kaustophilus strain HTA426 | Firmicutes:Bacillales |
Geobacillus thermoleovorans strain CCB_US3_UF5 | Firmicutes:Bacillales |
Geobacter metallireducens strain GS-15 | Proteobacteria:Deltaproteobacteria |
Geobacter sulfurreducens strain KN400 | Proteobacteria:Deltaproteobacteria |
Geobacter sulfurreducens strain PCA | Proteobacteria:Deltaproteobacteria |
Geobacter uraniireducens strain Rf4 | Proteobacteria:Deltaproteobacteria |
Geodermatophilus obscurus strain DSM 43160 | Actinobacteria:Actinobacteridae |
Gordonia bronchialis strain DSM 43247 | Actinobacteria:Actinobacteridae |
Haemophilus ducreyi strain 35000HP | Proteobacteria:Gammaproteobacteria |
Halogeometricum borinquense DSM 11551 | Euryarchaeota:Halobacteria |
Helicobacter pylori (Helicobacter pylori SAfr7) strain SouthAfrica7 | Proteobacteria:Epsilonproteobacteria |
Klebsiella oxytoca strain 10-5243 | Proteobacteria:Gammaproteobacteria |
Kribbella flavida strain DSM 17836 | Actinobacteria:Actinobacteridae |
Ktedonobacter racemifer DSM 44963 | Chloroflexi:Ktedonobacteria |
Lactobacillus acidophilus strain 30SC | Firmicutes:Lactobacillales |
Lactobacillus reuteri strain MM4-1A | Firmicutes:Lactobacillales |
Lactococcus lactis subsp. cremoris strain A76 | Firmicutes:Bacilli |
Leptolyngbya sp. PCC 7376 | Cyanobacteria:Oscillatoriophycideae |
Leptonema illini strain DSM 21528 | Spirochaetes:Spirochaetales |
Leptospira biflexa serovar Patoc strain Ames; Patoc 1 | Spirochaetes:Spirochaetales |
Leuconostoc gasicomitatum LMG 18811 strain type LMG 18811 | Firmicutes:Lactobacillales |
Mesorhizobium australicum strain WSM2073 | Proteobacteria:Alphaproteobacteria |
Mesorhizobium ciceri biovar biserrulae strain WSM1271 | Proteobacteria:Alphaproteobacteria |
Methylobacillus flagellatus strain KT | Proteobacteria:Betaproteobacteria |
Methylophaga sp. strain JAM7 | Proteobacteria:Gammaproteobacteria |
Mycobacterium tuberculosis = ATCC 35801 strain ATCC35801; Erdman | Actinobacteria:Actinobacteridae |
Mycoplasma gallisepticum strain F | Tenericutes:Mollicutes |
Neisseria gonorrhoeae strain NCCP11945 | Proteobacteria:Betaproteobacteria |
Nocardia brasiliensis ATCC 700358 strain HUJEG-1 | Actinobacteria:Actinobacteridae |
Nocardia cyriacigeorgica strain GUH-2 | Actinobacteria:Actinobacteridae |
Nostoc sp. PCC 7120 (Anabaena sp. PCC 7120) strain PCC7120 | Cyanobacteria:Nostocales |
Novosphingobium aromaticivorans strain DSM 12444 | Proteobacteria:Alphaproteobacteria |
Oceanobacillus kimchii strain X50 | Firmicutes:Bacilli |
Orientia tsutsugamushi strain Ikeda | Proteobacteria:Alphaproteobacteria |
Paenibacillus polymyxa strain M1 | Firmicutes:Bacilli |
Polynucleobacter necessarius strain STIR1 | Proteobacteria:Betaproteobacteria |
Propionibacterium acnes TypeIA2 strain P.acn33 | Actinobacteria:Actinobacteridae |
Proteus mirabilis strain HI4320 | Proteobacteria:Gammaproteobacteria |
Pseudomonas fluorescens strain Pf0-1 | Proteobacteria:Gammaproteobacteria |
Pseudonocardia dioxanivorans strain CB1190 | Actinobacteria:Actinobacteridae |
Ralstonia eutropha strain H16 | Proteobacteria:Betaproteobacteria |
Rhizobium tropici strain CIAT 899 | Proteobacteria:Alphaproteobacteria |
Rhodobacter sphaeroides ATCC 17029 | Proteobacteria:Alphaproteobacteria |
Shigella boydii strain Sb227 | Proteobacteria:Gammaproteobacteria |
Slackia heliotrinireducens strain DSM 20476 | Actinobacteria:Coriobacteridae |
Sorangium cellulosum strain So0157-2 | Proteobacteria:Deltaproteobacteria |
Staphylococcus aureus strain 04-02981 | Firmicutes:Bacillales |
Streptococcus agalactiae strain 2603V/R | Firmicutes:Lactobacillales |
Streptomyces cf. griseus strain XylebKG-1 | Actinobacteria:Actinobacteridae |
Streptosporangium roseum strain DSM 43021 | Actinobacteria:Actinobacteridae |
Sulfurimonas denitrificans DSM 1251 strain ATCC 33889 | Proteobacteria:Epsilonproteobacteria |
Thioalkalivibrio nitratireducens strain DSM 14787 | Proteobacteria:Gammaproteobacteria |
Thiobacillus denitrificans strain ATCC 25259 | Proteobacteria:Betaproteobacteria |
Treponema azotonutricium strain ZAS-9 | Spirochaetes:Spirochaetales |
Treponema pedis strain T A4 | Spirochaetes:Spirochaetales |
Turneriella parva strain DSM 21527 | Spirochaetes:Spirochaetales |
Vibrio cholerae strain BX 330286 | Proteobacteria:Gammaproteobacteria |
Wolbachia endosymbiont strain TRS of Brugia malayi | Proteobacteria:Alphaproteobacteria |
Yersinia pestis D106004 | Proteobacteria:Gammaproteobacteria |
Bacillus thuringiensis serovar andalousiensis strain BGSC 4AW1 | Firmicutes:Bacillales |
Ureaplasma urealyticum serovar 5 strain ATCC 27817 | Tenericutes:Mollicutes |
Bordetella pertussis strain 18323 | Proteobacteria:Betaproteobacteria |
Comamonas testosteroni strain KF-1 | Proteobacteria:Betaproteobacteria |
Eikenella corrodens strain ATCC 23834 | Proteobacteria:Betaproteobacteria |
Janthinobacterium sp. strain Marseille | Proteobacteria:Betaproteobacteria |
Rhodopirellula baltica SH strain 1 | Planctomycetes:Planctomycetacia |
Blastopirellula marina strain DSM 3645 | Planctomycetes:Planctomycetacia |
Labels: AG1, bioinformatics, codons, desktop science, DNA, GC3, genome size, purine ratio, Sorangium