blogorrhea: GC3


Sunday, May 11, 2014

Making Sense of Antisense

One reason genomes are so poorly annotated is that annotation software (of the Glimmer variety) gets easily confused by high-GC-content genome data. When a genome is high in guanine and cytosine, relatively few stop codons are present in alternate reading frames. (Recall that DNA is read in triplets of letters, called codons: AAA, AAC, AAT, AGT, etc. There are 6 possible reading frames for any given segment of DNA, representing 3 forward reading frames and 3 backward frames.) Stop codons (TGA, TAG, TAA) are mostly composed of A and T, not G and C. But also, it so happens that most protein-coding genes follow a certain pattern of codon construction. The first base in a 3-letter codon is usually A or G (about 60% of the time, in most genes, in most organisms). The second base is highly variable in all respects. The third base is usually reflective of overall genome composition: If the genome is high in G and C, the third base of each codon will tend to be G or C. (This happens almost all the time in Streptomyces, for example, where the third base G+C content is 97%.) If the genome is high in A and T, the third codon base will be high in A and T.
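To make the six-reading-frame idea concrete, here is a minimal Python sketch that counts stop codons in each frame of a stretch of DNA. It is not taken from Glimmer or any annotation tool; the function names (`reverse_complement`, `stops_per_frame`) and the frame labels are made up for illustration, and the input is assumed to contain only A, C, G, and T.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def stops_per_frame(seq):
    """Count stop codons in the 3 forward (+1..+3) and 3 reverse (-1..-3) frames."""
    seq = seq.upper()
    counts = {}
    for label, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Walk the strand in non-overlapping triplets starting at `offset`.
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            counts[f"{label}{offset + 1}"] = sum(c in STOP_CODONS for c in codons)
    return counts


print(stops_per_frame("ATGTAAGCC"))
# {'+1': 1, '+2': 0, '+3': 0, '-1': 0, '-2': 0, '-3': 0}
```

Run on a long GC-rich region, a tally like this tends to show few stops in any frame, which is exactly why the alternate frames can pass for open reading frames.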

What's perhaps unexpected is that the same compositional pattern can sometimes work in reverse, on the complementary DNA strand. When you look at a protein's codons and see that 60% use a purine in the first base but only 40% use a purine in the third base, this means that the reverse complement of the codon also has 60% purine content in base one and 40% purine content in base three (base one of the reverse-complement codon is the complement of the original base three, and vice versa). Example: the codon GCT (alanine) has a purine (G) followed by a pyrimidine (C) followed by a pyrimidine (T). The reverse complement codon, AGC (serine), also begins with a purine (A) and ends with a pyrimidine (C). This type of symmetry tends to be a confounding factor for programs like Glimmer that try to distinguish sense from antisense strands, coding from non-coding regions, and normal reading frames from nonsense frames.

Perhaps some real-world data will make this clearer. Below is a plot of AG1 (purine content at base one) versus GC1 (guanine plus cytosine, base one) for all codons of all genes of the soil bacterium Pseudomonas fluorescens Pf0-1. Each point represents one gene's worth of data. For each data point, I simply went through all of that gene's codons and tallied up the A, G, C, and T at each base position, then found the average AG1 and GC1 for the gene in question. I did this for all 5,722 protein-coding genes in the genome. (Don't worry. Scripts do the whole thing in the blink of an eye. It takes less than 10 milliseconds to process one gene's worth of data.) Notice how the points cluster at y=0.6, meaning most genes have an average AG1 (first-base purine) content of around 60%.

AG1 (purine content, base one) versus overall G+C content, for N=5722 protein-coding genes of Pseudomonas fluorescens Pf0-1. Each dot represents statistics for one gene. Notice that the dots tend to cluster at about y=0.6.
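For anyone who wants to reproduce the per-gene tally, here is a minimal Python sketch of the idea. It is not the script used for the plots; `positional_fraction`, `ag1`, and `gc1` are hypothetical names, and the input is assumed to be an in-frame coding sequence of A, C, G, and T (any trailing partial codon is ignored).

```python
def positional_fraction(cds, bases, position):
    """Fraction of codons whose base at `position` (1, 2, or 3) is in `bases`."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # drop a trailing partial codon
    column = cds[position - 1:usable:3]       # every third base from that position
    if not column:
        raise ValueError("sequence is shorter than one codon")
    return sum(base in bases for base in column) / len(column)


def ag1(cds):
    """Purine (A+G) fraction at codon base one."""
    return positional_fraction(cds, "AG", 1)


def gc1(cds):
    """G+C fraction at codon base one."""
    return positional_fraction(cds, "GC", 1)


gene = "ATGGCTAAATGA"   # codons: ATG GCT AAA TGA
print(ag1(gene))        # 0.75
print(gc1(gene))        # 0.25
```

Calling `positional_fraction(gene, "AG", 3)` or `positional_fraction(gene, "GC", 3)` gives the corresponding third-base numbers (AG3 and GC3).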

Now have a look at the graph below, which is the same kind of plot except we're looking at data for the third codon base. Here, the median y-value is only 0.453, meaning that purine content averages around 45% in the third base. That means on the opposite strand, in the same position, there's a purine ~55% of the time.

AG3 (purines, base three) versus G+C for all protein-coding genes (N=5722) in P. fluorescens Pf0-1. The median y-value is 0.453. The long comet tail in the direction of low G+C could be due to "AT drift." It could also partly be due to incorrect annotation of genes as to "sense" versus "antisense" strands.

I ran some numbers and found that in P. fluorescens, 74% of genes have codons that are purine-heavy in base one (AG1 greater than 55%) in the normal reading direction, but most genes (53%) are also purine-heavy in base one when read in the reverse-complement sense. In fact, for every three protein genes that have AG1 greater than 55% and GC3 above 60% in the normal reading frame, there are two genes that meet the same criteria when translated in the reverse-complement frame. This means that for quite a few genes in Pseudomonas, the normal reading frame has similar compositional statistics to the reverse reading frame. (Note that codon base two tends to average 48% purines in the forward direction and 52% in the reverse direction.) To a program like Glimmer, many protein genes look surprisingly similar whether read from the sense strand of DNA or the antisense strand. Distinguishing sense from antisense is not trivial, in other words, although in an upcoming post I'll talk about a way to do it via codon bias.
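The forward-versus-reverse test described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the original analysis code: the helper names are invented, the thresholds are the AG1 > 55% and GC3 > 60% cutoffs quoted in the text, and each gene is assumed to be an in-frame sequence whose length is a multiple of three (so the reverse complement is also in frame).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def fraction(seq, bases, position):
    """Fraction of codons whose base at `position` (1..3) is in `bases`."""
    usable = len(seq) - len(seq) % 3
    column = seq[position - 1:usable:3]
    return sum(base in bases for base in column) / len(column)


def passes(seq, ag1_min=0.55, gc3_min=0.60):
    """True if the sequence is purine-heavy at base one and GC-heavy at base three."""
    return fraction(seq, "AG", 1) > ag1_min and fraction(seq, "GC", 3) > gc3_min


def passes_both_ways(cds):
    """Return (forward result, reverse-complement result) for one gene."""
    cds = cds.upper()
    return passes(cds), passes(reverse_complement(cds))


print(passes_both_ways("GCCGCCGCC"))  # (True, True)
print(passes_both_ways("ACCACCACC"))  # (True, False)
```

Because base one of each reverse-complement codon is the complement of a forward base three, the reverse-strand AG1 is simply one minus the forward AG3, and the reverse-strand GC3 equals the forward GC1. That makes it possible to get the reverse statistics without building the reverse complement at all.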

To get a better idea of how universal this bidirectional codon symmetry might be, I obtained the codon statistics for 109 organisms and calculated the average base-3 composition stats, and came up with the following graph that plots AG3 (purines, base three) against overall genome A+T content:

Codon base three purine content versus genome A+T content for N=109 bacterial species. Every point represents aggregate codon stats for one organism. The size of each circle is proportional to genome size of the organism.

As the circle sizes show, organisms with high genomic A+T content tend to have smaller genomes. Notice that base three tends to be a pyrimidine (AG3 less than 50%), in the normal reading frame, for almost all organisms. (That means the complementary base, which becomes base one of the reverse-complement codon, tends to be a purine.) A few circles appear above y=0.5, but not many, and not by much.

Bottom line: Distinguishing sense from antisense DNA is not a straightforward matter, since many codons have similar composition statistics in forward and backward frames. Breaking the deadlock might require finding a Shine-Dalgarno sequence for one frame but not the other, or it could mean running a homology check (via BLAST) against similar genes in a database, but in that case you have to hope the strand assignment was correct in the database genes (which it often is not).

Incidentally, I know of no a priori reason why the wobble base should accumulate pyrimidines preferentially (even though that's what the data say). In theory, base three is (for most codons) a degenerate position and should be neutral (free to accumulate bases of any type). We know that base one tends to be a purine 60% of the time. The fact that base three is a pyrimidine 54.7% of the time is suspicious (arguably) and tends to imply that some genes are annotated backwards. As we'll see in a future post, the backwards-annotation problem isn't a huge issue in most organisms, but it's not negligible, either.

Tuesday, April 29, 2014

Genes, Codons, and Purines

A universal feature of protein-coding genes is that they tend to use a lot of codons that begin with a purine (A or G). In fact, it's typical for a given gene's codons to use a purine in position one 60% or more of the time. But it's fair to ask: How universal is this trend, exactly? Does the rule apply for organisms with extremely high (or low) genomic G+C content? Does it apply for endosymbionts with greatly reduced genomes? Is it just a "sometimes" rule? Are there important exceptions?

I decided to collect codon statistics for 109 different bacterial species, representing members of all major taxonomic groups, with a wide range of genome sizes and GC percentages. For each organism, I determined the average percent A+G content in codon base one (AG1) across all CDS genes. Then I plotted AG1 against the genomic A+T content for each organism. (A+T is of course just one minus the G+C content.) Here's the graph of AG1 content for all the organisms:

Codon base-one purine content (average for all CDS genes) versus genomic A+T content for N=109 bacterial species. Dot size corresponds to genome size.
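Each point on a graph like this needs just two numbers per organism. The Python sketch below shows one way to get them, assuming the genome and its CDS sequences are already in memory as plain strings. Taking the unweighted mean of per-gene AG1 values is an assumption here (pooling all codons is the other obvious choice), and the function names are made up for illustration.

```python
from statistics import mean


def ag1(cds):
    """Purine (A+G) fraction at codon base one of an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    first_bases = cds[0:usable:3]
    return sum(base in "AG" for base in first_bases) / len(first_bases)


def at_content(genome):
    """Genomic A+T fraction, computed as one minus G+C (ambiguous bases ignored)."""
    genome = genome.upper()
    gc = sum(base in "GC" for base in genome)
    acgt = sum(base in "ACGT" for base in genome)
    return 1 - gc / acgt


def organism_point(genome, cds_list):
    """Return (x, y) = (genomic A+T, mean per-gene AG1) for one organism."""
    return at_content(genome), mean(ag1(cds) for cds in cds_list)


# Toy data only; real input would be a whole genome and thousands of genes.
print(organism_point("ATGCATGC", ["ATGGCT", "GCTTTT"]))  # (0.5, 0.75)
```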

The fun thing about this graph is that each data point is sized according to the genome size of the organism in question (in other words, the area of the dot is proportional to genome size). As you can see, bacteria at the high end of the A+T scale (low G+C) tend to have smaller genomes. But the more important thing to notice is that AG1 is 58% or more for all 109 genomes. This means that the phenomenon of high average purine content in codon base one appears to be universal, at least for the sample group. (Organism names are listed in a table below.)

Of course, within a given genome, genes vary somewhat in terms of the per-gene average AG1, but it's still quite rare to find a protein gene that has average AG1 under 50%. For example, below is a histogram plot of AG1 content for all protein-coding genes of Sorangium cellulosum, a bacterium with genomic GC content of 72% (A+T = 28%).

Per-gene AG1 usage (codon base-one purine content) for all CDS genes of Sorangium cellulosum.

As you can see, very few genes lie to the left of x = 0.5. (Of Sorangium's 10,400 protein genes, only 321 have an average AG1 under 50%. Those could easily be mis-annotated genes or gene fragments.) Most organisms show much the same distribution of average AG1 values across CDS genes.
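To put the Sorangium numbers in proportion: 321 out of 10,400 genes is about 3%. Here is a small Python sketch of how the per-gene AG1 values could be binned and the below-0.5 share counted. The function names and the choice of 20 bins are arbitrary, and the sample values are invented purely to show the mechanics.

```python
from collections import Counter


def share_below(values, cutoff=0.5):
    """Fraction of values that fall strictly below `cutoff`."""
    return sum(v < cutoff for v in values) / len(values)


def bin_counts(values, n_bins=20):
    """Count values per equal-width bin on [0, 1]; keys are bin indices."""
    # A value of exactly 1.0 is placed in the last bin.
    return Counter(min(int(v * n_bins), n_bins - 1) for v in values)


# Invented per-gene AG1 values, for illustration only.
sample = [0.42, 0.58, 0.61, 0.63, 0.71]
print(share_below(sample))          # 0.2
print(sorted(bin_counts(sample).items()))
# [(8, 1), (11, 1), (12, 2), (14, 1)]

# The share quoted in the text for Sorangium cellulosum:
print(round(321 / 10400, 3))        # 0.031
```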

Gene annotation programs could probably benefit from using a check of AG1 to verify that a putative gene is in the correct reading frame. GC3 content is often used in this way, but AG1 is actually a much more discriminating test, especially with low-GC genomes (where the "wobble base" GC percentage is not particularly helpful).
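As a rough illustration of what such a check might look like (this is a sketch of the suggestion, not something Glimmer or any other annotator is known to do), the Python below scores the three forward frames of a candidate region by AG1 and reports the highest-scoring one. The names `ag1_by_frame` and `best_frame` are hypothetical, and the region is assumed to be at least five bases long.

```python
def ag1(seq):
    """Purine (A+G) fraction at the first base of each complete codon in `seq`."""
    usable = len(seq) - len(seq) % 3
    first_bases = seq[0:usable:3]
    return sum(base in "AG" for base in first_bases) / len(first_bases)


def ag1_by_frame(region):
    """AG1 for each of the three forward frames, keyed by offset 0, 1, 2."""
    region = region.upper()
    return {offset: ag1(region[offset:]) for offset in range(3)}


def best_frame(region):
    """Offset (0, 1, or 2) of the frame with the highest AG1."""
    scores = ag1_by_frame(region)
    return max(scores, key=scores.get)


print(ag1_by_frame("GCTGCTGCTGCT"))  # {0: 1.0, 1: 0.0, 2: 0.0}
print(best_frame("TGCTGCTGCTGC"))    # 1
```

A real check would also have to score the three reverse-complement frames, and, as the Pseudomonas numbers show, AG1 alone will not always separate sense from antisense.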

Listed below are the 109 organisms (and their taxonomic categorizations) used in this investigation.

Organism	Taxon
Acidaminococcus fermentans strain DSM 20731	Firmicutes:Clostridia
Acidovorax avenae subsp. citrulli strain AAC00-1	Proteobacteria:Betaproteobacteria
Aerococcus urinae strain ACS-120-V-Col10a	Firmicutes:Lactobacillales
Aeromonas hydrophila strain ML09-119	Proteobacteria:Gammaproteobacteria
Aggregatibacter actinomycetemcomitans D11S-1	Proteobacteria:Gammaproteobacteria
Agrobacterium radiobacter strain K84	Proteobacteria:Alphaproteobacteria
Anaerobaculum mobile strain DSM 13181	Synergistetes:Synergistia
Anaerocellum thermophilum strain DSM 6725	Firmicutes:Clostridia
Anaerolinea thermophila strain UNI-1	Chloroflexi:Anaerolineae
Anaplasma marginale strain Florida	Proteobacteria:Alphaproteobacteria
Arcobacter butzleri ED-1	Proteobacteria:Epsilonproteobacteria
Atopobium vaginae strain DSM 15829	Actinobacteria:Coriobacteridae
Azospirillum brasilense strain Sp245	Proteobacteria:Alphaproteobacteria
Bacillus amyloliquefaciens strain Y2	Firmicutes:Bacilli
Bacillus anthracis strain CDC 684	Firmicutes:Bacillales
Bacillus subtilis BEST7613 strain PCC 6803	Firmicutes:Bacilli
Bacteroides dorei strain 5_1_36/D4	Bacteroidetes:Bacteroidia
Bartonella quintana strain RM-11	Proteobacteria:Alphaproteobacteria
Blastococcus saxobsidens strain DD2	Actinobacteria:Actinobacteridae
Borrelia miyamotoi strain LB-2001	Spirochaetes:Spirochaetales
Brachybacterium faecium strain DSM 4810	Actinobacteria:Actinobacteridae
Brucella ovis strain ATCC 25840	Proteobacteria:Alphaproteobacteria
Buchnera aphidicola (Acyrthosiphon pisum) strain 5A	Proteobacteria:Gammaproteobacteria
Burkholderia pseudomallei strain 1710b	Proteobacteria:Betaproteobacteria
Caldicellulosiruptor lactoaceticus strain 6A	Firmicutes:Clostridia
Calditerrivibrio nitroreducens strain DSM 19672	Deferribacteres:Deferribacterales
Campylobacter concisus strain 13826	Proteobacteria:Epsilonproteobacteria
Candidatus Cloacamonas acidaminovorans	candidate division WWE1:Candidatus Cloacamonas
Candidatus Methylomirabilis oxyfera	candidate division NC10:Candidatus Methylomirabilis
Candidatus Pelagibacter ubique strain HTCC1062	Proteobacteria:Alphaproteobacteria
Carboxydothermus hydrogenoformans strain Z-2901	Firmicutes:Clostridia
Chlamydia trachomatis strain L2/434/Bu	Chlamydiae:Chlamydiales
Clostridium botulinum A strain Hall	Firmicutes:Clostridia
Coprobacillus sp. strain 8_2_54BFAA	Firmicutes:Erysipelotrichia
Coprococcus catus strain GD/7	Firmicutes:Clostridia
Cycloclasticus zancles strain 7-ME	Proteobacteria:Gammaproteobacteria
Deinococcus radiodurans strain R1	Deinococcus-Thermus:Deinococci
Desulfococcus oleovorans strain Hxd3	Proteobacteria:Deltaproteobacteria
Ehrlichia canis strain Jake	Proteobacteria:Alphaproteobacteria
Enterobacter cloacae strain SCF1	Proteobacteria:Gammaproteobacteria
Erwinia amylovora strain ATCC 49946	Proteobacteria:Gammaproteobacteria
Escherichia coli B strain REL606	Proteobacteria:Gammaproteobacteria
Geobacillus kaustophilus strain HTA426	Firmicutes:Bacillales
Geobacillus thermoleovorans strain CCB_US3_UF5	Firmicutes:Bacillales
Geobacter metallireducens strain GS-15	Proteobacteria:Deltaproteobacteria
Geobacter sulfurreducens strain KN400	Proteobacteria:Deltaproteobacteria
Geobacter sulfurreducens strain PCA	Proteobacteria:Deltaproteobacteria
Geobacter uraniireducens strain Rf4	Proteobacteria:Deltaproteobacteria
Geodermatophilus obscurus strain DSM 43160	Actinobacteria:Actinobacteridae
Gordonia bronchialis strain DSM 43247	Actinobacteria:Actinobacteridae
Haemophilus ducreyi strain 35000HP	Proteobacteria:Gammaproteobacteria
Halogeometricum borinquense DSM 11551	Euryarchaeota:Halobacteria
Helicobacter pylori (Helicobacter pylori SAfr7) strain SouthAfrica7	Proteobacteria:Epsilonproteobacteria
Klebsiella oxytoca strain 10-5243	Proteobacteria:Gammaproteobacteria
Kribbella flavida strain DSM 17836	Actinobacteria:Actinobacteridae
Ktedonobacter racemifer DSM 44963	Chloroflexi:Ktedonobacteria
Lactobacillus acidophilus strain 30SC	Firmicutes:Lactobacillales
Lactobacillus reuteri strain MM4-1A	Firmicutes:Lactobacillales
Lactococcus lactis subsp. cremoris strain A76	Firmicutes:Bacilli
Leptolyngbya sp. PCC 7376	Cyanobacteria:Oscillatoriophycideae
Leptonema illini strain DSM 21528	Spirochaetes:Spirochaetales
Leptospira biflexa serovar Patoc strain Ames; Patoc 1	Spirochaetes:Spirochaetales
Leuconostoc gasicomitatum LMG 18811 strain type LMG 18811	Firmicutes:Lactobacillales
Mesorhizobium australicum strain WSM2073	Proteobacteria:Alphaproteobacteria
Mesorhizobium ciceri biovar biserrulae strain WSM1271	Proteobacteria:Alphaproteobacteria
Methylobacillus flagellatus strain KT	Proteobacteria:Betaproteobacteria
Methylophaga sp. strain JAM7	Proteobacteria:Gammaproteobacteria
Mycobacterium tuberculosis = ATCC 35801 strain ATCC35801; Erdman	Actinobacteria:Actinobacteridae
Mycoplasma gallisepticum strain F	Tenericutes:Mollicutes
Neisseria gonorrhoeae strain NCCP11945	Proteobacteria:Betaproteobacteria
Nocardia brasiliensis ATCC 700358 strain HUJEG-1	Actinobacteria:Actinobacteridae
Nocardia cyriacigeorgica strain GUH-2	Actinobacteria:Actinobacteridae
Nostoc sp. PCC 7120 (Anabaena sp. PCC 7120) strain PCC7120	Cyanobacteria:Nostocales
Novosphingobium aromaticivorans strain DSM 12444	Proteobacteria:Alphaproteobacteria
Oceanobacillus kimchii strain X50	Firmicutes:Bacilli
Orientia tsutsugamushi strain Ikeda	Proteobacteria:Alphaproteobacteria
Paenibacillus polymyxa strain M1	Firmicutes:Bacilli
Polynucleobacter necessarius strain STIR1	Proteobacteria:Betaproteobacteria
Propionibacterium acnes TypeIA2 strain P.acn33	Actinobacteria:Actinobacteridae
Proteus mirabilis strain HI4320	Proteobacteria:Gammaproteobacteria
Pseudomonas fluorescens strain Pf0-1	Proteobacteria:Gammaproteobacteria
Pseudonocardia dioxanivorans strain CB1190	Actinobacteria:Actinobacteridae
Ralstonia eutropha strain H16	Proteobacteria:Betaproteobacteria
Rhizobium tropici strain CIAT 899	Proteobacteria:Alphaproteobacteria
Rhodobacter sphaeroides ATCC 17029	Proteobacteria:Alphaproteobacteria
Shigella boydii strain Sb227	Proteobacteria:Gammaproteobacteria
Slackia heliotrinireducens strain DSM 20476	Actinobacteria:Coriobacteridae
Sorangium cellulosum strain So0157-2	Proteobacteria:Deltaproteobacteria
Staphylococcus aureus strain 04-02981	Firmicutes:Bacillales
Streptococcus agalactiae strain 2603V/R	Firmicutes:Lactobacillales
Streptomyces cf. griseus strain XylebKG-1	Actinobacteria:Actinobacteridae
Streptosporangium roseum strain DSM 43021	Actinobacteria:Actinobacteridae
Sulfurimonas denitrificans DSM 1251 strain ATCC 33889	Proteobacteria:Epsilonproteobacteria
Thioalkalivibrio nitratireducens strain DSM 14787	Proteobacteria:Gammaproteobacteria
Thiobacillus denitrificans strain ATCC 25259	Proteobacteria:Betaproteobacteria
Treponema azotonutricium strain ZAS-9	Spirochaetes:Spirochaetales
Treponema pedis strain T A4	Spirochaetes:Spirochaetales
Turneriella parva strain DSM 21527	Spirochaetes:Spirochaetales
Vibrio cholerae strain BX 330286	Proteobacteria:Gammaproteobacteria
Wolbachia endosymbiont strain TRS of Brugia malayi	Proteobacteria:Alphaproteobacteria
Yersinia pestis D106004	Proteobacteria:Gammaproteobacteria
Bacillus thuringiensis serovar andalousiensis strain BGSC 4AW1	Firmicutes:Bacillales
Ureaplasma urealyticum serovar 5 strain ATCC 27817	Tenericutes:Mollicutes
Bordetella pertussis strain 18323	Proteobacteria:Betaproteobacteria
Comamonas testosteroni strain KF-1	Proteobacteria:Betaproteobacteria
Eikenella corrodens strain ATCC 23834	Proteobacteria:Betaproteobacteria
Janthinobacterium sp. strain Marseille	Proteobacteria:Betaproteobacteria
Rhodopirellula baltica SH strain 1	Planctomycetes:Planctomycetacia
Blastopirellula marina strain DSM 3645	Planctomycetes:Planctomycetacia