Showing posts with label antisense. Show all posts
Showing posts with label antisense. Show all posts

Thursday, May 22, 2014

Which Direction Does the Gene Point?

A maddening problem in genome annotation is determining the "sense" strand for a gene, especially when the gene is short and/or the genome has a high GC content (and thus contains few or no stop codons in reverse translation). To convince yourself this is a very real and serious problem, all you have to do is browse a few genomes to see the ridiculously high number of "hypothetical proteins" (over 40% in some genomes), bogus overlaps, genes that score BLAST hits in reverse-complement mode but not frame zero, and other artifacts that are a direct result of the aforesaid problem.

I've presented examples of this problem before, but just so there's no confusion, I want to show you a particularly maddening example so you can see what I'm talking about. (Then I'll suggest a solution.) The following graphic shows a region of E. coli UTI89 in which several genes are shown as overlapping (that is to say, existing on opposite strands of DNA in the same coverage area). Small overlaps sometimes happen between genes, but whole genes rarely, if ever, overlap, and never in clusters. The situation shown below is bogus, but you see it all the time in public genomes. In fact, some of the genes shown below also show up as overlaps in Mycobacterium abscessus M93 (see gene OUW_18941), Citrobacter koseri strain ATCC BAA-895 (gene CKO_00072), and quite a few others. Glimmer has choked on this exact situation many times, in many genomes.  

A region with overlapping genes in the genome of E. coli UTI89.

The big gene on the top strand, middle, is UTI89_C4288 (DNA sequence here). It's annotated as (what else?) a "hypothetical protein." The M. abscessus version of the gene (here) is marked as a "cellobiose phosphorylase," and you can find many BLAST hits (the Rothia version gives an E-value of 6.0×10-101) for similar "cellobiose phosphorylase" genes in other organisms at UniProt.org and elsewhere. Of course, they're all bogus and represent Glimmer choke points, but the question is how one can determine that, and be sure about it.

E. coli's hypothetical protein (UTI89_C4288) has a wobble-base GC percentage of 64.3%, whereas the gene on the opposite strand (just below it, pointing left), namely UTI89_C4287 (marked as "membrane-bound ATP synthase F1 sector alpha-subunit"), has GC3 = 54.5%. In a much higher-GC organism like Mycobacterium or Pseudomonas, you would find out which gene has the higher GC3 percentage and crown it the winner (and most of the time, you'd be right). In this case, it's not so simple. The gene with the higher GC3 value isn't necessarily the winner.

Of course, in this particular example, you can cheat and look at the identities of the genes in the immediate vicinity of the hypothetical protein, on the bottom strand, and if you do, you'll find that all of the bottom-strand genes are ATPase subunits. Mystery solved, right? Sure, in this particular case. But what about situations where overlapping genes are all shown as "hypothetical protein"? (You can find many such cases in the genome for Burkholderia pseudomallei strain 1710b, for example.) When a hypothetical overlaps a hypothetical in a low-GC genome, then what?

One of my favorite cheats (but this isn't the final solution!) is to check the gene's AG1 percentage (adenine plus guanine, codon base one). This percentage averages ~60% in something like 90% of protein-coding genes. The problem is, AG1 is often 60% whether you read the gene forward, or backward (off the antisense strand). The reverse complement of a gene usually has high AG1, because the forward AG3 is usually under 50%.

Almost any trick you can dream up will fail under edge cases. GC3 is helpful, but only in high-GC genomes. AG1 is helpful, but only sometimes. Shine Dalgarno signals are not universally used by all organisms, and even in those that do use them, they're usually reserved for highly conserved genes encoding things like ribosomal proteins. Gene context is helpful in some cases but not others.

It turns out, the best clue for positively identifying the correct strand and correct reading frame is codon usage frequency patterns. If you know what the codon frequencies are, genome-wide, for a given organism, you can use this information to good advantage, even if the genome (and therefore the codon table) contains inaccuracies. As long as the codon frequencies are approximately correct, you can use them to verify the reading frame of a protein-coding gene.

The algorithm I came up with is very simple, yet effective. For a given gene, read each triplet of bases sequentially, and score each triplet twice: keep two scores going. First, score it according to its frequency in the codon table for the organism. Then score it according to a second table developed for reverse-complement codons.

The following table shows codon frequencies in Caulobacter crescentus NA1000. If you were to encounter a "hypothetical protein" gene in Caulobacter, and you couldn't decide whether the strand assignment was correct or not, first develop a score for the gene by reading its triplets and adding the frequency value of the corresponding codon to the running total. For example, if you encounter the triplet "CTG," add 6.84 to the score (see table). For every occurrence of CTT, add 0.60, for CTC add 1.70, and so on, using the values in the table.

Codon frequencies for Caulobacter crescentus NA1000.
But you also have to create an anticodon frequency table as follows: For every codon in the original table, apply the same score to the corresponding reverse-complement codon in the "antcodon table." E.g., for CTG, the first table would contain 6.84 (as above), but the value 6.84 would apply to CAG (the reverse complement of CTG) in the second table. I call the first table the "forward" table and the second table the "back" table. One represents the frequencies of codons encountered in protein genes in the forward reading direction. The other represents those same frequencies applied to the reverse-complement of the codons (the same codons read in the reverse direction, off the opposite DNA strand).

When scoring an unknown gene, you tally a "forward table" score, and keep a separate score using the "back" table. When you're done, the gene's "forward table" score should be greater than the "back table" score. If it's not, you're reading the gene off the wrong strand.

When I scored all 3,737 C. crescentus CB15 genes using this technique, I found 136 genes that gave a "back" score higher than the "forward" score. Interestingly, when I checked the identities of those 136 putative "backwards-annotated" genes, 132 of them were listed as "hypothetical proteins." Only four genes with assigned functions gave suspect scores, and one of those (CC_0662) turns out to be a 100%-identity match for the reverse of gene CCNA_00700 in Caulobacter crescenstus NA1000.  The other three are less than 200 bases long and could well be non-coding regions.

For a more challenging test, I turned to the genome of Rothia mucilaginosa DY-18, one of the most disastrous annotation nightmares of all time. In the genome for DY-18 you will find 524 protein-coding genes (out of 1,905 total) that are involved in significant overlaps. (Some overlaps are 2-on-1, some are 1-on-1; but the genome is almost certainly overannotated by at least 260 genes.) I trained my program on the codon usage table of R. mucilaginosa M508 (which contains fewer overlaps than DY-18), then tallied codon and anticodon scores on all of DY-18's CDS genes. In the end, 276 genes gave scores indicative of a reversed reading frame. Of those, 265 were, in fact, involved in overlaps.

Codon scoring is such an effective method, I don't know why programs like Glimmer don't use it. It's quite obvious they're not using it, though, because every genome has reverse-annotated genes (by the hundreds, in some cases) that are easily detected using this simple method.

Here, for the record, are the Caulobacter crescentus CB15 genes that appear to be annotated on the wrong strand:

CC_0023
CC_0048
CC_0073
CC_0099
CC_0149
CC_0354
CC_0480
CC_0546
CC_0564
CC_0605
CC_0662
CC_0666
CC_0676
CC_0677
CC_0680
CC_0681
CC_0687
CC_0728
CC_0739
CC_0775
CC_0782
CC_0786
CC_0825
CC_0850
CC_0853
CC_0913
CC_0987
CC_0996
CC_0997
CC_1020
CC_1022
CC_1031
CC_1032
CC_1050
CC_1069
CC_1073
CC_1084
CC_1094
CC_1123
CC_1127
CC_1161
CC_1174
CC_1212
CC_1222
CC_1238
CC_1245
CC_1274
CC_1312
CC_1322
CC_1340
CC_1349
CC_1392
CC_1393
CC_1394
CC_1395
CC_1414
CC_1416
CC_1513
CC_1561
CC_1648
CC_1789
CC_1793
CC_2000
CC_2086
CC_2116
CC_2163
CC_2184
CC_2193
CC_2240
CC_2256
CC_2308
CC_2334
CC_2338
CC_2351
CC_2376
CC_2413
CC_2424
CC_2442
CC_2445
CC_2450
CC_2452
CC_2471
CC_2475
CC_2499
CC_2519
CC_2525
CC_2571
CC_2574
CC_2597
CC_2602
CC_2621
CC_2624
CC_2665
CC_2698
CC_2705
CC_2718
CC_2719
CC_2720
CC_2731
CC_2732
CC_2738
CC_2739
CC_2756
CC_2769
CC_2800
CC_2850
CC_2865
CC_2875
CC_2878
CC_2907
CC_2916
CC_2949
CC_3050
CC_3055
CC_3251
CC_3302
CC_3318
CC_3342
CC_3360
CC_3429
CC_3437
CC_3438
CC_3451
CC_3453
CC_3463
CC_3479
CC_3517
CC_3519
CC_3547
CC_3548
CC_3553
CC_3554
CC_3608
CC_3665
CC_3671
CC_3700

Sunday, May 11, 2014

Making Sense of Antisense

One reason genomes are so poorly annotated is that annotation software (of the Glimmer variety) gets easily confused by high-GC-content genome data. When a genome is high in guanine and cytosine, relatively few stop codons are present in alternate reading frames. (Recall that DNA is read in triplets of letters, called codons: AAA, AAC, AAT, AGT, etc. There are 6 possible reading frames for any given segment of DNA, representing 3 forward reading frames and 3 backward frames.) Stop codons (TGA, TAG, TAA) are mostly composed of A and T, not G and C. But also, it so happens that most protein-coding genes follow a certain pattern of codon construction. The first base in a 3-letter codon is usually A or G (about 60% of the time, in most genes, in most organisms). The second base is highly variable in all respects. The third base is usually reflective of overall genome composition: If the genome is high in G and C, the third base of each codon will tend to be G or C. (This happens almost all the time in Streptomyces, for example, where the third base G+C content is 97%.) If the genome is high in A and T, the third codon base will be high in A and T.

What's perhaps unexpected is that the same compositional pattern can sometimes work in reverse, on the complementing DNA strand. When you look at a protein's codons and see that 60% use a purine in the first base and but only 40% use a purine in the third base, this means that the reverse complement of the codon also has 60% purine content in base one and 40% purine content in base three. Example: the codon GCT (alanine) has a purine (G) followed by a pyrimidine (C) followed by a pyrimidine. The reverse complement codon, AGC (serine) also begins with a purine (A) and ends with a pyrimidine (C). This type of symmetry tends to be a confounding factor for programs like Glimmer that try to distinguish sense from antisense strands and coding from non-coding regions, and normal reading frames from nonsense frames.

Perhaps some real-world data will make this clearer. Below is a plot of AG1 (purine content at base one) versus GC1 (guanine plus cytosine, base one) for all codons of all genes of the soil bacterium Pseudomonas fluorescens PF0-1. Each point represents one gene's worth of data. For each data point, I simply went through all of that gene's codons and tallied up the A, G, C, and T at each base position, then found the average AG1 and GC1 for the gene in question. I did this for all 5,722 protein-coding genes in the genome. (Don't worry. Scripts do the whole thing in the blink of an eye. It takes less than 10 milliseconds to process one gene's worth of data.) Notice how the points cluster at y=0.6, meaning most genes have an average AG1 (first-base purine) content of around 60%.

AG1 (purine content, base one) versus overall G+C content, for N=5722 protein-coding genes of Pseudomonas fluorescens PF0-1. Each dot represents statistics for one gene. Notice that the dots tend to cluster at about y=0.6.

Now have a look at the graph below, which is the same kind of plot except we're looking at data for the third codon base. Here, the median y-value is only 0.453, meaning that purine content averages around 45% in the third base. That means on the opposite strand, in the same position, there's a purine ~55% of the time.
AG3 (purines, base three) versus G+C for all protein-coding genes (N=5722) in P. fluorescens PF0-1. The median y-value is 0.453. The long comet tail in the direction of low G+C could be due to "AT drift." It could also partly be due to incorrect annotation of genes as to "sense" versus "antisense" strands.

I ran some numbers and found that in P. fluorescens, 74% of genes have codons that are purine-heavy on the front (AG1 greater than 55%), in the normal reading direction, but most genes (53%) are also purine-heavy in base one when read in the reverse-complement sense. In fact, for every three protein genes that have AG1 greater than 55% and GC3 above 60% in the normal reading frame, there are two genes that meet the same criteria when translated in the reverse-complement frame. This means that for quite a few genes in Pseudomonas, the normal reading frame has similar compositional statistics to the reverse reading frame. (Note that codon base two tends to average 48% purines in the forward direction and 52% in the back direction.) To a program like Glimmer, many protein genes look surprisingly similar whether read from the sense strand of DNA or the antisense strand. Distinguishing sense from antisense is not trivial, in other words, although in an upcoming post I'll talk about a way to do it via codon bias.

To get a better idea of how universal this bidirectional codon symmetry might be, I obtained the codon statistics for 109 organisms and calculated the average base-3 composition stats, and came up with the following graph that plots AG3 (purines, base three) against overall genome A+T content:
Codon base three purine content versus genome A+T content for N=109 bacterial species. Every point represents aggregate codon stats for one organism. The size of each circle is proportional to genome size of the organism.

The size of each circle is proportional to the genome size of the organism in question. (As you can see, organisms with high genomic A+T content often tend to have smaller genomes.) Notice that base three tends to be a pyrimidine (AG3 less than 50%), in the normal reading frame, for almost all organisms. (That means it's a purine in the reverse reading frame.) A few circles appear above y=0.5, but not many, and not by much.

Bottom line: Distinguishing sense from antisense DNA is not a straightforward matter, since many codons have similar composition statistics in forward and backward frames. Breaking the deadlock might require finding a Shine Dalgarno sequence for one frame but not the other, or it could mean running a homology check (via BLAST) against similar genes in a database, but in that case you have to hope the strand assignment was correct in the database genes (which it often is not).

Incidentally, I know of no a priori reason why the wobble base should accumulate pyrimidines preferentially (even though that's what the data say). In theory, base three is (for most codons) a degenerate position and should be neutral (free to accumulate bases of any type). We know that base one tends to be a purine 60% of the time. The fact that base three is a pyrimidine 54.7% of the time is suspicious (arguably) and tends to imply that some genes are annotated backwards. As we'll see in a future post, the backwards-annotation problem isn't a huge issue in most organisms, but it's not negligible, either.

Thursday, May 01, 2014

A Strange Codon Symmetry

Codons have a peculiar symmetry property that's not much discussed, which is that if you look at codon usage in protein genes for an organism, every codon occurs at roughly the same frequency as its reverse-complement pair. For example, the codon GCT (alanine) has a reverse complement of AGC (serine). If GCT occurs at a high rate, AGC will occur at a similar high rate. If one is low in frequency, the other will be low too. I've written about this before, but I want to return to it one more time, because it's bizarre and crazy and deserves an explanation, and I'm hard pressed to come up with one.
Usage frequencies for codons in Ktedonobacter racemifer. Click to enlarge.

This correlation tends to be highest in organisms with genomes that have a high G+C content and lowest for organisms that have low G+C content. The Pearson coefficient ranges from about 0.8 (high GC) to 0.3 (low GC).

I've never heard a good explanation for this phenomenon. I like to think aliens aren't to blame, though.

One possible explanation is that many protein genes originated, long ago, as antisense proteins, following a gene duplication event. If a protein is made from the plus strand of gene X and the corresponding antiprotein is made from the minus strand of the same gene (X'), the X' copy will have different amino acids from the normal copy (of course) but the codon/anticodon usage rates will be the same. 

Therefore, what we may be looking at is an echo (a reverse complement signal) from the distant past.

Another possibility is that antisense regions become active in the process of gene duplications. Suppose a gene (Gene1) gets copied, but the copy (Gene2) gets clipped during the duplication. In its new location on the genome, Gene2 will exist without its original stop codon. The nearest naturally occurring stop codon may be 60 base pairs downstream. But underneath that 60-bp region might be another gene (Gene3) on the opposite strand. Ultimately, the copied gene overlaps the gene underneath it by 60 bp:

Gene1 Gene2
----> ---->
        <-----
         Gene3

This is called a convergent overlap, and such overlaps are common in nature. They're seldom longer than 100 base pairs (and are often just a few base pairs long). Divergent overlaps also occur.

Any kind of overlap will mean that an "antisense" signal will enter the codon pool.

One thing is for sure: The correlation between occurrence rates of codons and their reverse-complements is too strong to be due to chance.

A dramatic example of the kind of symmetry I'm talking about comes by way of a monster bacterium known as Ktedonobacter racemifer DSM 44963, which is a monster in the sense of sheer genome size: It has a 13.6-million-base-pair genome encoding a whopping 11,540 putative genes (plus 1,178 pseudogenes). The codon usage frequencies are shown in the graphic further above. Each codon is shown with the corresponding reverse-complement codon. The length of the bars corresponds to overall usage frequency (across all proteins in the genome). Frequencies for codon/reverse-complement pairs correlate strongly (r=0.799); too strongly to be by chance (p < 001).

Out of curiosity, I checked the codon frequencies for yeast (Saccharomyces cerevisiae strain DBVPG1106), and the same sort of relationship (though not as strong) occurs:


Again, to a first approximation, the frequency of a codon is dictated by the frequency of its reverse-complement cousin. It's interesting that the relationship is still visibly apparent even though Saccharomyces has a coding-region GC content of just 37.4%.

Saturday, April 19, 2014

Codons and Reverse Complement Codons

A very unusual and surprising property of protein-coding genes is that if a codon A appears with a certain frequency in genes, the reverse-complement codon of A will also have a similar frequency of occurrence. For example: If CTT (leucine) appears at a frequency of 1%, the reverse complement codon AAG (lysine) will also appear at roughly 1%. If CGT (arginine) appears at 0.2%, ACG (threonine) will appear at around 0.2%. (These are whole-genome frequencies.)

This correlation is strongest (r=0.75) for organisms with a high genomic G+C content, such as Streptomyces griseus, and lowest (r=0.28) in low-GC organisms like Clostridium botulinum.

This is a very peculiar property, when you think about it. We don't usually imagine an organism being constrained in its choice of codons for a particular protein. If a particular protein calls for a huge amount of leucines (CTTCTTCTT) we don't imagine that there's a requirement for an equivalent quantity of AAG to be used somewhere else. And yet, the correlation between frequency-of-occurrence of a codon and its antisymmetric twin is, as I say, surprisingly high in many organisms.

This sort of thing is very hard to explain without invoking a theory of proteogenesis that involves antisense proteins. Imagine a poly-lysine gene of AAA repeated 100 times. The gene gets duplicated on the opposite strand. Now the original strand has 100 AAAs and a run of 100 TTTs. If a reading frame opens up on the TTT stretch (and the protein is beneficial to the organism; it survives), there is now codon/anticodon parity of the kind I'm describing, between codons in the poly-lysine gene and the poly-phenylalanine (TTT) gene.

Why does this relationship hold for high-GC organisms but not as much for low-GC organisms? Probably because antisense genes in high-AT organisms contain a lot of stop codons (TAA, TGA, TAG, which by the way occur at about the same frequencies as TTA, TCA, and CTA, respectively). The presence of few stop codons in high-GC antisense genes gives those genes a chance to be expressed and evolve further. Of course, if you buy this theory, it tends to argue for a "GC World"  scenario in which the early proteosome evolved from GC-rich double-stranded genomes.

To illustrate the unusual correlation I'm talking about, I took the codon frequencies of Pseudomonas fluorescens PF01 (genome-wide) and made a graph that plots the frequency of occurrence of each codon on the x-axis, versus the frequency of occurrence of the corresponding reverse-complement codon on the y-axis. (So if CTA occurs at 0.3% and TAG occurs at 0.2%, I plot a point at [0.3,  0.2].) The SVG graph (below) is interactive: You should be able to hover over a point and see a tooltip that shows the identity of the corresponding codon, and its reverse twin, and their respective frequencies.

NOTE: If your browser does not support SVG, a PNG copy of the graph is here.

The symmetry pattern is expected: For every codon/anticodon there's a corresponding anticodon/codon pair with frequencies swapped. What's more important than the symmetry pattern is the fact that frequency values in Y increase monotonically in X and vice versa, with a correlation coefficient in this case of r=0.63 (F-statistic 41, p < .001). This means that codons tend to occur at about the same frequencies as their reverse complement codons. There are outliers, to be sure, but the overall trend is statistically solid.

Leave a comment if you have any thoughts on what's going on here.