I've presented examples of this problem before, but just so there's no confusion, I want to show you a particularly maddening example so you can see what I'm talking about. (Then I'll suggest a solution.) The following graphic shows a region of E. coli UTI89 in which several genes are shown as overlapping (that is to say, existing on opposite strands of DNA in the same coverage area). Small overlaps sometimes happen between genes, but whole genes rarely, if ever, overlap, and never in clusters. The situation shown below is bogus, but you see it all the time in public genomes. In fact, some of the genes shown below also show up as overlaps in Mycobacterium abscessus M93 (see gene OUW_18941), Citrobacter koseri strain ATCC BAA-895 (gene CKO_00072), and quite a few others. Glimmer has choked on this exact situation many times, in many genomes.
|A region with overlapping genes in the genome of E. coli UTI89.
The big gene on the top strand, middle, is UTI89_C4288 (DNA sequence here). It's annotated as (what else?) a "hypothetical protein." The M. abscessus version of the gene (here) is marked as a "cellobiose phosphorylase," and you can find many BLAST hits (the Rothia version gives an E-value of 6.0×10-101) for similar "cellobiose phosphorylase" genes in other organisms at UniProt.org and elsewhere. Of course, they're all bogus and represent Glimmer choke points, but the question is how one can determine that, and be sure about it.
E. coli's hypothetical protein (UTI89_C4288) has a wobble-base GC percentage of 64.3%, whereas the gene on the opposite strand (just below it, pointing left), namely UTI89_C4287 (marked as "membrane-bound ATP synthase F1 sector alpha-subunit"), has GC3 = 54.5%. In a much higher-GC organism like Mycobacterium or Pseudomonas, you would find out which gene has the higher GC3 percentage and crown it the winner (and most of the time, you'd be right). In this case, it's not so simple. The gene with the higher GC3 value isn't necessarily the winner.
Of course, in this particular example, you can cheat and look at the identities of the genes in the immediate vicinity of the hypothetical protein, on the bottom strand, and if you do, you'll find that all of the bottom-strand genes are ATPase subunits. Mystery solved, right? Sure, in this particular case. But what about situations where overlapping genes are all shown as "hypothetical protein"? (You can find many such cases in the genome for Burkholderia pseudomallei strain 1710b, for example.) When a hypothetical overlaps a hypothetical in a low-GC genome, then what?
One of my favorite cheats (but this isn't the final solution!) is to check the gene's AG1 percentage (adenine plus guanine, codon base one). This percentage averages ~60% in something like 90% of protein-coding genes. The problem is, AG1 is often 60% whether you read the gene forward, or backward (off the antisense strand). The reverse complement of a gene usually has high AG1, because the forward AG3 is usually under 50%.
Almost any trick you can dream up will fail under edge cases. GC3 is helpful, but only in high-GC genomes. AG1 is helpful, but only sometimes. Shine Dalgarno signals are not universally used by all organisms, and even in those that do use them, they're usually reserved for highly conserved genes encoding things like ribosomal proteins. Gene context is helpful in some cases but not others.
It turns out, the best clue for positively identifying the correct strand and correct reading frame is codon usage frequency patterns. If you know what the codon frequencies are, genome-wide, for a given organism, you can use this information to good advantage, even if the genome (and therefore the codon table) contains inaccuracies. As long as the codon frequencies are approximately correct, you can use them to verify the reading frame of a protein-coding gene.
The algorithm I came up with is very simple, yet effective. For a given gene, read each triplet of bases sequentially, and score each triplet twice: keep two scores going. First, score it according to its frequency in the codon table for the organism. Then score it according to a second table developed for reverse-complement codons.
The following table shows codon frequencies in Caulobacter crescentus NA1000. If you were to encounter a "hypothetical protein" gene in Caulobacter, and you couldn't decide whether the strand assignment was correct or not, first develop a score for the gene by reading its triplets and adding the frequency value of the corresponding codon to the running total. For example, if you encounter the triplet "CTG," add 6.84 to the score (see table). For every occurrence of CTT, add 0.60, for CTC add 1.70, and so on, using the values in the table.
|Codon frequencies for Caulobacter crescentus NA1000.
When scoring an unknown gene, you tally a "forward table" score, and keep a separate score using the "back" table. When you're done, the gene's "forward table" score should be greater than the "back table" score. If it's not, you're reading the gene off the wrong strand.
When I scored all 3,737 C. crescentus CB15 genes using this technique, I found 136 genes that gave a "back" score higher than the "forward" score. Interestingly, when I checked the identities of those 136 putative "backwards-annotated" genes, 132 of them were listed as "hypothetical proteins." Only four genes with assigned functions gave suspect scores, and one of those (CC_0662) turns out to be a 100%-identity match for the reverse of gene CCNA_00700 in Caulobacter crescenstus NA1000. The other three are less than 200 bases long and could well be non-coding regions.
For a more challenging test, I turned to the genome of Rothia mucilaginosa DY-18, one of the most disastrous annotation nightmares of all time. In the genome for DY-18 you will find 524 protein-coding genes (out of 1,905 total) that are involved in significant overlaps. (Some overlaps are 2-on-1, some are 1-on-1; but the genome is almost certainly overannotated by at least 260 genes.) I trained my program on the codon usage table of R. mucilaginosa M508 (which contains fewer overlaps than DY-18), then tallied codon and anticodon scores on all of DY-18's CDS genes. In the end, 276 genes gave scores indicative of a reversed reading frame. Of those, 265 were, in fact, involved in overlaps.
Codon scoring is such an effective method, I don't know why programs like Glimmer don't use it. It's quite obvious they're not using it, though, because every genome has reverse-annotated genes (by the hundreds, in some cases) that are easily detected using this simple method.
Here, for the record, are the Caulobacter crescentus CB15 genes that appear to be annotated on the wrong strand: