One reason genomes are so poorly annotated is that annotation software (of the Glimmer variety) is easily confused by high-GC-content genome data. When a genome is high in guanine and cytosine, relatively few stop codons are present in alternate reading frames. (Recall that DNA is read in triplets of letters, called codons: AAA, AAC, AAT, AGT, etc. There are 6 possible reading frames for any given segment of DNA: 3 forward frames and 3 backward frames.) Stop codons (TGA, TAG, TAA) are mostly composed of A and T, not G and C, so they turn up by chance less often in GC-rich DNA. On top of that, most protein-coding genes follow a certain pattern of codon construction. The first base in a 3-letter codon is usually A or G (about 60% of the time, in most genes, in most organisms). The second base is highly variable in all respects. The third base usually reflects overall genome composition: If the genome is high in G and C, the third base of each codon will tend to be G or C. (This happens almost all the time in Streptomyces, for example, where the third-base G+C content is 97%.) If the genome is high in A and T, the third codon base will be high in A and T.
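To make the six-reading-frame idea concrete, here is a minimal Python sketch that counts stop codons in each frame of a DNA string. The function names and frame labels ("+1" through "-3") are my own choices for illustration, not anything Glimmer uses, and the sketch assumes the sequence contains only A, C, G, and T.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def codons(seq, offset=0):
    """Split seq into complete triplets, starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def stop_counts_by_frame(seq):
    """Count stop codons in the 3 forward and 3 reverse reading frames."""
    seq = seq.upper()
    counts = {}
    for label, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frame = f"{label}{offset + 1}"
            counts[frame] = sum(c in STOP_CODONS for c in codons(strand, offset))
    return counts


if __name__ == "__main__":
    # Hypothetical toy sequences, not real genes.
    print(stop_counts_by_frame("ATGAAATAG"))
    print(stop_counts_by_frame("ATGGCCGGCGCCGGGCCCTGA"))
```

Run on a GC-rich stretch, the alternate frames typically come back with few or no stops, which is exactly the situation that leaves a gene finder with fewer clues about which frame is the real one.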
What's perhaps unexpected is that the same compositional pattern can sometimes work in reverse, on the complementary DNA strand. When you look at a protein's codons and see that 60% use a purine in the first base but only 40% use a purine in the third base, this means that the reverse complement of the codon also has 60% purine content in base one and 40% purine content in base three. (Base one of the reverse-complement codon is the complement of base three of the original, and vice versa, so 40% purines at base three becomes 60% purines at base one on the other strand.) Example: the codon GCT (alanine) has a purine (G) followed by a pyrimidine (C) followed by a pyrimidine (T). The reverse-complement codon, AGC (serine), also begins with a purine (A) and ends with a pyrimidine (C). This type of symmetry tends to be a confounding factor for programs like Glimmer that try to distinguish sense from antisense strands, coding from non-coding regions, and normal reading frames from nonsense frames.
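The symmetry is easy to check numerically. The sketch below (Python, with hypothetical helper names and a made-up toy sequence) computes the purine fraction at a given codon position and confirms that base one of the reverse complement mirrors base three of the forward strand. It assumes an in-frame sequence whose length is a multiple of three and that contains only A, C, G, and T.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def purine_fraction(seq, position):
    """Fraction of codons with A or G at codon position 1, 2, or 3."""
    if len(seq) == 0 or len(seq) % 3 != 0:
        raise ValueError("sequence length must be a non-zero multiple of 3")
    column = seq.upper()[position - 1::3]  # every third base from that position
    return sum(base in "AG" for base in column) / len(column)


if __name__ == "__main__":
    gene = "GCTGCTAAAGGC"  # hypothetical toy sequence: GCT GCT AAA GGC
    rc = reverse_complement(gene)
    print("forward AG1, AG3:", purine_fraction(gene, 1), purine_fraction(gene, 3))
    print("reverse AG1, AG3:", purine_fraction(rc, 1), purine_fraction(rc, 3))
    # Purines at base one of the reverse frame are the pyrimidines
    # at base three of the forward frame, and vice versa.
    assert purine_fraction(rc, 1) == 1 - purine_fraction(gene, 3)
    assert purine_fraction(rc, 3) == 1 - purine_fraction(gene, 1)
```

Because the relationship is exact for any in-frame sequence, a gene with 60% purines at base one and 40% at base three will show the same 60/40 profile when read off the opposite strand.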
Perhaps some real-world data will make this clearer. Below is a plot of AG1 (purine content at base one) versus GC1 (guanine plus cytosine, base one) for all codons of all genes of the soil bacterium Pseudomonas fluorescens Pf0-1. Each point represents one gene's worth of data. For each data point, I simply went through all of that gene's codons and tallied up the A, G, C, and T at each base position, then computed the AG1 and GC1 fractions for the gene in question. I did this for all 5,722 protein-coding genes in the genome. (Don't worry. Scripts do the whole thing in the blink of an eye. It takes less than 10 milliseconds to process one gene's worth of data.) Notice how the points cluster at y=0.6, meaning most genes have an average AG1 (first-base purine) content of around 60%.
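For anyone who wants to reproduce this kind of tally, here is a rough Python sketch of the per-gene calculation. It is not the script used for the plots, just one way to do the same bookkeeping; the function names are hypothetical, and it assumes each gene arrives as an in-frame coding sequence string.

```python
from collections import Counter


def base_counts(cds, position):
    """Tally A, C, G, T at codon position 1, 2, or 3 of a coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[position - 1:usable:3])


def fraction(counts, bases):
    """Share of the tallied A/C/G/T bases that fall in `bases`."""
    total = sum(counts[b] for b in "ACGT")  # ambiguity codes such as N are skipped
    return sum(counts[b] for b in bases) / total if total else 0.0


def gene_point(cds):
    """Return the (GC1, AG1) values that place one gene on the plot."""
    first = base_counts(cds, 1)
    return {"AG1": fraction(first, "AG"), "GC1": fraction(first, "GC")}


if __name__ == "__main__":
    # Hypothetical toy gene: ATG GCC AAA TGA
    print(gene_point("ATGGCCAAATGA"))
```

Passing 3 instead of 1 to `base_counts` gives the third-base tallies, so the same helpers cover AG3 and GC3 as well.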
Now have a look at the graph below, which is the same kind of plot except we're looking at data for the third codon base. Here, the median y-value is only 0.453, meaning that purine content averages around 45% in the third base. That means on the opposite strand, in the same position, there's a purine ~55% of the time.
I ran some numbers and found that in P. fluorescens, 74% of genes have codons that are purine-heavy at base one (AG1 greater than 55%) in the normal reading direction, but most genes (53%) are also purine-heavy in base one when read in the reverse-complement sense. In fact, for every three protein genes that have AG1 greater than 55% and GC3 above 60% in the normal reading frame, there are two genes that meet the same criteria when translated in the reverse-complement frame. This means that for quite a few genes in Pseudomonas, the normal reading frame has similar compositional statistics to the reverse reading frame. (Note that codon base two tends to average 48% purines in the forward direction and 52% in the reverse direction.) To a program like Glimmer, many protein genes look surprisingly similar whether read from the sense strand of DNA or the antisense strand. Distinguishing sense from antisense is not trivial, in other words, although in an upcoming post I'll talk about a way to do it via codon bias.
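A tally like this can be sketched in a few lines of Python. The version below counts how many genes pass the AG1 > 55% and GC3 > 60% test in the forward frame and, separately, how many pass it in the reverse-complement frame. The thresholds come from the paragraph above; the function names and the three toy sequences are hypothetical, and each gene is assumed to be an in-frame string of A, C, G, and T with a length that is a multiple of three.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def position_fraction(cds, position, bases):
    """Fraction of codons whose base at `position` (1-3) is in `bases`."""
    column = cds.upper()[position - 1::3]  # assumes len(cds) is a multiple of 3
    return sum(b in bases for b in column) / len(column)


def meets_criteria(cds, ag1_min=0.55, gc3_min=0.60):
    """True if the frame is purine-heavy at base one and GC-heavy at base three."""
    return (position_fraction(cds, 1, "AG") > ag1_min
            and position_fraction(cds, 3, "GC") > gc3_min)


def tally(genes):
    """Count genes passing the test in the forward and reverse-complement frames."""
    return {
        "forward": sum(meets_criteria(g) for g in genes),
        "reverse": sum(meets_criteria(reverse_complement(g)) for g in genes),
    }


if __name__ == "__main__":
    toy_genes = [
        "GCCGCCAAGGGC",  # passes in both directions
        "ATGAAAGATTAA",  # passes in neither
        "GCGACGGAGAAG",  # passes forward only
    ]
    print(tally(toy_genes))
```

With real data, the ratio of the two counts is the number to watch: the closer "reverse" gets to "forward", the less these composition statistics can tell you about which strand is the coding one.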
To get a better idea of how universal this bidirectional codon symmetry might be, I obtained the codon statistics for 109 organisms and calculated the average base-3 composition stats, and came up with the following graph that plots AG3 (purines, base three) against overall genome A+T content:
The size of each circle is proportional to the genome size of the organism in question. (As you can see, organisms with high genomic A+T content tend to have smaller genomes.) Notice that base three tends to be a pyrimidine (AG3 less than 50%), in the normal reading frame, for almost all organisms. (That means base one tends to be a purine in the reverse-complement frame.) A few circles appear above y=0.5, but not many, and not by much.
Bottom line: Distinguishing sense from antisense DNA is not a straightforward matter, since many codons have similar composition statistics in forward and backward frames. Breaking the deadlock might require finding a Shine-Dalgarno sequence for one frame but not the other, or it could mean running a homology check (via BLAST) against similar genes in a database, but in that case you have to hope the strand assignment was correct in the database genes (which it often is not).
Incidentally, I know of no a priori reason why the wobble base should accumulate pyrimidines preferentially (even though that's what the data say). In theory, base three is (for most codons) a degenerate position and should be neutral (free to accumulate bases of any type). We know that base one tends to be a purine 60% of the time. The fact that base three is a pyrimidine 54.7% of the time is suspicious (arguably) and tends to imply that some genes are annotated backwards. As we'll see in a future post, the backwards-annotation problem isn't a huge issue in most organisms, but it's not negligible, either.