Saturday, May 10, 2014

The "Hypothetical Protein" Problem

Once a genome has been sequenced, identification of "coding" regions of DNA (and assignment of functions to those regions) is almost always done by software. The Glimmer (Gene Locator and Interpolated Markov Modeler) freeware package, available from Johns Hopkins University, is a widely used annotation system. It finds putative genes, identifies which strand they're on, locates start and stop codons, etc.

The technology is surprisingly good (Glimmer reputedly finds 99% of genes) but it also has serious limitations. Programs like Glimmer have a relatively easy time locating open reading frames in low-GC (low guanine plus cytosine) genomes but have a more difficult time identifying the "sense" strand in high-GC genomes. Also, gene size matters: Glimmer is great at finding sizable genes but is less accurate with small genes. These shortcomings are not unique to Glimmer but apply to all gene-detection software that I'm aware of.

In almost every bacterial genome, 20% to 40% of genes cannot be identified as to function and are tagged "hypothetical protein." These genes tend to differ from other genes in various ways. Size, for example.

Gene size distribution for N=3412 genes of assigned function (blue) and N=793 "hypothetical protein" genes (red) in E. coli B.
The above graph shows the size distribution of protein-coding genes in E. coli B. Hypothetical protein genes (N=793) are shown in red; genes with an assigned function (N=3412) are in blue. You can see that at gene sizes below ~500 base pairs, genes tend (more often than not) to be labeled "hypothetical protein." In E. coli, hypothetical-protein genes also tend to have lower GC content in the third codon base (the so-called "wobble" base). Whereas GC3 has a median value of 56.8% in genes with assigned functions, GC3 has a median value of 51.6% in hypothetical proteins.
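For anyone who wants to reproduce the GC3 comparison, here is a minimal Python sketch of the calculation. The function names (gc3, median_gc3) are made up for illustration, and the sketch assumes each gene is supplied as a plain in-frame coding sequence string; it is not the code used to produce the numbers above.

```python
from statistics import median


def gc3(cds: str) -> float:
    """Fraction of codons whose third ("wobble") base is G or C."""
    seq = cds.upper()
    # Ignore any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    # Every third base, starting at codon position 3 (index 2).
    third_bases = seq[2:usable:3]
    if not third_bases:
        raise ValueError("sequence is shorter than one codon")
    return sum(base in "GC" for base in third_bases) / len(third_bases)


def median_gc3(sequences) -> float:
    """Median GC3 over a collection of coding sequences."""
    return median(gc3(s) for s in sequences)
```

Calling median_gc3 once on the function-assigned genes and once on the hypothetical-protein genes would give the two medians being compared.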

In most genes, codons begin with a purine base (A or G) about 60% of the time. In E. coli, it's 59.8% of the time. Genes in which AG1 (purine content at codon base one) is less than 50% are rare. In E. coli, only 3.7% of function-assigned genes have AG1 under 50%, but 6.9% of hypothetical protein genes do, and hypotheticals are three times as likely to have AG1 less than AG2 (purine content at codon base two). The latter comparison is an important "sanity check" of reading frame. One would expect AG1 to be less than AG2 only in genes with a frameshift error or incorrect reading-frame assignment.
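The AG1-versus-AG2 sanity check is easy to sketch in Python. The names purine_fraction and frame_looks_suspect are hypothetical, and the sketch assumes the input is the annotated coding sequence, read 5' to 3' in its annotated frame.

```python
def purine_fraction(cds: str, position: int) -> float:
    """Fraction of codons with a purine (A or G) at codon position 1, 2, or 3."""
    if position not in (1, 2, 3):
        raise ValueError("position must be 1, 2, or 3")
    seq = cds.upper()
    # Ignore any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    bases = seq[position - 1:usable:3]
    if not bases:
        raise ValueError("sequence is shorter than one codon")
    return sum(base in "AG" for base in bases) / len(bases)


def frame_looks_suspect(cds: str) -> bool:
    """True when AG1 < AG2, the pattern flagged above as a possible frame problem."""
    return purine_fraction(cds, 1) < purine_fraction(cds, 2)
```

A True result is only a hint that the gene deserves a closer look, not proof of a frameshift.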

If 6.9% of hypothetical protein genes in E. coli are in the wrong reading frame, that's about 55 genes (6.9% of 793). (Hypothetical genes comprise 19% of E. coli's 4205 CDS genes.)

When I made a cursory spot check of "hypothetical protein" genes in E. coli B, I quickly found a number of genes (ECB_00841, ECB_01484, ECB_01676, ECB_02804, ECB_03339, and others) that were incorrectly designated as to the sense versus antisense strand. These genes produced more and better BLAST hits when translated from the reverse complement than when translated from the annotated strand. Some of them had stop codons in the middle, but that's okay. That just means they're pseudogenes. It's vitally important to correctly identify pseudogenes. (In most bacterial genomes, pseudogenes are vastly underreported, because they're hard to distinguish from non-coding regions.)
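The strand comparison described above can be sketched in a few lines of Python: translate the annotated sequence and its reverse complement, then count stop codons that fall before the final position. The function names are hypothetical, and the codon table is the standard genetic code (the bacterial table assigns the same amino acids). Running BLAST is left out, so the two protein strings would still need to be submitted to BLAST separately.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate in frame 1; '*' marks a stop, 'X' an unrecognized codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


def internal_stops(protein: str) -> int:
    """Count stop codons anywhere except the final position."""
    return protein[:-1].count("*")


def both_strand_translations(seq: str) -> dict:
    """Translate the sequence as annotated and from the opposite strand."""
    return {
        "forward": translate(seq),
        "reverse": translate(reverse_complement(seq)),
    }
```

A nonzero internal_stops count on the strand with the better hits is the "stop codons in the middle" situation that suggests a pseudogene.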

Bottom line, there's ample reason to believe a substantial fraction of the 793 "hypothetical protein" genes in E. coli are in fact in the wrong reading frame, are pseudogenes, or have an improperly located start codon (or harbor other serious errors).

I don't think E. coli is exceptional. I'm convinced misannotation is a serious issue affecting 20% or more of all machine-annotated genes. Some annotation programs do a better job than others of hiding problems like overlap resolution and small-gene detection, but the problems are there, big time, and aren't likely to go away any time soon.