The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%–63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation has increased from 1993 to 2005. The types of misannotation that were found fall into several categories, most associated with “overprediction” of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized.

There are several important aspects to the problem:
- Several types of gene discovery software are in common use (each with its own strengths and weaknesses) and there is little standardization of output.
- Most gene discovery programs (such as the widely used Glimmer) must be trained on high-quality training data in order to produce reliable output (see the sketch after this list).
- Faulty data, once it enters the gene databases, becomes part of the training sets that other researchers use to train their own software. Hence, error propagation is a major ongoing problem.
- Most gene discovery programs are not good at resolving gene-overlap situations, sometimes relying on unsophisticated heuristics to determine sense/antisense strand assignment.
- Crosschecking of results is a time-consuming, manual task and is rarely done for an entire genome.
- Public gene databases do not usually include scoring information (a rating of the confidence level of a given gene's functional assignment).
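To make the training requirement concrete, here is a minimal sketch of the standard self-training workflow, assuming the Glimmer3 binaries (long-orfs, extract, build-icm, glimmer3) are on the PATH. The flags follow the g3-from-scratch recipe that ships with Glimmer3; genome.fasta and the "run" tag are placeholders:

```python
import subprocess

GENOME = "genome.fasta"   # placeholder input
TAG = "run"               # placeholder prefix for output files

def step(cmd, **kwargs):
    """Echo and run one pipeline stage, stopping on any failure."""
    print(">", " ".join(cmd))
    subprocess.run(cmd, check=True, **kwargs)

# 1. Find long, non-overlapping ORFs to serve as a bootstrap training set.
step(["long-orfs", "-n", "-t", "1.15", GENOME, f"{TAG}.longorfs"])

# 2. Pull those ORF sequences out of the genome.
with open(f"{TAG}.train", "w") as out:
    step(["extract", "-t", GENOME, f"{TAG}.longorfs"], stdout=out)

# 3. Build the interpolated context model (ICM) from the training sequences.
with open(f"{TAG}.train") as inp:
    step(["build-icm", "-r", f"{TAG}.icm"], stdin=inp)

# 4. Run Glimmer3 itself, scoring candidate genes with the trained model.
step(["glimmer3", "-o50", "-g110", "-t30", GENOME, f"{TAG}.icm", TAG])
```

The point of the sketch is that everything downstream depends on the quality of the sequences fed to build-icm; train on misannotated genes and the model dutifully learns the mistakes.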
As noted in the last point above, a confidence-scoring system is very much needed, so that a person can tell at a glance whether a given annotation is trustworthy, and to what extent.
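What might such a score look like? Purely as a hypothetical sketch (no public database offers anything like this today, and the field names and weights below are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class AnnotationConfidence:
    """Hypothetical per-gene evidence record; all fields and weights are invented."""
    locus_tag: str
    experimental_evidence: bool   # function demonstrated in the lab
    curated_homolog: bool         # best hit is in a curated DB (e.g., Swiss-Prot)
    context_consistent: bool      # agrees with neighboring genes / operon structure
    machine_predicted_only: bool  # nothing behind it but a pipeline's say-so

    def score(self) -> float:
        """Collapse the evidence into a 0-to-1 trust score (toy weighting)."""
        s = 0.0
        if self.experimental_evidence:
            s += 0.6
        if self.curated_homolog:
            s += 0.25
        if self.context_consistent:
            s += 0.15
        if self.machine_predicted_only:
            s = min(s, 0.2)       # cap: unreviewed predictions stay low-trust
        return min(s, 1.0)

print(AnnotationConfidence("b0001", False, True, True, False).score())  # 0.4
```

Even something this crude, displayed next to every gene, would tell a reader at a glance which annotations rest on evidence and which rest on software.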
Ironically, failure to find genes (false negatives) is not the problem; false positives are. Glimmer 3 reportedly produces far fewer false positives than the previous version of the software, but probably two-thirds of existing public genomes were annotated with Glimmer 2 and have yet to be revised.
BLAST searches (to crosscheck a gene against other genes) are of limited utility, since there is already so much corrupt data in the public databases. It's not uncommon to check a gene that has a questionable functional assignment and find that a BLAST search brings back genes with equally questionable functional assignments, because the same annotation software has already choked on the orthologous genes.

Nevertheless, careful detective work can sometimes verify that a gene really is what the annotation says it is: examining the gene in relation to nearby genes (which often share related functions), inspecting its Shine-Dalgarno region (if it has one), checking its GC3 value (see the sketch below), and so on. Sometimes it's even possible to assign a function to a gene marked as "hypothetical protein."

This sort of work is tedious, but it's exactly the sort of work that could and should be crowdsourced to spare-time biohackers. It would help if a system were in place for volunteers to do basic "sanity checking" of machine-annotated genes. It would take quite an army of volunteers to go through the 20,000 or so bacterial genomes (each with several thousand genes, on average) already on the books, but who knows? Maybe a Craig Venter could organize such an effort and make it work.
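As a concrete example of the kind of sanity check a volunteer could run, here is a small sketch that flags a candidate gene whose GC3 (the G+C fraction at third codon positions) falls far outside the range of the genome's trusted coding sequences. The two-standard-deviation cutoff and the toy sequences are arbitrary illustrations:

```python
from statistics import mean, stdev

def gc3(cds: str) -> float:
    """G+C fraction at third codon positions of an in-frame coding sequence."""
    thirds = cds.upper()[2::3]   # every third base, starting at the third base of codon 1
    return sum(base in "GC" for base in thirds) / len(thirds)

def looks_anomalous(candidate: str, trusted: list[str], n_sd: float = 2.0) -> bool:
    """Flag the candidate if its GC3 lies more than n_sd standard deviations
    from the mean GC3 of the trusted coding sequences."""
    values = [gc3(s) for s in trusted]
    return abs(gc3(candidate) - mean(values)) > n_sd * stdev(values)

# Toy usage with made-up 5-codon sequences:
trusted = ["ATGGCCGAAGCGTAA", "ATGACCGGCGCCTGA", "ATGGCGAAAGCGTAG"]
print(looks_anomalous("ATGAAATATAAATAA", trusted))  # True: GC3 of 0.2 vs. ~0.73 mean
```

A mismatch doesn't prove the gene call is wrong (horizontally transferred genes have odd composition too), but it's a cheap red flag telling a human to look closer.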
One thing is for sure. If we don't do something to improve genome annotation quality, we'll soon be drowning in junk annotations; never mind junk DNA.