Tuesday, April 15, 2014

Coming to Grips with Pseudogenes

The term pseudogene was coined in 1977, when Jacq et al. discovered a version of the gene coding for 5S rRNA in the African clawed frog (Xenopus laevis) that was truncated yet retained homology with the active gene. Subsequent work has shown that in higher life forms, pseudogenes (genes that have been inactivated through one event or another) are almost as numerous as coding genes, with (for example) the human genome containing 10,000 or more pseudogenes. (A more recent estimate puts the number at 20,000.) Many of these pseudogenes are highly conserved. Looking at pseudogenes in the mouse and human, Svensson et al. found that of a group of 74 such genes that occur in both species, 30 appear to have been conserved since before the evolutionary divergence of mice and humans.

In higher organisms, pseudogenes are sometimes transcribed into RNA, with the RNA filling a regulatory function. For example, Korneev et al. found that simultaneous transcription of neural nitric oxide synthase (nNOS) and the antisense strand of a homologous pseudogene in the same neurons of Lymnaea stagnalis (a snail) leads to the formation of a duplex between the two strands and a reduction in nNOS translation. Further examples can be found in Pink et al. (2007), "Pseudogenes: Pseudo-functional or key regulators in health and disease?"

In bacteria, pseudogenes are somewhat rarer than in eukaryotes, but exist in significant numbers in many pathogens (including many species of Mycobacterium, Shigella, Brucella, Bordetella, and others). A study by Kuo and Ochman (2010) found that pseudogenes are swiftly eliminated from Salmonella. They describe "evidence of a strong deletional bias in Salmonella, such that genes that are not maintained by selection are rapidly inactivated and eliminated by mutational events." In fact, Kuo and Ochman found that pseudogenes are eliminated more rapidly than could be explained by the so-called neutral theory of evolution, indicating that the continued presence of pseudogenes exacts a high cost on the cell.

And yet, many bacteria with slow-evolving genomes (such as Mycobacterium species) retain their pseudogenes with high fidelity across evolutionary timespans. The most celebrated "pseudogene hoarder" of all time, M. leprae (the leprosy bacterium), appears to have acquired its 1000+ pseudogenes 9 to 20 million years ago. Meanwhile, the half-life of pseudogenes in Buchnera aphidicola was measured at 23.9 million years, a staggering number.
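
To get a feel for what a 23.9-million-year half-life means, here is a minimal sketch in Python. It assumes pseudogene loss follows simple first-order (exponential) decay, which is a simplification of mine rather than anything the Buchnera measurement guarantees. The function name fraction_remaining and the sample time points are made up for illustration.

```python
def fraction_remaining(elapsed_myr: float, half_life_myr: float = 23.9) -> float:
    """Fraction of an initial pseudogene pool still present after elapsed_myr
    million years, ASSUMING simple exponential decay (an illustrative model)."""
    return 0.5 ** (elapsed_myr / half_life_myr)


# Illustrative time points only; they are not taken from any study.
for t in (10, 23.9, 50):
    print(f"{t:>5} Myr: {fraction_remaining(t):.2f} of pseudogenes remaining")
```

Under that assumption, half of the original pseudogenes would still be sitting in the genome after 23.9 million years, and a quarter after roughly 48 million years.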

So on the one hand, we have work by Kuo and Ochman showing that pseudogenes in bacteria are rapidly eliminated, and on the other hand we have some bacterial lineages in which it seems pseudogenes are not only conserved but actively repaired over periods of tens of millions of years!

In Chapter 5 of Brucella: Molecular Microbiology and Genomics (2012, Caister Academic Press), Garcia-Lobo et al. describe their work with RNA sequence data from the bacterium Brucella abortus:
"Twenty-four of the genes selected from the RNAseq data were annotated as pseudogenes in the B. abortus 2308 genome, which was considered a rather unexpected finding. By comparison with other Brucella genomes we can reduce the list of highly expressed pseudogenes to 16 (often, truncated parts of a gene are annotated as different pseudogenes especially in B. abortus 2308). This seems contradictory since high transcription of these genes, which should be not able to translate into functional proteins, will be contrary to biological economy. The high levels of transcription observed for these genes strongly suggest that they could be active genes and their products may perform functions unreported in metabolic reconstructions. High pseudogene expression may also indicate that these are very recently produced pseudogenes that did not turned down transcription yet by accumulation of mutations in their promoter or control regions. It is also possible that these pseudogenes may contain sequencing errors and they are indeed active genes."
It's almost comically obvious from this passage that the authors are troubled by their own finding that some pseudogenes in Brucella are highly transcribed. They try explaining it away by saying it could all be "sequencing errors."

A more parsimonious view is that pseudogenes that haven't been eliminated from a genome are, in fact, permanent, legitimate fixtures of the landscape, in microbes just as in higher life forms. And as in higher life forms, pseudogenes in microbes are probably serving perfectly understandable regulatory functions (when they're not actually translated into protein products).

Kuo and Ochman have convincingly shown that useless pseudogenes are quickly eliminated. It follows that any pseudogenes that aren't swiftly eliminated are, in fact, serving a biological purpose, or else they wouldn't be there. This line of reasoning is already well accepted by researchers who study eukaryotic life forms. Those who study bacteria need to take a hint from their up-the-food-chain colleagues.

What could the hundreds of pseudogenes in Bordetella pertussis (or the 1000+ pseudogenes in M. leprae) be doing? First we need to get used to the idea that in bacteria, virtually all genes are transcribed, in both directions. It's been four years since Dornenburg et al. reported finding ~1000 antisense transcripts in E. coli, but no one seems to have gotten the memo.

A section of Rothia mucilaginosa genome (top) and a corresponding portion of Mycobacterium leprae (bottom). The yellow gene, in each case, is DnaE (error-prone polymerase). Pink bands indicate areas of 65% or more homology between the two organisms. The small-diameter silver genes in the lower panel are M. leprae pseudogenes. "Normal genes" are shown in green. Notice that R. mucilaginosa has open reading frames on both strands of DNA, with many bidirectionally overlapping genes.

A look at the genome of the bacterium Rothia mucilaginosa DY18 shows that a very large proportion of "normal genes" have open reading frames on the opposite strand (see illustration). Bidirectional overlapping genes run throughout the Rothia genome. A massive annotation error? Maybe. Or maybe both strands are transcribed.
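
For readers who want to see what "open reading frames on the opposite strand" means mechanically, here is a toy Python sketch. It is not how the Rothia genome was annotated (real gene finders use statistical models, alternative start codons, and much longer length cutoffs). It simply scans all six reading frames for ATG-to-stop stretches and reports pairs that overlap on opposite strands. The function names, the min_codons cutoff, and the 15-base demo sequence are all invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Naive six-frame ORF scan: first ATG up to the next in-frame stop.

    Returns (strand, start, end) tuples with 0-based, half-open coordinates
    expressed on the forward strand. min_codons counts the stop codon too.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    end = i + 3
                    if (end - start) // 3 >= min_codons:
                        if strand == "+":
                            orfs.append((strand, start, end))
                        else:
                            # Map reverse-strand coordinates back to the forward strand.
                            orfs.append((strand, n - end, n - start))
                    start = None
    return orfs


def opposite_strand_overlaps(orfs):
    """Pair up plus-strand and minus-strand ORFs whose intervals overlap."""
    plus = [o for o in orfs if o[0] == "+"]
    minus = [o for o in orfs if o[0] == "-"]
    return [(p, m) for p in plus for m in minus if p[1] < m[2] and m[1] < p[2]]


# Made-up 15-base sequence built so that both strands carry an ORF.
demo = "ATGTTAGGGCATTAA"
print(find_orfs(demo))
print(opposite_strand_overlaps(find_orfs(demo)))
```

In the demo, the plus-strand ORF spans the whole sequence and a shorter minus-strand ORF sits nested inside it, which is the same bidirectional-overlap pattern (in miniature) that shows up all over the Rothia annotation.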

If massive wholesale transcription of antisense strands occurs in E. coli, as we know it does, certainly it's no stretch to imagine it occurring in Rothia mucilaginosa. And if it is occurring in Rothia, which is (incidentally) an opportunistic pathogen, how much harder can it be to imagine it occurring in another well-known pathogenic member of the order Actinomycetales, Mycobacterium leprae? We know already that upwards of 40% of M. leprae pseudogenes are transcribed. Antisense transcripts could well be playing a role in silencing certain essential genes when attempts are made to grow the organism in defined media. Forward transcripts could be producing nonsense or partial-nonsense/truncated proteins that are excreted as toxins or find their way to the cell wall as surface antigens. Any number of scenarios might be possible.

Some very low-hanging fruit is available to microbiologists who are willing to accept the obvious. Instead of wishing away pseudogenes or imagining them to be useless baggage, we should be looking at them as potential determinants of pathogenicity. We should consider their possible roles in modulating protein expression patterns. We should attempt to learn why they're conserved; what role(s) they're playing in cell physiology. The last thing in the world we should be doing is calling them "junk DNA."