Thursday, April 17, 2014

The Pathogen's Playbook

When comparing pathogenic bacteria with non-pathogenic species of the same genus or family, we often find a common pattern. In the pathogen:
  • The genome is often reduced in size (particularly in endosymbionts, but also in others).
  • The genome is often shifted in the direction of higher A+T content (lower G+C content).
  • Many pseudogenes are present.
  • Often, the pathogen is a slow-grower in pure culture (if it can be cultured at all).
  • The pathogen has special nutritional needs.
An extreme case that illustrates all of these points is Mycobacterium leprae, the leprosy bacterium. It has fewer genes than its cousin, M. tuberculosis (which in turn has fewer genes than non-pathogenic Mycobacteria); its genomic G+C content is 8% lower than most other Mycobacteria; it contains over 1100 pseudogenes; it has a doubling time of two weeks; and it cannot be grown in pure culture (presumably because of fastidious nutritional requirements).

M. tuberculosis can be grown in the laboratory, but it and its M. avium-group cousins are very slow growers, taking anywhere from four days to two weeks to develop colonies on solid media.

It seems likely that some pathogens (certainly members of the Mycobacteria, but also the tiny Tenericutes, e.g. Mycoplasma, among many others) have evolved slow growth as a survival strategy. Certainly, organisms that have evolved an intracellular parasitic lifestyle need to be careful not to out-grow the host, if the relationship is to be a long one.

All of the factors listed above suggest a certain scenario, a "pathogen's playbook," if you will, which can be summarized as follows:
  1. The organism invades a warm-blooded host.
  2. Phagocytes (white blood cells) ingest the organism.
  3. The phagocytes undergo a respiratory burst, flooding the microbe(s) with peroxides, hypochlorites, nitric oxide, and other noxious oxidants.
  4. The flood of reactive oxygen species triggers an SOS response in the microbe.
  5. The microbe's DNA undergoes massive damage. 
  6. Any surviving microbial cells are now pathogenic.
The SOS response is known to trigger mutagenesis. In Mycobacterium, for example, peroxides (as well as UV light) can induce up-regulation of dnaE, an error-prone polymerase. Since Mycobacteria are known to lack a MutS mismatch repair system, SOS-induced errors in DNA replication will almost certainly include uncorrected frameshift errors leading to the creation of pseudogenes. But that's a good thing, if you're a Mycobacterium interested in forming a long-term relationship with a host cell. The loss of certain genes (as long as they're not essential!) will likely slow your metabolism and make you dependent on host nutrients. Truly non-essential pseudogenes will simply be jettisoned over time, reducing the footprint of the remaining genome. Any pseudogenes that survive will likely have done so because they're now playing an essential gene-silencing role.

Let's expand on that last part. Take the dnaE gene, for example. M. leprae has two copies of this gene, only one of which is functional. Suppose both copies were functional at the time of the massive pseudogenization event that converted so many of M. leprae's genes to pseudogenes 9 to 20 million years ago. After the pseudogenization event (probably a phagocytic respiratory burst), one copy of dnaE became a pseudogene. But continued transcription of the pseudogene in the forward direction means the pseudo-mRNA competes with the "normal" dnaE transcript for ribosomal attention. Transcription of the antisense strand of the disabled gene would, of course, create an antisense RNA that could silence the normal transcript by double-stranded interaction. Either way, once the pseudogenization event is over, dnaE expression is attenuated, as it should be once pathogenicity has been established.

Is it realistic to think M. leprae transcribes antisense strands of its pseudogenes? Given that E. coli has been found to contain ~1000 antisense transcripts, and given that we know M. leprae transcribes many of its pseudogenes, I think the answer has to be yes.

So the pattern is: infection, respiratory burst, massive mutation, silencing of many genes, and (oh by the way) creation of many brand-new gene products, some of them no doubt quite toxic to the host, as the result of gene truncation and pseudogene expression.

Tuesday, April 15, 2014

Coming to Grips with Pseudogenes

The term pseudogene was coined in 1977, when Jacq et al. discovered a version of the gene coding for 5S rRNA in the African clawed frog (Xenopus laevis) that was truncated yet retained homology with the active gene. Subsequent work has shown that in higher life forms, pseudogenes (genes that have been inactivated through one event or another) are almost as numerous as coding genes, with (for example) the human genome containing 10,000 or more pseudogenes. (A more recent estimate puts the number at 20,000.) Many of these pseudogenes are highly conserved. Looking at pseudogenes in the mouse and human, Svensson et al. found that of a group of 74 such genes that occur in both species, 30 appear to have been conserved since before the evolutionary divergence of mice and humans.

In higher organisms, pseudogenes are sometimes transcribed into RNA, with the RNA filling a regulatory function. For example, Korneev et al. found that simultaneous transcription of neural nitric oxide synthase (nNOS) and the antisense strand of a homologous pseudogene in the same neurons of Lymnaea stagnalis (a snail) leads to the formation of a duplex between the two strands and a reduction in nNOS translation. Further examples can be found in Pink et al. (2007), "Pseudogenes: Pseudo-functional or key regulators in health and disease?"

In bacteria, pseudogenes are somewhat rarer than in eukaryotes, but exist in significant numbers in many pathogens (including many species of Mycobacterium, Shigella, Brucella, Bordetella, and others). A study by Kuo and Ochman (2010) found that pseudogenes are swiftly eliminated from Salmonella. They describe "evidence of a strong deletional bias in Salmonella, such that genes that are not maintained by selection are rapidly inactivated and eliminated by mutational events." In fact, Kuo and Ochman found that pseudogenes are eliminated more rapidly than could be explained by the so-called neutral theory of evolution, indicating that the continued presence of pseudogenes exacts a high cost to the cell.

And yet, many bacteria with slow-evolving genomes (such as Mycobacterium species) retain their pseudogenes with high fidelity across evolutionary timespans. The most celebrated "pseudogene hoarder" of all time, M. leprae (the leprosy bacterium), appears to have acquired its 1000+ pseudogenes 9 to 20 million years ago. Meanwhile, the half-life of pseudogenes in Buchnera aphidicola was measured at 23.9 million years, a staggering number.
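To get a feel for what a 23.9-million-year half-life implies, here is a minimal sketch. It assumes pseudogene loss follows simple first-order (exponential) decay, which is an illustrative assumption rather than anything the Buchnera analysis is known to claim. The function name is hypothetical.

```python
def fraction_remaining(elapsed_my: float, half_life_my: float = 23.9) -> float:
    """Fraction of pseudogenes expected to survive after `elapsed_my` million years.

    Assumes plain exponential decay (an assumption made for illustration).
    The default half-life is the Buchnera aphidicola figure quoted in the text.
    """
    if half_life_my <= 0:
        raise ValueError("half-life must be positive")
    return 0.5 ** (elapsed_my / half_life_my)


# Time points taken from the text: the 9-to-20-million-year window for
# M. leprae's pseudogenization event, plus one full half-life.
for t in (9, 20, 23.9):
    print(f"{t:>5} My: {fraction_remaining(t):.2f} of pseudogenes remaining")
```

Under that assumption, a large share of pseudogenes would still be present after 10 to 20 million years, which is the point of the comparison with M. leprae.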

So on the one hand, we have work by Kuo and Ochman showing that pseudogenes in bacteria are rapidly eliminated, and on the other hand we have some bacterial lineages in which it seems pseudogenes are not only conserved but actively repaired over periods of tens of millions of years!

In Chapter 5 of Brucella: Molecular Microbiology and Genomics (2012, Caister Academic Press), Garcia-Lobo et al. describe their work with RNA sequence data from the bacterium Brucella abortus:
Twenty-four of the genes selected from the RNAseq data were annotated as pseudogenes in the B. abortus 2308 genome, which was considered a rather unexpected finding. By comparison with other Brucella genomes we can reduce the list of highly expressed pseudogenes to 16 (often, truncated parts of a gene are annotated as different pseudogenes especially in B. abortus 2308). This seems contradictory since high transcription of these genes, which should be not able to translate into functional proteins, will be contrary to biological economy. The high levels of transcription observed for these genes strongly suggest that they could be active genes and their products may perform functions unreported in metabolic reconstructions. High pseudogene expression may also indicate that these are very recently produced pseudogenes that did not turned down transcription yet by accumulation of mutations in their promoter or control regions. It is also possible that these pseudogenes may contain sequencing errors and they are indeed active genes.
It's almost comically obvious from this passage that the authors are troubled by their own finding that some pseudogenes in Brucella are highly transcribed. They try explaining it away by saying it could all be "sequencing errors."

A more parsimonious view is that pseudogenes that haven't been eliminated from a genome are, in fact, permanent, legitimate fixtures of the landscape, in microbes just as in higher life forms. And as in higher life forms, pseudogenes in microbes are probably serving perfectly understandable regulatory functions (when they're not actually translated into protein products).

Kuo and Ochman have convincingly shown that useless pseudogenes are quickly eliminated. It follows that any pseudogenes that aren't swiftly eliminated are, in fact, serving a biological purpose, or else they wouldn't be there. This line of reasoning is already well accepted by researchers who study eukaryotic life forms. Those who study bacteria need to take a hint from their up-the-food-chain colleagues.

What could the hundreds of pseudogenes in Bordetella pertussis (or the 1000+ pseudogenes in M. leprae) be doing? First we need to get used to the idea that in bacteria, virtually all genes are transcribed, in both directions. It's been four years since Dornenburg et al. reported finding ~1000 antisense transcripts in E. coli, but no one seems to have gotten the memo.

A section of Rothia mucilaginosa genome (top) and a corresponding portion of Mycobacterium leprae (bottom). The yellow gene, in each case, is DnaE (error-prone polymerase). Pink bands indicate areas of 65% or more homology between the two organisms. The small-diameter silver genes in the lower panel are M. leprae pseudogenes. "Normal genes" are shown in green. Notice that R. mucilaginosa has open reading frames on both strands of DNA, with many bidirectionally overlapping genes.

A look at the genome of the bacterium Rothia mucilaginosa DY18 shows that a very large proportion of "normal genes" have open reading frames on the opposite strand (see illustration). Bidirectional overlapping genes run throughout the Rothia genome. A massive annotation error? Maybe. Or maybe both strands are transcribed.
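Checking a gene for an open reading frame on the opposite strand is easy to sketch. The code below is a simplified illustration, not the method used to produce the genome view: it only recognizes ATG as a start codon (bacteria also use GTG and TTG), and all function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the antisense strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_orf(seq: str) -> str:
    """Longest ATG...stop stretch in any of the three frames of `seq`.

    The stop codon is included in the result. Returns '' if no ORF is found.
    Simplification: only ATG is treated as a start codon.
    """
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


def longest_antisense_orf(sense_seq: str) -> str:
    """Longest ORF found on the strand opposite to `sense_seq`."""
    return longest_orf(reverse_complement(sense_seq))


# Toy sense-strand sequence whose antisense strand carries a short ORF.
print(longest_antisense_orf("TTAGGGTTTCAT"))
```

An ORF on the antisense strand is not evidence of transcription, of course; it only says that a transcript from that strand, if made, could be read by a ribosome.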

If massive wholesale transcription of antisense strands occurs in E. coli, as we know it does, certainly it's no stretch to imagine it occurring in Rothia mucilaginosa. And if it is occurring in Rothia, which is (incidentally) an opportunistic pathogen, how much harder can it be to imagine it occurring in another well-known pathogenic member of the order Actinomycetales, Mycobacterium leprae? We know already that upwards of 40% of M. leprae pseudogenes are transcribed. Antisense transcripts could well be playing a role in silencing certain essential genes when attempts are made to grow the organism in defined media. Forward transcripts could be producing nonsense or partial-nonsense/truncated proteins that are excreted as toxins or find their way to the cell wall as surface antigens. Any number of scenarios might be possible.

Some very low-hanging fruit is available to microbiologists who are willing to accept the obvious. Instead of wishing away pseudogenes or imagining them to be useless baggage, we should be looking at them as potential determinants of pathogenicity. We should consider their possible roles in modulating protein expression patterns. We should attempt to learn why they're conserved; what role(s) they're playing in cell physiology. The last thing in the world we should be doing is calling them "junk DNA."

Monday, April 14, 2014

Pseudogenes Are Not Junk DNA

In 2007, a PLoS ONE paper by Ahmed et al. proposed a phylogeny for Mycobacteria in which M. leprae (the leprosy organism) is shown as a relatively recent branch off a very long tree, with M. tuberculosis depicted (in a decidedly fanciful schematic) as being of relatively recent provenance (35,000 years), diverging from M. canettii (a recently discovered cousin of tuberculosis) 3 million years ago.

The rather fanciful phylogenetic picture of Mycobacterium evolution presented by Ahmed et al. (2007).

The only trouble with this picture is that we know it's wrong. More exacting work has shown that M. tuberculosis is at least 3 million years old, and one paper estimates that the common ancestor of TB and leprosy may go back 66 million years. If the latter figure sounds dubious, consider that until recently, M. leprae wasn't thought to have any sister strains that could aid with dating the organism phylogenetically. But in 2008, the situation changed dramatically when it was realized that in Mexico, a distinct form of leprosy known as "diffuse lepromatous leprosy" (DLL) was actually due to a genetically distinct variant of Mycobacterium known as M. lepromatosis. When the genome for the latter organism was analyzed, it was found to contain the same stupendous assortment of pseudogenes contained in M. leprae, but detailed analysis of polymorphisms in the genomes of the two strains led to a surprising finding: Divergence of the strains appears to have occurred around 10 million years ago.

Another team found that the massive "pseudogenization event" that caused M. leprae (and its cousin, M. lepromatosis) to become saddled with a record number (1,116) of pseudogenes probably occurred on the order of 20 million years ago.

The age and stability of the pseudogenes in M. leprae can only be described as stunning. Conventional evolutionary dogma says that pseudogenes will inevitably be degraded and lost over time. Surely M. leprae can't be conserving and repairing pseudogenes over 10-million-year-long timespans? Pseudogenes are discardable junk.

Or are they?

An analysis of Buchnera aphidicola (the tiny Enterobacterial endosymbiont of the pea aphid) put the half-life of pseudogenes in that organism at 23.9 million years.

Human DNA reportedly contains over 12,000 pseudogenes. Some of these pseudogenes are quite old. Parallel nonsense mutations caused a pseudogenization of the uricase gene in apes during the early Miocene era (17 million years ago). We still carry the pseudogene in question—and it gets transcribed. According to a report by James T. Kratzer and colleagues at the University of Texas, Austin:
Despite being nonfunctional, cDNA sequencing confirmed that uricase mRNA is present in human liver cells and that these transcripts have two premature stop codons.
The inevitable conclusion is that pseudogenes are not, and should not be considered by default, "junk DNA." To the contrary, the default assumption should be that pseudogenes are ancient and conserved—because in most cases, that's exactly what they are.

What causes genes to "go pseudo"? Why are they conserved? What are they really doing? I'll tackle some of those questions in a followup post. Stay tuned.

Sunday, April 13, 2014

Whooping Cough Genomics

Pertussis, also known as whooping cough, is a highly contagious respiratory infection caused by Bordetella pertussis, a small aerobic bacterium that secretes numerous toxins capable of disrupting a normal immune response. The disease is rarely fatal but leaves victims with a nasty cough that can last weeks. In 2012, in the U.S., some 48,277 cases of pertussis were reported to the CDC. Of those cases, only 20 were fatal. By contrast, 28 Americans were killed by lightning the same year.

Bordetella pertussis
Unlike tuberculosis (which has been with us for 3 million years), Bordetella shows evidence of being a fairly new (and still rapidly evolving) pathogen, although in this case "fairly new" could still mean 700,000 years.

The complete DNA sequence of B. pertussis has been available for several years. It shows a moderate-size genome (of 4 million base pairs) encoding 3,447 genes, with a substantial number (360) of pseudogenes. The latter represent genes that have (by one means or another) been inactivated, whether through the appearance of premature stop codons in the gene, loss of a promoter region, random deletions, or what have you.

What makes Bordetella's pseudogenes interesting is that they're in remarkably good shape, as pseudogenes go. Usually, once a gene gets inactivated (goes pseudo), it begins to accumulate random point mutations, deletions, insertions, etc. at a substantial rate. In other words it deteriorates, since (supposedly) it's no longer under selection pressure. But when Australian researchers looked at 358 pseudogenes in B. pertussis Tohama I strain, they were shocked to find that the rate of nucleotide polymorphisms (i.e., changes to individual base-pairs in the DNA) was actually lower in pseudogenes than in regular genes (4.7E-5 per site versus 5.1E-5). That's exactly the opposite of what's expected. The researchers commented, somewhat laconically: "This suggests that most pseudogenes in B. pertussis were formed in the recent past and are yet to accumulate more mutations than functional genes."

What other explanation is there? Well, the most obvious alternative explanation is that the genes are still under selection pressure, even though they're turned off. How can that be? I can think of any number of scenarios; perhaps that'll be a future blog post. Suffice it for now to say, ribosomes are not totally unforgiving of missing stop codons (read up on tmRNA) nor are they unforgiving, in all cases, of frameshifts (read about programmed frameshifts), and if an open reading frame should appear on a pseudogene's antisense strand, you now have an RNA silencer (potentially) for the remaining good copy or copies of the gene, with attendant gene-modulation possibilities.

It's worth pointing out that pseudogenes in M. leprae (the leprosy bacterium) are not only conserved and ancient but continue to show strong homology to working orthologues in M. tuberculosis (and even more distantly related organisms such as Gordonia, Corynebacterium, and Nocardia) after millions of years. More of which, in a later post.

For now, I thought it might be worth looking at the base composition of B. pertussis pseudogenes to see if they're riddled with frameshift errors (as is the case with M. leprae's pseudogenes). When I analyzed all 1,125,521 codons for all normal (not pseudo) genes in B. pertussis Tohama I strain, the resulting "paintball diagram" of base composition came up looking like this:
Paintball diagram for normal genes in B. pertussis Tohama I. Red dots are for codon base one, gold represents the composition at codon base two, blue is "wobble" (third) base composition. Every dot represents statistics for one gene (n=3447). See text for discussion.

Here, we're looking at purine (A+G) content versus G+C content for each base position in the codons. Every dot represents a gene's worth of data. Not unexpectedly, the most extreme G+C values occur in the third ("wobble") base. Codon base one (red dots) is purine-rich, centering on y=0.58. This is typical of most codons in most genes, in most organisms. Notice the "breakaway cloud" of gold points underneath the main gold cloud (at y<0.4). These points represent genes in which the second codon base is mostly a pyrimidine (C or T). Codons with a pyrimidine in base two tend to code for nonpolar amino acids. Thus, the breakaway cloud of gold points represents membrane-associated proteins. In this case, we're looking at about 558 genes falling in that category.
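Each dot in a diagram like this boils down to two numbers per codon position: the purine (A+G) fraction and the G+C fraction. A minimal sketch of that per-gene calculation follows. The function name and the dictionary keys are hypothetical, and the input is assumed to be a coding sequence that starts in frame.

```python
def codon_position_stats(cds: str) -> dict:
    """Purine (A+G) and G+C fractions at codon positions 1, 2 and 3.

    `cds` is assumed to begin at codon base one. Trailing bases that do not
    complete a codon are ignored. Ambiguous bases such as N count toward the
    denominator but toward neither fraction.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    if usable == 0:
        raise ValueError("sequence must contain at least one full codon")
    stats = {}
    for pos in range(3):
        bases = cds[pos:usable:3]  # every third base, starting at this position
        n = len(bases)
        stats[f"AG{pos + 1}"] = sum(b in "AG" for b in bases) / n
        stats[f"GC{pos + 1}"] = sum(b in "GC" for b in bases) / n
    return stats


# Toy three-codon example: ATG GCC GAA
for key, value in codon_position_stats("ATGGCCGAA").items():
    print(f"{key}: {value:.2f}")
```

Plotting AG1 against GC1 (red), AG2 against GC2 (gold), and AG3 against GC3 (blue) for every gene would give the three clouds described above.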

Now look at the paintball diagram for the organism's 360 pseudogenes:

Base composition for "codons" in 360 pseudogenes of B. pertussis Tohama I. In this graph, as in the one above, dots are rendered with an opacity of 60% (so that overlapping points are less likely to obscure each other). See text for discussion.

In this case, there's a considerable amount of random statistical splay, but some of that is due simply to the fact that pseudogenes are a good deal shorter than normal genes, giving rise to more noise in the signal. (In this case, the average length of a pseudogene is 482 bases, vs. 982 for the 3,447 "normal" genes.) Even with considerable noise, though, it's apparent that the dot clusters tend to center on different parts of the graph, corresponding to the expected locations for normal genes. (Contrast this with the situation in M. leprae, where pseudogenes are riddled with frameshifts, rendering the concept of "codon base position" moot. Refer to the second paintball graph on this page.) Thus, we can say with some confidence that frameshifts are not so rampant in B. pertussis pseudogenes as to have rendered the concept of codons irrelevant. In fact, compared to M. leprae, pseudogenes in B. pertussis are comparatively unaffected by frameshifts. This tends to support the view of the Australian researchers (mentioned earlier) that pseudogenes in B. pertussis have not had enough time to accumulate very many mutations. But it can also be hypothesized that B. pertussis has had plenty of time (700,000 years, in fact) in which to accumulate mutations in its pseudogenes, yet has not done so. The evidence suggests that if anything, Bordetella repairs pseudogenes even more faithfully than regular genes.
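The link between gene length and scatter can be made concrete with the standard error of a proportion. This is a rough sketch that treats each codon as an independent draw (a simplifying assumption). The lengths 482 and 982 and the typical AG1 value of 0.58 come from the text; the function name is hypothetical.

```python
import math


def proportion_standard_error(p: float, gene_length_bases: int) -> float:
    """Standard error of a per-position base fraction for a gene of given length.

    Treats each codon as an independent binomial trial, which is only an
    approximation for real sequences.
    """
    n_codons = gene_length_bases // 3
    if n_codons <= 0:
        raise ValueError("gene must contain at least one codon")
    return math.sqrt(p * (1 - p) / n_codons)


se_pseudo = proportion_standard_error(0.58, 482)   # average pseudogene length
se_normal = proportion_standard_error(0.58, 982)   # average normal gene length
print(f"pseudogenes: +/- {se_pseudo:.3f}")
print(f"normal genes: +/- {se_normal:.3f}")
print(f"ratio: {se_pseudo / se_normal:.2f}")
```

By this estimate, halving the gene length widens the expected scatter by roughly the square root of two, so some of the extra splay in the pseudogene plot is expected from length alone.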

At this point it might be relevant to interject that while M. leprae (like other members of the Mycobacteria) lacks the MutS/MutL mismatch repair system, Bordetella does, in fact, have a MutS/MutL mismatch repair system, and this may explain the relative paucity of frameshift errors in Bordetella pseudogenes. But it also implies (rather queerly) that Bordetella goes out of its way to repair its pseudogenes.

Interestingly, 234 out of 360 pseudogenes have an AG1 (purine, base one) content greater than 55%, which means they're probably still "in frame." Of these 234, some 69 (30%) have AG2 less than 40%, meaning they're most likely genes for membrane-associated proteins. If we look at the 2,456 normal genes that have AG1 greater than 55%, only 398 (16%) are putative membrane-associated proteins (with AG2 less than 40%). Bottom line: Pseudogenes for putative membrane-associated proteins are twice as likely to still be in-frame. While this could be a statistical fluke, it could also be that membrane proteins are somehow "spared" preferentially when it comes to leaving pseudogenes translatable. To put it differently: Pseudogenes for non-membrane-associated proteins are less likely to remain in-frame. This makes sense, in that much of Bordetella's pathogenicity can be ascribed to proteins that make up cell-surface antigens or that transport toxins to the outside world. Some of the toxic surface proteins may, in fact, be nonsense (or partial-nonsense) proteins, the products of pseudogenes.
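One way to probe the "statistical fluke" possibility is a two-proportion z-test on the counts quoted above (69 of 234 versus 398 of 2,456). The sketch below uses the normal approximation and assumes the genes are independent observations. That is a simplification, and the test says nothing about whether the AG1/AG2 cutoffs classify genes correctly. The function name is hypothetical.

```python
import math


def two_proportion_z(hits_a: int, n_a: int, hits_b: int, n_b: int):
    """Pooled two-proportion z-test (normal approximation).

    Returns (proportion_a, proportion_b, z, two_sided_p_value).
    """
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return p_a, p_b, z, p_value


# Counts from the text: in-frame pseudogenes vs. in-frame normal genes,
# each split by whether AG2 is below 40%.
p_pseudo, p_normal, z, p_value = two_proportion_z(69, 234, 398, 2456)
print(f"pseudogenes:  {p_pseudo:.1%} putative membrane-associated")
print(f"normal genes: {p_normal:.1%} putative membrane-associated")
print(f"z = {z:.2f}, two-sided p = {p_value:.2g}")
```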

Saturday, April 12, 2014

The Most Deadly Pathogen of All Time

Few bacterial groups have had as great an impact on humankind as the members of the genus Mycobacterium, which encompasses the causative agents of (among other ailments) leprosy, tuberculosis, and Crohn's Disease in humans, and Johne's Disease in farm animals. Leprosy is known from antiquity and continues to strike 200,000 or more people each year worldwide. Tuberculosis, which affects (subclinically) one in three persons worldwide, continues to kill well over a million people a year and has caused a billion deaths in the last two centuries, more than all the wars and genocides of history combined.

The association of M. avium subspecies paratuberculosis (MAP) with Crohn's Disease is still considered controversial by some, but if in fact Koch's criteria have already been met, MAP adds millions more to the toll of human misery caused by Mycobacterial infection.

Colonies of Mycobacterium have a characteristically waxy consistency. Shown here: colonies of M. tuberculosis.
What are these bacteria? Where did they come from? How have they managed to be so successful in causing death and disease?

The prefix "myco" means fungal, but these are not fungi we're talking about. Mycobacteria are soil- and water-borne bacteria that produce an extraordinarily complex cell wall containing not only the usual (for bacteria) peptidoglycans but also:
  • Arabinogalactan
  • Mycolic acids
  • Lipoarabinomannan
  • Extractable lipids including glycolipids, phenolic glycolipids (PGL), glycopeptidolipids (GPL), waxes, acylated trehaloses, and sulfolipids
In contrast to most cell-wall fatty acids (which contain carbon-carbon double bonds susceptible to oxidation), mycolic acids are cyclopropanated and resistant to oxidation, not to mention extremely hydrophobic. The Mycobacterial cell wall thus presents a formidable physical barrier to antibiotics, and it was with considerable dismay that physicians realized, early on, that penicillin would have no benefit in treating tuberculosis. When an antibiotic that could attack M. tuberculosis was finally discovered (streptomycin), it resulted in a 1952 Nobel Prize for Ukrainian American Selman Waksman (although in reality the discovery was made by a post-doc in Waksman's lab, Albert Schatz).

The Mycobacterial cell wall is famously complex, but it also has the curious habit of disappearing entirely, under nutrient-starvation conditions. Like many other bacteria, Mycobacteria can, under certain conditions, shed their cell walls and take on a so-called L-form morphology, in which cells (bounded only by a thin and osmotically vulnerable cell membrane containing just 7% of the usual amount of peptidoglycan) exist as protoplasts which are nonetheless able to reproduce and thrive, producing distinctive colonies on solid media and giving rise, in vivo, to tiny spherules that are often confused with Russell bodies in cancer biopsies. The medical significance of the mysterious L-forms is still debated, after more than 100 years.

The very small red filaments here are cells of Mycobacterium avium living inside lymph-node macrophages in an immunocompromised individual.
One thing most Mycobacterial species have in common is slow growth. Cultures of M. tuberculosis and MAP often require weeks to develop, and M. leprae (which can't be grown in pure culture at all; it can be lab-grown only in the footpads of mice or armadillos) has the longest known generation time of any bacterium, at two weeks.

Ironically, pathogenic strains of Mycobacterium seem to have evolved slow growth as a survival strategy. (This certainly makes them hard to treat with antibiotics. Most antibiotics are effective only in disrupting the growth of actively growing cells.) The lack of DNA mismatch repair enzyme systems (MutS, MutL, and MutH) may be an outcome of the fact that slow DNA replication in these organisms, in and of itself, ensures reasonably high-fidelity replication. On the other hand, lack of a mismatch repair system could be why pseudogenes (genes inactivated due to frameshifts or other errors) abound in Mycobacterial species. M. leprae famously has over 1000 pseudogenes; M. smegmatis strain JS623 harbors over 200 pseudogenes; M. canettii (strain CIPT 140010059) and M. rhodesiae (strain NBB3) both have over 100. (For a good review of Mycobacterial DNA repair systems, see this 2011 paper.)

Unlike Yersinia pestis, the plague organism, which may be less than 20,000 years old (very young in bacterial species time), M. tuberculosis, as a species, appears to be at least 3 million years old, although this number should probably be considered a minimum age, subject to upward revision. (The species was thought to be only 35,000 years old as late as 2002, before a more detailed genetic analysis established the 3-million-year estimate of its age. The numbers should be viewed with caution, however, since they're based on mutation-rate assumptions derived from data for E. coli.)

The question of how M. tuberculosis has managed to achieve its distinctive pathogenic profile is a matter of active ongoing research, and likely will be for a long time. A recent review article reminds us: "The [complete genome] sequence of the pathogen Mycobacterium tuberculosis strain H37Rv has been available for over a decade, but the biology of the pathogen remains poorly understood."

Miscellaneous Links
List of famous T.B. victims—Brontë family, Balzac, Kafka, Thoreau, Kant, Chekhov, Orwell, Schrödinger, Vivien Leigh, Arline Feynman (wife of the famous physicist), the list goes on.
Tuberculosis in Literature and the Arts
The T.B. Blues (Jimmie Rodgers, 1931) This song, famously covered by Leon Redbone (among others), was written by Rodgers after he contracted the disease at age 27. He died eight years later.
World Health Organization TB Stats (landing page)
The Tuberculosis Systems Biology Program

Wednesday, April 09, 2014

Are dead genes still alive in the leprosy bacillus?

The genome of the leprosy bacterium (Mycobacterium leprae) stands as a remarkable example of DNA in an apparent state of massive, wholesale breakdown. Of the organism's 2720 genes, only 1604 appear to be functional, while 1116 are pseudogenes, which is to say genes that have been "turned off" and left for dead.

Genes can become pseudogenes in any number of ways, including loss of a start codon, loss of promoter regions (or degraded Shine-Dalgarno signals), random insertions and deletions, mutations that cause spurious stop codons, and so on. Once a gene gets "turned off," assuming loss of the gene in question isn't fatal, the gene typically undergoes a period of degradation (leading to its eventual loss from the genome), but that's not exactly what we see in the leprosy bacterium. When leprosy germs from medieval skeletons were sampled and their genomes sequenced, researchers found that pseudogenes in M. leprae haven't changed very much in the past thousand years or so. Not only does M. leprae tend to hold onto its pseudogenes, it actively transcribes upwards of 40% of them. Probably not all of the transcripts result in expressed proteins (many lack a start codon!), but some no doubt do get translated into proteins. Let's put it this way: It would be extremely unusual for an organism to conserve this many pseudogenes if none of them was doing anything useful.

This view of a segment of the two genomes shows how a region of around 80,000 base pairs in M. tuberculosis maps to a similar 68,000-base-pair region of M. leprae. Notice that in the lowermost panel (representing M. leprae), many genes are shown as shrunken silver segments instead of fat green cylinders. The smaller grey/silver segments are pseudogenes.

To get a better idea of what's going on here, I downloaded the DNA sequences of M. leprae's 1604 "normal" genes as well as the 1116 pseudogenes. In analyzing the codons for these genes, I looked for signs of genes that were still in the normal reading frame. One way to detect this is by measuring the purine content at the various base positions in a gene's codons. In a typical protein-coding gene, around 60% of codons begin with A or G (adenine or guanine). This positional bias will, of course, be lost in a gene that has undergone frameshift mutations. Among M. leprae's 1116 pseudogenes, I found 269 in which codons showed an average AG1 percentage (A+G content, codon base one) of 55% or more. These are pseudogenes that appear to still be mostly "in frame."
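A toy illustration of why a frameshift erases the AG1 signal follows. The sequence is synthetic (a repeated GCT codon, chosen so the effect is easy to see) and the function names are hypothetical; only the 55% cutoff comes from the text.

```python
def ag1(seq: str) -> float:
    """Fraction of codons whose first base is a purine (A or G)."""
    seq = seq.upper()
    first_bases = seq[0:len(seq) - len(seq) % 3:3]
    if not first_bases:
        raise ValueError("sequence must contain at least one full codon")
    return sum(b in "AG" for b in first_bases) / len(first_bases)


def looks_in_frame(seq: str, threshold: float = 0.55) -> bool:
    """Apply the AG1 >= 55% rule of thumb used in the text."""
    return ag1(seq) >= threshold


# Synthetic "gene": 100 codons that all start with a purine.
gene = "GCT" * 100

# Delete one base halfway along; every codon downstream is now read out of frame.
shifted = gene[:150] + gene[151:]

print(f"intact:       AG1 = {ag1(gene):.2f}, in frame? {looks_in_frame(gene)}")
print(f"frameshifted: AG1 = {ag1(shifted):.2f}, in frame? {looks_in_frame(shifted)}")
```

In a real gene the drop is less dramatic, since not every codon starts with a purine to begin with, but the direction of the effect is the same: the further upstream the frameshift, the closer AG1 falls toward the background purine fraction.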

Things get a lot more interesting where putative membrane proteins are concerned. In a previous post, I showed that in some genes, the second codon base is pyrimidine-rich (i.e., predominantly C or T: cytosine or thymine); these genes encode proteins with a high percentage of nonpolar amino acids. Bottom line, if a gene's codons are mostly T or C in the second position, that gene most likely encodes a membrane-associated protein. (See my previous post for some data.) This is true for all organisms (viruses, cells) and organellar genes, too, by the way, not just M. leprae. It's a generic feature of the genetic code.

When I segregated M. leprae pseudogenes according to whether or not the second codon base was (on average) less than, or more than, 40% purines, I stumbled onto something quite interesting. I found 51 pseudogenes with AG2 less than 40% (meaning, these are probably membrane-associated proteins). Of those, 32 (or 62%) are still "in frame," with AG1 > 55%. By contrast, the majority (78%) of non-membrane pseudogenes (AG2 > 40%) appear to be turned off, with an average AG1 of 51%.

Long story short: Most non-membrane-associated pseudogenes are out-of-frame (and likely dead), whereas 62% of putative membrane-associated pseudogenes appear to be in-frame, and therefore could still be functional (or at least, undead).
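For anyone who wants to try the same sorting on their own data, here is a minimal sketch of the two-threshold classification. The cutoffs (AG1 of 55% or more for "in frame," AG2 under 40% for "membrane-like") come from the text; the function names and the toy sequence are assumptions made for illustration.

```python
def purine_fraction(seq, codon_pos):
    """Fraction of codons with A or G at codon_pos (1, 2 or 3)."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3          # drop a trailing partial codon
    bases = seq[codon_pos - 1:usable:3]
    return sum(b in "AG" for b in bases) / len(bases) if bases else 0.0

def classify(seq, ag1_min=0.55, ag2_max=0.40):
    """Label a sequence using the AG1 and AG2 cutoffs from the post."""
    return {
        "in_frame": purine_fraction(seq, 1) >= ag1_min,
        "membrane_like": purine_fraction(seq, 2) < ag2_max,
    }

# Hypothetical stretch of hydrophobic codons: GTT GCT ATC GTC GCC CTG
toy = "GTTGCTATCGTCGCCCTG"
print(classify(toy))        # read in its original frame
print(classify(toy[1:]))    # same DNA read one base later
```

In this toy case the first call reports in_frame as True and the second reports it as False, which is the signal the AG1 test relies on.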

In looking at stop codons, I found that of the pseudogenes that still had stop codons, the average distance to the first stop codon is only 149 bases (whereas the average pseudogene length is 795 bases). Pseudogenes for putative membrane-associated proteins were shorter overall (as membrane proteins often are; 495 bases instead of 795), but the average distance to the first stop codon was 190 bases, significantly longer than for the other pseudogenes. This suggests some of them are still alive.
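The stop-codon distance can be measured with a simple codon-by-codon scan. The sketch below assumes "distance" means the number of bases that precede the first in-frame stop codon, counting from the first base of the annotated sequence; that definition, and the function name, are assumptions rather than a description of the original script.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_stop_distance(seq):
    """Bases before the first in-frame stop codon, or None if there is none."""
    seq = seq.upper()
    for i in range(0, len(seq) - 2, 3):       # step through whole codons only
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None

print(first_stop_distance("ATGAAATAGGGG"))    # ATG AAA TAG GGG
print(first_stop_distance("ATGAAA"))          # the TGA here is out of frame
```

Averaging this value over a set of pseudogenes (skipping the None results) gives a number comparable to the 149-base and 190-base figures quoted above.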

By now you're probably wondering how the heck a pseudogene can be of any possible use whatsoever when it contains a premature stop codon. The thing we need to ask, though, is why M. leprae tolerates (indeed conserves) so many pseudogenes in the first place. Could it be that the organism has evolved a frameshift-tolerant translation apparatus? Maybe some of the stop codons aren't really stop codons.

We know that a wide variety of organisms (not just viruses, where this phenomenon was first discovered, but bacteria and eukaryotes) have evolved special signals to tell ribosomes to shift in and out of frame by plus or minus one. (See "A Gripping Tale of Ribosomal Frameshifting: Extragenic Suppressors of Frameshift Mutations Spotlight P-Site Realignment," Atkins and Björk, Microbiol. Mol. Biol. Rev. 2009.) Certain tRNAs participate in "quadruplet codon" decoding, making it possible for special frameshift signals to work. The signals usually involve 7-base-long "slippery heptamer" sequences, such as CCCTGAC, right where a stop codon (TGA) appears. In other words, when a stop codon appears inside a slippery heptamer, it's not really a stop codon. Depending on the kinds (and amounts) of tRNAs "on duty," it can be a frameshift signal.

When I looked for CCCTGAC in M. leprae's pseudogenes, I found 16 in-frame occurrences of the sequence in 1116 pseudogenes. (Only 7 occurrences of the hexamer CCCTGA were found, in frame, in M. leprae's "normal" genes.) While this doesn't prove that M. leprae is up to any unusual translation tricks, it's a tantalizing result. Also bear in mind, if M. leprae is indeed up to some unusual tricks, it may very well be using frameshift signals other than (or in addition to) CCCTGAC. The fact that Mycobacterium species lack a MutS/MutL mismatch repair system means M. leprae may have evolved different ways of coping with "slippery repeats."
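A motif count of this kind is easy to reproduce. The sketch below assumes that "in frame" means the TGA inside CCCTGAC falls on a codon boundary of the annotated reading frame (so the motif itself starts at a multiple of three); that reading, along with the function and parameter names, is an assumption made for the example.

```python
def count_in_frame(seq, motif="CCCTGAC", stop_offset=3):
    """Count motif hits whose embedded stop codon sits on a codon boundary."""
    seq = seq.upper()
    count = 0
    start = seq.find(motif)
    while start != -1:
        if (start + stop_offset) % 3 == 0:    # TGA begins at a codon start
            count += 1
        start = seq.find(motif, start + 1)    # allow overlapping hits
    return count

print(count_in_frame("ATGCCCTGACAA"))   # ATG CCC TGA CAA: motif is in frame
print(count_in_frame("ATGACCCTGACA"))   # same motif, shifted by one base
```

Passing motif="CCCTGA" runs the same check for the hexamer.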

Further work will be needed to confirm whether M. leprae indeed translates some of its pseudogenes into proteins. The 32 "high likelihood" pseudogenes that, according to my analysis, might still encode functional (or at least expressed) membrane-associated proteins are shown in the table below. Leave a comment if you have additional thoughts.

M. leprae pseudogenes that have codons with overall AG1 > 55% and AG2 < 40%:

Pseudogene Possible product
MLBr00146 hypothetical protein
MLBr00189 hypothetical protein
MLBr00278 conserved hypothetical protein
MLBr00341 hypothetical protein
MLBr00460 hypothetical protein
MLBr00478 hypothetical protein
MLBr00738 PstA component of phosphate uptake
MLBr00836 hypothetical protein
MLBr00846 ABC transporter
MLBr01054 possible PPE-family protein
MLBr01156 hypothetical protein
MLBr01237 possible cytochrome P450
MLBr01238 probable cytochrome P450
MLBr01400 possible membrane protein
MLBr01414 PGRS-family protein
MLBr01474 hypothetical protein
MLBr01527 dihydrodipicolinate reductase
MLBr01673 conserved hypothetical protein
MLBr01792 probable Na+/H+ exchanger
MLBr01968 PE family protein
MLBr02003 probable ketoacyl reductase
MLBr02101 conserved hypothetical protein
MLBr02150 molybdopterin converting factor subunit 1
MLBr02190 PstA component of phosphate uptake
MLBr02216 dihydrolipoamide dehydrogenase
MLBr02363 19 kDa antigenic lipoprotein
MLBr02477 PE protein
MLBr02484 transcriptional regulator (LysR family)
MLBr02533 PE-family protein
MLBr02656 conserved hypothetical protein
MLBr02674 possible membrane protein

Tuesday, April 08, 2014

Why do so many codons begin with a purine?

With the advent of sites like (where you can download genomes, create synteny graphs, run BLAST searches, and do all sorts of desktop bioinformatics), it's ridiculously easy for someone interested in comparative genomics to . . . well, compare genomes, for one thing. And if you look at enough gene sequences, a couple of things pop out.

One thing that pops out is that most codons, in most genes, begin with a purine (namely A or G: adenine or guanine). Also, codons typically show the greatest GC swing in base number three. These trends can be seen in the chart below, where I show average base composition (by codon position) for three well-studied organisms. For clarity, base-one purines are shown in bold and base-three G and C are shown highlighted in yellow.

Organism      Codon base   A      G      C      T
S. griseus    1            0.166  0.434  0.287  0.112
S. griseus    2            0.224  0.219  0.295  0.261
S. griseus    3            0.037  0.394  0.530  0.038
E. coli       1            0.256  0.343  0.238  0.161
E. coli       2            0.291  0.181  0.222  0.304
E. coli       3            0.186  0.285  0.261  0.265
C. botulinum  1            0.395  0.299  0.094  0.210
C. botulinum  2            0.374  0.136  0.161  0.328
C. botulinum  3            0.442  0.108  0.064  0.383

Streptomyces griseus is a common soil bacterium that happens to have very high genomic G+C content (72.1% overall, although you can see that in base three of codons the G+C content is more like 92%).

E. coli represents a middle-of-the-road organism in terms of G+C content (50.8% overall), while our ugly friend Clostridium botulinum (the soil organism that can ruin your whole day if it finds its way into a can of tomatoes) has very low genomic G+C content (around 28%).

Even though these organisms differ greatly in G+C content, they all illustrate the (universal) trend toward usage of purines (A or G) in the first position of a codon. Something like 59% to 69% of the time (depending on the organism), codons look like Pnn, where P is a purine base and 'n' represents any base. This is true for viruses as well as cellular genomes.
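The 59% to 69% range can be checked directly against the chart. The sketch below assumes the four columns of the chart are A, G, C, T in that order (an inference: it is the ordering under which the base-three G+C values and the overall G+C percentages quoted in the text come out right). The dictionary layout and function names are just for illustration.

```python
# (A, G, C, T) frequencies by codon position, copied from the chart above.
# Column order A, G, C, T is assumed, not stated in the chart itself.
CHART = {
    "S. griseus":   {1: (0.166, 0.434, 0.287, 0.112), 3: (0.037, 0.394, 0.530, 0.038)},
    "E. coli":      {1: (0.256, 0.343, 0.238, 0.161), 3: (0.186, 0.285, 0.261, 0.265)},
    "C. botulinum": {1: (0.395, 0.299, 0.094, 0.210), 3: (0.442, 0.108, 0.064, 0.383)},
}

def purines_at_base1(organism):
    a, g, _, _ = CHART[organism][1]
    return a + g

def gc_at_base3(organism):
    _, g, c, _ = CHART[organism][3]
    return g + c

for name in CHART:
    print(f"{name}: A+G at base 1 = {purines_at_base1(name):.1%}, "
          f"G+C at base 3 = {gc_at_base3(name):.1%}")
```

Under that column assumption, base-one purines come out at roughly 60% for S. griseus and E. coli and about 69% for C. botulinum, while base-three G+C swings from about 92% down to about 17%.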

This pattern is so universal, one wonders why it exists. I think a credible, parsimonious explanation is that when protein-coding genes look like PnnPnnPnn... (etc.) it makes for a crisp reading frame. It's easy to see that a +1 frameshift results in a repeating nPn pattern and a +2 frameshift results in repeats of nnP. These are easily distinguished from Pnn.

There are benefits to a PnnPnnPnn... reading frame. In a previous post, I showed that when most of a gene's codons have a pyrimidine in base two, the resulting protein gets shipped to the cell membrane. (This is a simple consequence of the fact that codons with a pyrimidine in position two tend to code for hydrophobic, lipid-soluble amino acids.) Because a +1 reading-frame shift produces repeats of nPn, the Pnn "default" pattern means that +1 frameshifted gene products, if they occur, won't get shipped to the cell membrane. This is an extremely important outcome, because membrane proteins are, in general, highly transcribed and under strong selective pressure. In addition to specifying antigenic properties and determining phage resistance, membrane proteins make up proton pumps, secretion systems, symporters, kinases, flagellar components, and many other kinds of proteins. They determine the cell's "interface" to the world. They also maintain cell osmolarity and membrane redox potential. Messing with membrane proteins is bound to be risky. Much better to keep frameshifted nonsense proteins away from the membrane.

Fairly strong support for this notion (that Pnn codons provide a crisp reading frame) comes from studies of naturally occurring frameshift signals in DNA. We now know that in many organisms, certain "slippery" DNA signals (usually heptamers, like CCCTGAC) instruct the ribosome to change reading frames. (See, for example, "A Gripping Tale of Ribosomal Frameshifting: Extragenic Suppressors of Frameshift Mutations Spotlight P-Site Realignment," Atkins and Björk, Microbiol. Mol. Biol. Rev. 2009. Also, for fun, be sure to check out some of the papers on quadruplet decoding, which leaves room for alien life forms with 200 amino acids instead of 20.) The "slippery heptamer" frameshift signals that have thus far been identified tend to contain runs of pyrimidines.

Also tending to support the "Pnn = crisp reading frame" notion is the fact that stop codons (TGA, TAA, TAG) look like pPP (where 'p' is a pyrimidine and 'P' is a purine). Again, a crisp distinction.
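The pPP claim is quick to verify. This small sketch rewrites any DNA string in the P/p notation used here (uppercase P for a purine, lowercase p for a pyrimidine); the helper name is made up for the example.

```python
def purine_pattern(seq):
    """Rewrite a DNA string with P for purines (A, G) and p for pyrimidines (C, T)."""
    return "".join("P" if base in "AG" else "p" for base in seq.upper())

for codon in ("TGA", "TAA", "TAG"):
    print(codon, purine_pattern(codon))
```

All three stop codons come out as pPP.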

As for why purines were chosen (and not pyrimidines) to begin the Xxx pattern, again I think a fairly parsimonious answer is available: ATP and GTP are the most abundant nucleoside-triphosphates in vivo. These are the energy sources for nucleic-acid and protein synthesis, respectively.

A prediction: If we run into an alien life form (in the oceans of Europa, say) and it turns out to be the case that UTP (instead of ATP) is the "universal energy molecule" in that life form's cells, then that life form's codons will probably begin with U and form Unn triplets (or Unnn quadruplets, perhaps) a higher-than-average percentage of the time.