blogorrhea: pseudogenes

Sunday, May 25, 2014

Chiggers, Scrub Typhus, and Pseudogenes

If you've ever been bitten by tiny red bugs in the garden, you're familiar with members of the Trombiculidae, a family of mites known variously as berry bugs, harvest mites, red bugs, scrub-itch mites, aoutas, or (in the southern U.S.) "chiggers."

In the United States, the garden-variety chigger is basically harmless, but in much of the world this tiny arthropod comes with a very nasty endosymbiont known as Orientia tsutsugamushi, a bacterium related to the Rickettsia organisms that cause various tick-borne diseases. Throughout much of the Orient, O. tsutsugamushi infections (from chigger bites) cause scrub typhus, which begins with a rash and fever but can progress to a cough, intestinal distress, swelling of the spleen, abnormal liver chemistry, and ultimately pneumonitis, encephalitis, and/or myocarditis and even death. Treatment with doxycycline, azithromycin, or chloramphenicol is usually successful.

The "harvest mite" (chigger) can carry scrub typhus, although U.S. varieties are typically harmless.

The sequenced genome for O. tsutsugamushi is available, and if you go to this link and click on "Click for features" at the bottom of the Dataset Information box you should be able to open up a table that shows the organism as having 1,182 protein-coding genes (quite a small number), plus an additional 1,994 pseudogenes (quite a huge number, by comparison). The "DNA Seqs" links in the table will let you download the DNA sequences of all the organism's genes and pseudogenes.

This is an extremely unusual situation, in that we're dealing with a bacterium that has more pseudogenes (switched-off, defunct, damaged genes) than regular genes, something that can be said of no other bacterium of which I'm aware. The leprosy bacterium (Mycobacterium leprae) is famed for having approximately 1100 pseudogenes and 1604 "normal" genes. Astonishingly, Orientia tsutsugamushi reverses that ratio, and then some.

We don't know for sure how old Orientia tsutsugamushi's pseudogenes are. A standard rule of thumb in biology is that microbial genomes experience one spontaneous mutation per chromosome per 300 generations. But this doesn't really help us decide how old Orientia's pseudogenes are, since the pseudogenes probably didn't arise one by one, independently, through accumulation of random mutations. More than likely, a massive pseudogenization event caused the simultaneous deactivation of a large, unknown number of the organism's genes (of which 1,994 survive today as pseudogenes), much the same as has been hypothesized for M. leprae. We have good reason to believe M. leprae's pseudogenes are at least 9 million years old. It seems likely that the pseudogenes in Orientia are also quite old, or at least not terribly new.

To get more perspective on this, I analyzed Orientia's pseudogenes from a couple of perspectives. What I found, first of all, is that the pseudogenes are shorter than their non-pseudo counterparts, averaging 700 bases in length (versus 879 for normal genes). This is similar to the case with M. leprae (where pseudogenes are 795 bases long and normal genes average 1,098). The average shorter gene length for Orientia vis-a-vis M. leprae is consistent with the fact that this is a greatly gene-reduced low-GC (30.5%) endosymbiont, whereas the Mycobacterium family is (in theory) free-living, with higher GC content (57.8% for M. leprae; 65% or more for tuberculosis species).
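For anyone who wants to reproduce the length comparison, here is a minimal sketch of averaging sequence lengths. It assumes the downloaded gene and pseudogene sequences are in FASTA format; the function names (read_fasta, mean_length) and the toy sequences are made up for illustration and are not the tooling used for the numbers above.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def mean_length(handle):
    """Average sequence length, in bases, over all records in the stream."""
    lengths = [len(seq) for _, seq in read_fasta(handle)]
    return sum(lengths) / len(lengths) if lengths else 0.0

# Toy input standing in for a downloaded file (hypothetical sequences).
demo = StringIO(">gene1\nATGAAA\nTAA\n>gene2\nATGTAA\n")
print(mean_length(demo))
```

Running the same function once on the normal-gene file and once on the pseudogene file would give the two averages being compared.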

I've written before about the fact that in most genes, in most organisms, codons tend to begin with a purine base. Therefore I decided to look at purine usage in base one of normal-gene codons versus pseudogene codons (pseudocodons?), finding the following distribution in normal genes:

Purine usage in base one of codons in Orientia tsutsugamushi (N=346,326 codons). No pseudogenes were included in this graph. See the next graph (below) for pseudogenes.

This graph leaves little doubt that most codons begin with a purine (A or G). The median AG1 value is 63.8%. Very few proteins lie to the left of x=0.50, and frankly some of those are probably misannotated as to reading frame.
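The AG1 statistic itself is simple to compute. The sketch below is one way to do it, assuming each sequence starts at the first base of its first codon. The function name purine_fraction and the three toy genes are hypothetical, chosen only to show the calculation.

```python
from statistics import median

def purine_fraction(seq, position=0):
    """Fraction of codons whose base at `position` (0, 1, or 2) is A or G."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    bases = [seq[i + position] for i in range(0, usable, 3)]
    if not bases:
        return 0.0
    return sum(b in "AG" for b in bases) / len(bases)

# Three made-up genes; position=0 gives AG1 (purines in codon base one).
genes = ["ATGGCAAAATAA", "ATGCCCTTTTAA", "GTGGAAGGTTGA"]
ag1_values = [purine_fraction(g) for g in genes]
print(ag1_values, median(ag1_values))
```

A histogram of ag1_values over every gene in a genome would give a distribution like the one described above.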

The situation with pseudogenes is quite a bit different:

Purines in codon base one (AG1) of pseudogenes (N=462,933 codons) in Orientia.

Here we see that purine usage in codon base one is not as strong (median 58.4%), although clearly, plenty of codons still show AG1 above 60%, implying that many pseudogenes are still "in frame" (not frameshifted).

Interestingly, AG1 is not only higher in normal-gene codons than in pseudogene codons, it's also higher in codons associated with proteins of known function than for "hypothetical protein" genes. Only 41.3% of pseudogene codons have AG1 greater than 60%, whereas 66.7% of "hypothetical protein" genes have AG1 > 60% and 84.3% of genes with functional assignments have codon AG1 greater than 60%. This implies that some genes annotated as hypothetical proteins may, in reality, be pseudogenes that are incorrectly annotated. I'll return to that topic some other time.

Wednesday, May 21, 2014

The Pseudogene Hall of Fame

For "budget reasons" (supposedly), the Joint Genome Institute now requires all users to be registered and approved before they can use the (taxpayer-funded) https://img.jgi.doe.gov site. Fortunately, my registration was approved and I can use the excellent online genomics tools there, one of which produced the following table of Top Ten Organisms by Pseudogene Count.

Organism	Genes	Pseudogenes
Mus musculus C57BL/6J	60745	11398
Rattus norvegicus BN/SsNHsdMCW	38115	9178
Homo sapiens	38612	7186
Candidatus Burkholderia kirkii	6325	4011
Arabidopsis thaliana Columbia	31392	3818
Nostoc azollae 0708	5379	1670
Stenotrophomonas maltophilia SKK35	4760	1589
Pseudomonas aeruginosa VRFPA01	5775	1523
Stenotrophomonas maltophilia RA8	5016	1507
Mycobacterium leprae TN	2750	1086

Again: Don't expect the above links to work if you're not a registered JGI user. (I don't know if they will work for you or not.) This list was automagically generated by the Department of Energy's Joint Genome Institute and I thought you might get the same kick out of it that I got. It's eye-opening to see the ratio of pseudogenes to "normal" genes in these organisms. Isn't it?

Taking the top three spots are mouse, rat, and humans. (Note: These counts should be taken with a bit of caution, as some have estimated the number of pseudogenes in the human genome to be much higher than the 7,186 shown here.) All the other spots in the chart except Arabidopsis (which is a leafy plant) are bacteria. The leprosy bacterium, which I've written about before, comes in tenth place.

If you're not familiar with the concept of pseudogenes, you might want to look at this post. Basically we're talking about genes that are thought to be disabled and no longer functional in the normal sense, although they may well be functional in some as-yet-unappreciated sense. (Otherwise, evolutionary theory says they should have been eliminated from most genomes eons ago.)

Personally, I believe pseudogenes are as much a feature of DNA as regular genes; certainly in higher life forms, they occur in great numbers. The vast majority of bacterial genomes in public databases are shown as having no pseudogenes. I find that (how shall I say?) not at all credible. Some day it will be obvious that almost every genome harbors pseudogenes; we simply lack smart enough software to detect them all right now.

Tuesday, April 15, 2014

Coming to Grips with Pseudogenes

The term pseudogene was coined in 1977, when Jacq et al. discovered a version of the gene coding for 5S rRNA in the African clawed frog (Xenopus laevis) that was truncated yet retained homology with the active gene. Subsequent work has shown that in higher life forms, pseudogenes (genes that have been inactivated through one event or another) are almost as numerous as coding genes, with (for example) the human genome containing 10,000 or more pseudogenes. (A more recent estimate puts the number at 20,000.) Many of these pseudogenes are highly conserved. Looking at pseudogenes in the mouse and human, Svensson et al. found that of a group of 74 such genes that occur in both species, 30 appear to have been conserved since before the evolutionary divergence of mice and humans.

In higher organisms, pseudogenes are sometimes transcribed into RNA, with the RNA filling a regulatory function. For example, Korneev et al. found that simultaneous transcription of neural nitric oxide synthase (nNOS) and the antisense strand of a homologous pseudogene in the same neurons of Lymnaea stagnalis (a snail) leads to the formation of a duplex between the two strands and a reduction in nNOS translation. Further examples can be found in Pink et al. (2007), "Pseudogenes: Pseudo-functional or key regulators in health and disease?"

In bacteria, pseudogenes are somewhat rarer than in eukaryotes, but exist in significant numbers in many pathogens (including many species of Mycobacterium, Shigella, Brucella, Bordetella, and others). A study by Kuo and Ochman (2010) found that pseudogenes are swiftly eliminated from Salmonella. They describe "evidence of a strong deletional bias in Salmonella, such that genes that are not maintained by selection are rapidly inactivated and eliminated by mutational events." In fact, Kuo and Ochman found that pseudogenes are eliminated more rapidly than could be explained by the so-called neutral theory of evolution, indicating that the continued presence of pseudogenes exacts a high cost to the cell.

And yet, many bacteria with slow-evolving genomes (such as Mycobacterium species) retain their pseudogenes with high fidelity across evolutionary timespans. The most celebrated "pseudogene hoarder" of all time, M. leprae (the leprosy bacterium) appears to have acquired its 1000+ pseudogenes 9 to 20 million years ago. Meanwhile, the half-life of pseudogenes in Buchnera aphidicola was measured at 23.9 million years—a staggering number.

So on the one hand, we have work by Kuo and Ochman showing that pseudogenes in bacteria are rapidly eliminated, and on the other hand we have some bacterial lineages in which it seems pseudogenes are not only conserved but actively repaired over periods of tens of millions of years!

In Chapter 5 of Brucella: Molecular Microbiology and Genomics (2012, Caister Academic Press), Garcia-Lobo et al. describe their work with RNA sequence data from the bacterium Brucella abortus:

Twenty-four of the genes selected from the RNAseq data were annotated as pseudogenes in the B. abortus 2308 genome, which was considered a rather unexpected finding. By comparison with other Brucella genomes we can reduce the list of highly expressed pseudogenes to 16 (often, truncated parts of a gene are annotated as different pseudogenes especially in B. abortus 2308). This seems contradictory since high transcription of these genes, which should be not able to translate into functional proteins, will be contrary to biological economy. The high levels of transcription observed for these genes strongly suggest that they could be active genes and their products may perform functions unreported in metabolic reconstructions. High pseudogene expression may also indicate that these are very recently produced pseudogenes that did not turned down transcription yet by accumulation of mutations in their promoter or control regions. It is also possible that these pseudogenes may contain sequencing errors and they are indeed active genes.

It's almost comically obvious from this passage that the authors are troubled by their own finding that some pseudogenes in Brucella are highly transcribed. They try explaining it away by saying it could all be "sequencing errors."

A more parsimonious view is that pseudogenes that haven't been eliminated from a genome are, in fact, permanent, legitimate fixtures of the landscape, in microbes just as in higher life forms. And as in higher life forms, pseudogenes in microbes are probably serving perfectly understandable regulatory functions (when they're not actually translated into protein products).

Kuo and Ochman have convincingly shown that useless pseudogenes are quickly eliminated. It follows that any pseudogenes that aren't swiftly eliminated are, in fact, serving a biological purpose, or else they wouldn't be there. This line of reasoning is already well accepted by researchers who study eukaryotic life forms. Those who study bacteria need to take a hint from their up-the-food-chain colleagues.

What could the hundreds of pseudogenes in Bordetella pertussis (or the 1000+ pseudogenes in M. leprae) be doing? First we need to get used to the idea that in bacteria, virtually all genes are transcribed, in both directions. It's been four years since Dornenburg et al. reported finding ~1000 antisense transcripts in E. coli, but no one seems to have gotten the memo.

A section of Rothia mucilaginosa genome (top) and a corresponding portion of Mycobacterium leprae (bottom). The yellow gene, in each case, is DnaE (error-prone polymerase). Pink bands indicate areas of 65% or more homology between the two organisms. The small-diameter silver genes in the lower panel are M. leprae pseudogenes. "Normal genes" are shown in green. Notice that R. mucilaginosa has open reading frames on both strands of DNA, with many bidirectionally overlapping genes.

A look at the genome of the bacterium Rothia mucilaginosa DY18 shows that a very large proportion of "normal genes" have open reading frames on the opposite strand (see illustration). Bidirectional overlapping genes run throughout the Rothia genome. A massive annotation error? Maybe. Or maybe both strands are transcribed.
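Checking a gene for an open reading frame on its opposite strand takes only a few lines. The following is a rough sketch, not the annotation pipeline behind the illustration. It assumes ATG is the only start codon (real bacteria also use GTG and TTG), and the helper names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the antisense strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def longest_orf(seq):
    """Length in bases of the longest ATG...stop ORF in the three frames of seq."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens an ORF
            elif start is not None and codon in STOPS:
                best = max(best, i + 3 - start)  # ORF length includes the stop
                start = None
    return best

def longest_antisense_orf(seq):
    """Longest ORF found on the strand opposite to seq."""
    return longest_orf(reverse_complement(seq))

# Toy sense-strand sequence whose antisense strand reads ATGAAACCCTAA.
print(longest_antisense_orf("TTAGGGTTTCAT"))
```

Applied to every annotated gene, a length cutoff on longest_antisense_orf would flag candidates for bidirectionally overlapping genes.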

If massive wholesale transcription of antisense strands occurs in E. coli, as we know it does, certainly it's no stretch to imagine it occurring in Rothia mucilaginosa. And if it is occurring in Rothia, which is (incidentally) an opportunistic pathogen, how much harder can it be to imagine it occurring in another well-known pathogenic member of the order Actinomycetales, Mycobacterium leprae? We know already that upwards of 40% of M. leprae pseudogenes are transcribed. Antisense transcripts could well be playing a role in silencing certain essential genes when attempts are made to grow the organism in defined media. Forward transcripts could be producing nonsense or partial-nonsense/truncated proteins that are excreted as toxins or find their way to the cell wall as surface antigens. Any number of scenarios might be possible.

Some very low-hanging fruit is available to microbiologists who are willing to accept the obvious. Instead of wishing away pseudogenes or imagining them to be useless baggage, we should be looking at them as potential determinants of pathogenicity. We should consider their possible roles in modulating protein expression patterns. We should attempt to learn why they're conserved; what role(s) they're playing in cell physiology. The last thing in the world we should be doing is calling them "junk DNA."

Monday, April 14, 2014

Pseudogenes Are Not Junk DNA

In 2007, a PLoS ONE paper by Ahmed et al. proposed a phylogeny for Mycobacteria in which M. leprae (the leprosy organism) is shown as a relatively recent branch off a very long tree, with M. tuberculosis depicted (in a decidedly fanciful schematic) as being of relatively recent provenance (35,000 years), diverging from M. canettii (a recently discovered cousin of tuberculosis) 3 million years ago.

The rather fanciful phylogenetic picture of Mycobacterium evolution presented by Ahmed et al. (2007).

The only trouble with this picture is that we know it's wrong. More exacting work has shown that M. tuberculosis is at least 3 million years old, and one paper estimates that the common ancestor of TB and leprosy may go back 66 million years. If the latter figure sounds dubious, consider that until recently, M. leprae wasn't thought to have any sister strains that could aid with dating the organism phylogenetically. But in 2008, the situation changed dramatically when it was realized that in Mexico, a distinct form of leprosy known as "diffuse lepromatous leprosy" (DLL) was actually due to a genetically distinct variant of Mycobacterium known as M. lepromatosis. When the genome for the latter organism was analyzed, it was found to contain the same stupendous assortment of pseudogenes contained in M. leprae, but detailed analysis of polymorphisms in the genomes of the two strains led to a surprising finding: Divergence of the strains appears to have occurred around 10 million years ago.

Another team found that the massive "pseudogenization event" that caused M. leprae (and its cousin, M. lepromatosis) to become saddled with a record number (1,116) of pseudogenes probably occurred on the order of 20 million years ago.

The age and stability of the pseudogenes in M. leprae can only be described as stunning. Conventional evolutionary dogma says that pseudogenes will inevitably be degraded and lost over time. Surely M. leprae can't be conserving and repairing pseudogenes over 10-million-year-long timespans? Pseudogenes are discardable junk.

Or are they?

An analysis of Buchnera aphidicola (the tiny Enterobacterial endosymbiont of the pea aphid) put the half-life of pseudogenes in that organism at 23.9 million years.
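To get a feel for what a 23.9-million-year half-life means, here is a small worked calculation. It assumes pseudogene loss follows simple exponential decay, which is an assumption made for illustration rather than something taken from the Buchnera analysis itself. The function name is hypothetical.

```python
def fraction_remaining(elapsed_myr, half_life_myr=23.9):
    """Fraction of pseudogenes expected to survive after elapsed_myr million years,
    assuming simple exponential decay with the given half-life."""
    return 0.5 ** (elapsed_myr / half_life_myr)

# Survival after 10, 20, and 23.9 million years.
for t in (10, 20, 23.9):
    print(t, round(fraction_remaining(t), 3))
```

Under that assumption, roughly three quarters of an initial set of pseudogenes would still be present after 10 million years.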

Human DNA reportedly contains over 12,000 pseudogenes. Some of these pseudogenes are quite old. Parallel nonsense mutations caused a pseudogenization of the uricase gene in apes during the early Miocene era (17 million years ago). We still carry the pseudogene in question—and it gets transcribed. According to a report by James T. Kratzer and colleagues at the University of Texas, Austin:

Despite being nonfunctional, cDNA sequencing confirmed that uricase mRNA is present in human liver cells and that these transcripts have two premature stop codons.

The inevitable conclusion is that pseudogenes are not, and should not be considered by default, "junk DNA." To the contrary, the default assumption should be that pseudogenes are ancient and conserved—because in most cases, that's exactly what they are.

What causes genes to "go pseudo"? Why are they conserved? What are they really doing? I'll tackle some of those questions in a followup post. Stay tuned.

Sunday, April 13, 2014

Whooping Cough Genomics

Pertussis, also known as whooping cough, is a highly contagious respiratory infection caused by Bordetella pertussis, a small aerobic bacterium that secretes numerous toxins capable of disrupting a normal immune response. The disease is rarely fatal but leaves victims with a nasty cough that can last weeks. In 2012, in the U.S., some 48,277 cases of pertussis were reported to the CDC. Of those cases, only 20 were fatal. By contrast, 28 Americans were killed by lightning the same year.

Bordetella pertussis

Unlike tuberculosis (which has been with us for 3 million years), Bordetella shows evidence of being a fairly new (and still rapidly evolving) pathogen, although in this case "fairly new" could still mean 700,000 years.

The complete DNA sequence of B. pertussis has been available for several years. It shows a moderate-size genome (of 4 million base pairs) encoding 3,447 genes, with a substantial number (360) of pseudogenes. The latter represent genes that have (by one means or another) been inactivated, whether through the appearance of premature stop codons in the gene, loss of a promoter region, random deletions, or what have you.

What makes Bordetella's pseudogenes interesting is that they're in remarkably good shape, as pseudogenes go. Usually, once a gene gets inactivated (goes pseudo), it begins to accumulate random point mutations, deletions, insertions, etc. at a substantial rate. In other words it deteriorates, since (supposedly) it's no longer under selection pressure. But when Australian researchers looked at 358 pseudogenes in B. pertussis Tohama I strain, they were shocked to find that the rate of nucleotide polymorphisms (i.e., changes to individual base-pairs in the DNA) was actually lower in pseudogenes than in regular genes (4.7E-5 per site versus 5.1E-5 per site). That's exactly the opposite of what's expected. The researchers commented, somewhat laconically: "This suggests that most pseudogenes in B. pertussis were formed in the recent past and are yet to accumulate more mutations than functional genes."

What other explanation is there? Well, the most obvious alternative explanation is that the genes are still under selection pressure, even though they're turned off. How can that be? I can think of any number of scenarios; perhaps that'll be a future blog post. Suffice it for now to say, ribosomes are not totally unforgiving of missing stop codons (read up on tmRNA) nor are they unforgiving, in all cases, of frameshifts (read about programmed frameshifts), and if an open reading frame should appear on a pseudogene's antisense strand, you now have an RNA silencer (potentially) for the remaining good copy or copies of the gene, with attendant gene-modulation possibilities.

It's worth pointing out that pseudogenes in M. leprae (the leprosy bacterium) are not only conserved and ancient but continue to show strong homology to working orthologues in M. tuberculosis (and even more distantly related organisms such as Gordonia, Corynebacterium, and Nocardia) after millions of years. More of which, in a later post.

For now, I thought it might be worth looking at the base composition of B. pertussis pseudogenes to see if they're riddled with frameshift errors (as is the case with M. leprae's pseudogenes). When I analyzed all 1,125,521 codons for all normal (not pseudo) genes in B. pertussis Tohama I strain, the resulting "paintball diagram" of base composition came up looking like this:

Paintball diagram for normal genes in B. pertussis Tohama I. Red dots are for codon base one, gold represents the composition at codon base two, blue is "wobble" (third) base composition. Every dot represents statistics for one gene (n=3447). See text for discussion.

Here, we're looking at purine (A+G) content versus G+C content for each base position in the codons. Every dot represents a gene's worth of data. Not unexpectedly, the most extreme G+C values occur in the third ("wobble") base. Codon base one (red dots) is purine-rich, centering on y=0.58. This is typical of most codons in most genes, in most organisms. Notice the "breakaway cloud" of gold points underneath the main gold cloud (at y<0.4). These points represent genes in which the second codon base is mostly a pyrimidine (C or T). Codons with a pyrimidine in base two tend to code for nonpolar amino acids. Thus, the breakaway cloud of gold points represents membrane-associated proteins. In this case, we're looking at about 558 genes falling in that category.
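Each gene contributes three points to a diagram like this, one per codon position. A minimal sketch of computing those coordinates follows. It assumes the sequence is in frame from its first base, and the function name position_composition is hypothetical.

```python
def position_composition(seq):
    """Return [(gc, ag), (gc, ag), (gc, ag)] for codon positions 1, 2, and 3.

    gc is the G+C fraction (x-axis) and ag the purine A+G fraction (y-axis)
    among the bases found at that codon position.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    points = []
    for pos in range(3):
        bases = seq[pos:usable:3]     # every third base, starting at pos
        n = len(bases)
        gc = sum(b in "GC" for b in bases) / n if n else 0.0
        ag = sum(b in "AG" for b in bases) / n if n else 0.0
        points.append((gc, ag))
    return points

# Toy gene: codons ATG GCC GAT TAA.
print(position_composition("ATGGCCGATTAA"))
```

Plotting the first, second, and third tuples in red, gold, and blue for every gene would reproduce the layout of the diagram.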

Now look at the paintball diagram for the organism's 360 pseudogenes:

Base composition for "codons" in 360 pseudogenes of B. pertussis Tohama I. In this graph, as in the one above, dots are rendered with an opacity of 60% (so that overlapping points are less likely to obscure each other). See text for discussion.

In this case, there's a considerable amount of random statistical splay, but some of that is due simply to the fact that pseudogenes are a good deal shorter than normal genes, giving rise to more noise in the signal. (In this case, the average length of a pseudogene is 482 bases, vs. 982 for the 3,447 "normal" genes.) Even with considerable noise, though, it's apparent that the dot clusters tend to center on different parts of the graph, corresponding to the expected locations for normal genes. (Contrast this with the situation in M. leprae, where pseudogenes are riddled with frameshifts, rendering the concept of "codon base position" moot. Refer to the second paintball graph on this page.) Thus, we can say with some confidence that frameshifts are not so rampant in B. pertussis pseudogenes as to have rendered the concept of codons irrelevant. In fact, compared to M. leprae, pseudogenes in B. pertussis are comparatively unaffected by frameshifts. This tends to support the view of the Australian researchers (mentioned earlier) that pseudogenes in B. pertussis have not had enough time to accumulate very many mutations. But it can also be hypothesized that B. pertussis has had plenty of time (700,000 years, in fact) in which to accumulate mutations in its pseudogenes, yet has not done so. The evidence suggests that if anything, Bordetella repairs pseudogenes even more faithfully than regular genes.
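The claim that shorter genes give noisier composition estimates can be made concrete with the binomial standard error. The sketch below treats each codon as an independent draw, which is a simplifying assumption, and uses the average lengths quoted above (482 and 982 bases) with a purine fraction of 0.58. The function name is hypothetical.

```python
from math import sqrt

def ag1_standard_error(p, gene_length_bases):
    """Binomial standard error of a per-gene AG1 estimate.

    p is the underlying purine fraction at codon base one; the number of
    codons is the gene length divided by three (rounded down).
    """
    n_codons = gene_length_bases // 3
    return sqrt(p * (1 - p) / n_codons)

se_pseudo = ag1_standard_error(0.58, 482)   # average pseudogene length
se_normal = ag1_standard_error(0.58, 982)   # average normal-gene length
print(round(se_pseudo, 3), round(se_normal, 3), round(se_pseudo / se_normal, 2))
```

On these assumptions the per-gene scatter for an average pseudogene is about 1.4 times that of an average normal gene, so some extra splay is expected from length alone.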

At this point it might be relevant to interject that while M. leprae (like other members of the Mycobacteria) lacks the MutS/MutL mismatch repair system, Bordetella does, in fact, have a MutS/MutL mismatch repair system, and this may explain the relative paucity of frameshift errors in Bordetella pseudogenes. But it also implies (rather queerly) that Bordetella goes out of its way to repair its pseudogenes.

Interestingly, 234 out of 360 pseudogenes have an AG1 (purine, base one) content greater than 55%, which means they're probably still "in frame." Of these 234, some 69 (30%) have AG2 less than 40%, meaning they're most likely genes for membrane-associated proteins. If we look at the 2,456 normal genes that have AG1 greater than 55%, only 398 (16%) are putative membrane-associated proteins (with AG2 less than 40%). Bottom line: Pseudogenes for putative membrane-associated proteins are twice as likely to still be in-frame. While this could be a statistical fluke, it could also be that membrane proteins are somehow "spared" preferentially when it comes to leaving pseudogenes translatable. To put it differently: Pseudogenes for non-membrane-associated proteins are less likely to remain in-frame. This makes sense, in that much of Bordetella's pathogenicity can be ascribed to proteins that make up cell-surface antigens or that transport toxins to the outside world. Some of the toxic surface proteins may, in fact, be nonsense (or partial-nonsense) proteins: products of pseudogenes.
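The two cutoffs and the resulting comparison can be written out explicitly. This is a sketch using the thresholds and counts quoted above. The classify helper is hypothetical and is not the script that produced those counts.

```python
def classify(ag1, ag2, frame_cutoff=0.55, membrane_cutoff=0.40):
    """Return (in_frame, membrane_like) flags for one gene or pseudogene.

    in_frame: purine fraction at codon base one exceeds frame_cutoff.
    membrane_like: purine fraction at codon base two is below membrane_cutoff.
    """
    return ag1 > frame_cutoff, ag2 < membrane_cutoff

# Counts quoted in the text: in-frame pseudogenes vs. in-frame normal genes.
pseudo_share = 69 / 234      # membrane-like share among in-frame pseudogenes
normal_share = 398 / 2456    # membrane-like share among in-frame normal genes
enrichment = pseudo_share / normal_share
print(f"{pseudo_share:.1%} vs {normal_share:.1%}, ratio {enrichment:.2f}")
```

The ratio works out to roughly 1.8, which is the basis for the "twice as likely" figure.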

Wednesday, April 09, 2014

Are dead genes still alive in the leprosy bacillus?

The genome of the leprosy bacterium (Mycobacterium leprae) stands as a remarkable example of DNA in an apparent state of massive, wholesale breakdown. Of the organism's 2720 genes, only 1604 appear to be functional, while 1116 are pseudogenes, which is to say genes that have been "turned off" and left for dead.

Genes can become pseudogenes in any number of ways, including loss of a start codon, loss of promoter regions (or degraded Shine Dalgarno signals), random insertions and deletions, mutations that cause spurious stop codons, and so on. Once a gene gets "turned off," assuming loss of the gene in question isn't fatal, the gene typically undergoes a period of degradation (leading to its eventual loss from the genome), but that's not exactly what we see in the leprosy bacterium. When leprosy germs from medieval skeletons were sampled and their genomes sequenced, researchers found that pseudogenes in M. leprae haven't changed very much in the past thousand years or so. Not only does M. leprae tend to hold onto its pseudogenes, it actively transcribes upwards of 40% of them. Probably not all of the transcripts result in expressed proteins (many lack a start codon!), but some no doubt do get translated into proteins. Let's put it this way: It would be extremely unusual for an organism to conserve this many pseudogenes if none of them was doing anything useful.

This view of a segment of the two genomes shows how a region of around 80,000 base pairs in M. tuberculosis maps to a similar 68,000-base-pair region of M. leprae. Notice that in the lowermost panel (representing M. leprae), many genes are shown as shrunken silver segments instead of fat green cylinders. The smaller grey/silver segments are pseudogenes.

To get a better idea of what's going on here, I downloaded the DNA sequences of M. leprae's 1604 "normal" genes as well as the 1116 pseudogenes. In analyzing the codons for these genes, I looked for signs of genes that were still in the normal reading frame. One way to detect this is by measuring the purine content at the various base positions in a gene's codons. In a typical protein-coding gene, around 60% of codons begin with A or G (adenine or guanine). This positional bias will, of course, be lost in a gene that has undergone frameshift mutations. Among M. leprae's 1116 pseudogenes, I found 269 in which codons showed an average AG1 percentage (A+G content, codon base one) of 55% or more. These are pseudogenes that appear to still be mostly "in frame."

Things get a lot more interesting where putative membrane proteins are concerned. In a previous post, I showed that in some genes, the second codon base is pyrimidine-rich (i.e., predominantly C or T: cytosine or thymine); these genes encode proteins with a high percentage of nonpolar amino acids. Bottom line, if a gene's codons are mostly T or C in the second position, that gene most likely encodes a membrane-associated protein. (See my previous post for some data.) This is true for all organisms (viruses, cells) and organellar genes, too, by the way, not just M. leprae. It's a generic feature of the genetic code.

When I segregated M. leprae pseudogenes according to whether or not the second codon base was (on average) less than, or more than, 40% purines, I stumbled onto something quite interesting. I found 51 pseudogenes with AG2 less than 40% (meaning, these are probably membrane-associated proteins). Of those, 32 (or 62%) are still "in frame," with AG1 > 55%. By contrast, the majority (78%) of non-membrane pseudogenes (AG2 > 40%) appear to be turned off, with an average AG1 of 51%.

Long story short: Most non-membrane-associated pseudogenes are out-of-frame (and likely dead), whereas 62% of putative membrane-associated pseudogenes appear to be in-frame, and therefore could still be functional (or at least, undead).

In looking at stop codons, I found that of the pseudogenes that still had stop codons, the average distance to the first stop codon is only 149 bases (whereas the average pseudogene length is 795 bases). Pseudogenes for putative membrane-associated proteins were shorter overall (as membrane proteins often are; 495 bases instead of 795), but the average distance to the first stop codon was 190 bases, significantly longer than for the other pseudogenes. This suggests some of them are still alive.
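Measuring the distance to the first stop codon is straightforward. The sketch below assumes "distance" means the number of bases preceding the first in-frame stop, read in the annotated frame from the first base. That definition and the function name are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_stop_distance(seq):
    """Bases from the start of seq to the first in-frame stop codon, or None.

    Only frame 0 is scanned, so out-of-frame TAA/TAG/TGA triplets are ignored.
    """
    seq = seq.upper()
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None

# ATG AAA TGA CCC: the stop codon begins at base index 6.
print(first_stop_distance("ATGAAATGACCC"))
# ATG AAA CCC: contains an out-of-frame TGA, but no in-frame stop.
print(first_stop_distance("ATGAAACCC"))
```

Averaging this value over all pseudogenes that return something other than None gives a figure comparable to the 149- and 190-base averages above.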

By now you're probably wondering how the heck a pseudogene can be of any possible use whatsoever when it contains a premature stop codon. The thing we need to ask, though, is why M. leprae tolerates (indeed conserves) so many pseudogenes in the first place. Could it be that the organism has adapted a frameshift-tolerant translation apparatus? Maybe some of the stop codons aren't really stop codons.

We know that a wide variety of organisms (not just viruses, where this phenomenon was first discovered, but bacteria and eukaryotes) have evolved special signals to tell ribosomes to shift in and out of frame by plus or minus one. (See "A Gripping Tale of Ribosomal Frameshifting: Extragenic Suppressors of Frameshift Mutations Spotlight P-Site Realignment," Atkins and Björk, Microbiol. Mol. Biol. Rev. 2009.) Certain tRNAs participate in "quadruplet codon" decoding, making it possible for special frameshift signals to work. The signals usually involve 7-base-long "slippery heptamer" sequences, such as CCCTGAC, right where a stop codon (TGA) appears. In other words, when a stop codon appears inside a slippery heptamer, it's not really a stop codon. Depending on the kinds (and amounts) of tRNAs "on duty," it can be a frameshift signal.

When I looked for CCCTGAC in M. leprae's pseudogenes, I found 16 in-frame occurrences of the sequence in 1116 pseudogenes. (Only 7 occurrences of the hexamer CCCTGA were found, in frame, in M. leprae's "normal" genes.) While this doesn't prove that M. leprae is up to any unusual translation tricks, it's a tantalizing result. Also bear in mind that if M. leprae is indeed up to some unusual tricks, it may very well be using frameshift signals other than (or in addition to) CCCTGAC. The fact that Mycobacterium species lack a MutS/MutL mismatch repair system means M. leprae may have adopted different ways of coping with "slippery repeats."
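A motif count like the one above can be sketched as follows. The big assumption here is what "in frame" means: this sketch counts an occurrence only when the motif starts on a codon boundary (so that CCC and TGA both line up as whole codons). The original search may have used a different criterion, and the function names are hypothetical.

```javascript
// Count occurrences of `motif` that start on a codon boundary
// (index divisible by 3). Overlapping matches are allowed.
function countInFrame(seq, motif) {
  const s = seq.toUpperCase();
  const m = motif.toUpperCase();
  let count = 0;
  let idx = s.indexOf(m);
  while (idx !== -1) {
    if (idx % 3 === 0) count++;
    idx = s.indexOf(m, idx + 1);
  }
  return count;
}

// Total over a list of sequences.
function totalInFrame(seqs, motif) {
  return seqs.reduce((sum, seq) => sum + countInFrame(seq, motif), 0);
}

// Toy sequences, not real data: the first has the heptamer in frame
// (ATG CCC TGA C..), the second has it shifted by one base.
const toySeqs = ["ATGCCCTGACGG", "ATGGCCCTGACG"];
console.log(totalInFrame(toySeqs, "CCCTGAC"));
```

Passing "CCCTGA" instead of "CCCTGAC" gives the hexamer version of the same count.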

Further work will be needed to confirm whether M. leprae indeed translates some of its pseudogenes into proteins. The 32 "high likelihood" pseudogenes that, according to my analysis, might still encode functional (or at least expressed) membrane-associated proteins are shown in the table below. Leave a comment if you have additional thoughts.

M. leprae pseudogenes that have codons with overall AG1 > 55% and AG2 < 40%:

Pseudogene	Possible product
MLBr00146	hypothetical protein
MLBr00189	hypothetical protein
MLBr00278	conserved hypothetical protein
MLBr00341	hypothetical protein
MLBr00460	hypothetical protein
MLBr00478	hypothetical protein
MLBr00738	PstA component of phosphate uptake
MLBr00836	hypothetical protein
MLBr00846	ABC transporter
MLBr01054	possible PPE-family protein
MLBr01156	hypothetical protein
MLBr01237	possible cytochrome P450
MLBr01238	probable cytochrome P450
MLBr01400	possible membrane protein
MLBr01414	PGRS-family protein
MLBr01474	hypothetical protein
MLBr01527	dihydrodipicolinate reductase
MLBr01673	conserved hypothetical protein
MLBr01792	probable Na+/H+ exchanger
MLBr01968	PE family protein
MLBr02003	probable ketoacyl reductase
MLBr02101	conserved hypothetical protein
MLBr02150	molybdopterin converting factor subunit 1
MLBr02190	PstA component of phosphate uptake
MLBr02216	dihydrolipoamide dehydrogenase
MLBr02363	19 kDa antigenic lipoprotein
MLBr02477	PE protein
MLBr02484	transcriptional regulator (LysR family)
MLBr02533	PE-family protein
MLBr02656	conserved hypothetical protein
MLBr02674	possible membrane protein

Wednesday, April 02, 2014

Frameshift errors in leprosy bacterium DNA

Shocking as it might sound, leprosy continues to strike over 200,000 persons per year worldwide, making it as much of a health problem as cholera or yellow fever. One of the oldest known infectious diseases, leprosy became the first disease to be causally linked to bacteria when Hansen made his famous discovery of the connection to Mycobacterium leprae in 1873. Ever since then, scientists have been trying to grow M. leprae in the lab, to no avail. Like most environmental isolates, M. leprae defies attempts at pure culture. The only way to grow it in the lab is to infect mice or armadillos, where it has a doubling time of 14 days, the longest known generation time of any bacterium.

Traditionally, it has been assumed that the difficulty in growing M. leprae in pure culture is due to the organism's complex nutritional requirements. (In humans, the organism is an obligate intracellular parasite that takes up residency in the Schwann cells of the peripheral nervous system.) There is no doubt considerable truth to this assumption, but the reason for the organism's fastidious nutritional requirements wasn't fully known until Cole et al. (2001) showed that half the bacterium's genome is inoperative and undergoing decay. Genomic sequencing revealed that M. leprae has only three quarters the DNA content of its (quite robust) cousin, M. tuberculosis, and of M. leprae's 3,000-or-so remaining genes, only 1,600 are fully functional. The rest are pseudogenes.

Pseudogenes are genes that have become inactivated through loss of start codons, loss of promoter regions, introduction of spurious stop codons, introduction of frameshift errors, or through other causes. Almost all organisms contain pseudogenes in their DNA. (Human DNA reportedly contains over 12,000 pseudogenes.) The leprosy bacterium, however, is unique in having approximately half its genome tied up in pseudogenes. Once a gene becomes a pseudogene, it is effectively useless baggage ("junk DNA") and continues on a long path of deterioration. Evolutionary theory predicts that such genes will eventually be lost from the genome, since the carrying cost of keeping them puts the organism at a disadvantage, energetically. But the curious thing about M. leprae is that it's a hoarder: It not only holds onto its useless genes, it actually transcribes upwards of 40% of them. In fact, a recent study of 1000-year-old M. leprae DNA (recovered from medieval skeletons), comparing the medieval version of the organism's genome with the genome of today's M. leprae, found that pseudogenes are highly conserved in the bacterium.

The fact that the bacterium actually transcribes many of its pseudogenes (and doesn't lose them over time) is striking, to say the least, and suggests that the transcription of certain genes or pseudogenes is resulting in mRNAs that silence other, more deleterious genes. It could be that M. leprae can't be grown in culture because when certain combinations of nutrients are presented to it, the nutrients up-regulate deleterious nonsense genes in otherwise-normal operons (or down-regulate important silencers), directly or indirectly. (Williams et al. found that many M. leprae pseudogenes are located in the middle of operons and are transcribed via fortuitous read-through.) Various scenarios are possible. Much work remains to be done.

In the meantime, I couldn't help doing a little desktop science to characterize M. leprae's "defective genes" problem further. I went to http://genomevolution.org/CoGe/OrganismView.pl and entered "Mycobacterium leprae Br4923" in the Organism Name field. In the Genome Information box, if you click the "Click for Features" link, you can see that 1604 genes are labeled "CDS" (meaning, these are the operative, non-defective genes) while a separate line item shows an utterly astounding 2233 genes as pseudogenes. (Addendum: The FASTA file at genomevolution.org contains duplicates. The actual pseudogene count, it turns out, is 1116, not 2233. But still, 1116 is a huge number of pseudogenes.) The "DNA Seqs" links on the right side of that page allow you to download the FASTA sequences for the respective gene groupings. These are simple text files containing the base sequences (A, T, G, and C) for the coding strands of the genes.
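Reading a FASTA file and weeding out duplicates takes only a few lines. The sketch below is an illustration rather than the original script, and it makes an assumption about what a "duplicate" is: two records count as duplicates when their sequences are identical, regardless of header. The actual file may duplicate entries in some other way (for example, by header), so treat parseFasta and dedupeBySequence as hypothetical helpers.

```javascript
// Parse FASTA text into an array of { header, seq } records.
function parseFasta(text) {
  const records = [];
  let current = null;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue;
    if (line.startsWith(">")) {
      current = { header: line.slice(1), seq: "" };
      records.push(current);
    } else if (current) {
      current.seq += line.toUpperCase(); // sequences may span several lines
    }
  }
  return records;
}

// Keep only the first record for each distinct sequence.
function dedupeBySequence(records) {
  const seen = new Set();
  return records.filter((r) => {
    if (seen.has(r.seq)) return false;
    seen.add(r.seq);
    return true;
  });
}

// Toy input, not real data: the third record repeats the first sequence.
const toyFasta = ">a\nATGC\nGG\n>b\nTTTT\n>a_copy\nATGCGG\n";
const records = parseFasta(toyFasta);
console.log(records.length, dedupeBySequence(records).length);
```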

I wrote a few lines of JavaScript to analyze the base compositions of the genes (and pseudogenes), and what I noticed immediately is that the base composition differs for the two groups:

Base	Content (Genes)	Content (Pseudogenes)
A	0.1938	0.2119
G	0.3116	0.2867
C	0.2890	0.2778
T	0.2046	0.2223

The G+C content for the "normal" genes averages 60.6%, whereas for the pseudogenes it's 55.4%. A typical G+C value for other members of the genus Mycobacterium is 65%. Thus, it's clear that not only the pseudogenes but the "normal" genes of M. leprae have drifted in the direction of more A+T. This has been noted before (by Cole et al. and others). What's perhaps less obvious is that purine content (A+G) has shifted from 50.5% in the normal genes to 49.8% in the pseudogenes. Bear in mind we're looking at data for one strand of DNA: the so-called coding or "message" strand.
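A base-composition tally of this kind can be sketched as below. One caveat: this version pools all bases across all sequences before dividing, whereas a per-gene average (compute G+C for each gene, then average the genes) can give somewhat different numbers. The post does not say which approach produced each figure, so this is one plausible reading, and baseComposition is a hypothetical name.

```javascript
// Pooled base frequencies over a list of sequences.
// Characters other than A, C, G, T (e.g. N) are skipped.
function baseComposition(seqs) {
  const counts = { A: 0, C: 0, G: 0, T: 0 };
  let total = 0;
  for (const seq of seqs) {
    for (const base of seq.toUpperCase()) {
      if (base in counts) {
        counts[base]++;
        total++;
      }
    }
  }
  const freq = {};
  for (const base of Object.keys(counts)) {
    freq[base] = total ? counts[base] / total : NaN;
  }
  // GC = G+C content; AG = purine content (A+G).
  return { ...freq, GC: freq.G + freq.C, AG: freq.A + freq.G };
}

// Toy sequences, not real data.
console.log(baseComposition(["AAGC", "GT"]));
```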

Clearly, there is a tendency for pseudogenes to "regress to the mean." But the shift in purine concentration is particularly interesting, because it indicates that purine usage in normal-gene coding regions is perhaps non-randomly elevated. The shift from 50.5% to 49.8% in A+G content may not seem particularly striking on its own, but the difference, it turns out, is highly significant. You can see why in the following graph.

Base composition of "normal" genes in M. leprae (total purines vs. G+C) by codon base position. (n=1604) Red dots are for base one, gold dots are for base two, blue dots are for base three (the "wobble" base). See text for discussion.

To make this graph, I looked at the DNA of the coding regions of "normal" genes and determined the average purine content as well as the G+C content for positions one, two, and three of all codons. As you can see, the purine content (relative to the G+C content) segregates non-randomly according to codon base position. The red dots represent base one, the gold (or brown) dots represent base two, and the blue dots represent base three (often called the "wobble" base, for historical reasons). Not unexpectedly, the greatest G+C shift occurs in base three (as is usually the case). What's perhaps more surprising is the clear preference for purines in base one. The red cluster centers at y = 0.6051 plus or minus 0.0467 (standard deviation). This means that on average, position one of a codon is occupied by a purine (A or G) over 60% of the time. This is actually quite typical of codons in most organisms. I've looked at over 1,300 bacterial species so far, and in all of them, purines accumulate at codon base one. (Maybe in a future post, I'll present more data to this effect.)

Base two segregates out as having a G+C content significantly below the organism's total-genome G+C content and centers on y = 0.4434 (median) plus or minus 0.0547 (SD).
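The per-position numbers behind a graph like this can be sketched as follows: one (G+C, purine) point per gene for each of the three codon positions, plus a mean and standard deviation of the purine values. This is an illustration under assumptions, not the original script. In particular, it uses the population standard deviation (dividing by n), and the post does not say whether that or the sample version was used. All function names are hypothetical.

```javascript
// Purine (A+G) and G+C fractions at one codon position (0, 1, or 2)
// of a single sequence, read in frame from its first base.
function positionStats(seq, position) {
  const s = seq.toUpperCase();
  const codons = Math.floor(s.length / 3);
  let ag = 0;
  let gc = 0;
  for (let c = 0; c < codons; c++) {
    const b = s[3 * c + position];
    if (b === "A" || b === "G") ag++;
    if (b === "G" || b === "C") gc++;
  }
  return { ag: ag / codons, gc: gc / codons };
}

// Mean and population standard deviation of a list of numbers.
function meanAndSd(values) {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((a, v) => a + (v - mean) ** 2, 0) / values.length;
  return { mean, sd: Math.sqrt(variance) };
}

// For each codon position: the plotted points (x = gc, y = ag)
// and a summary of the purine values across genes.
function summarize(seqs) {
  const usable = seqs.filter((s) => s.length >= 3);
  return [0, 1, 2].map((position) => {
    const points = usable.map((s) => positionStats(s, position));
    return {
      position: position + 1,
      points,
      purines: meanAndSd(points.map((p) => p.ag)),
    };
  });
}

// Toy sequences, not real data.
const summary = summarize(["GCTATC", "ACCGTT"]);
console.log(summary.map((s) => [s.position, s.purines.mean, s.purines.sd]));
```

Each entry in points is one dot on the graph, so feeding the points for positions one, two, and three to any plotting tool in three different colors reproduces the layout described above.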

Now compare the above graph with a similar graph for M. leprae's pseudogenes:

Base composition of M. leprae pseudogenes by codon position. (n=2233) Again, red dots are for base one, gold are for base two, blue are for base three. See text for discussion.

Here, it's evident that base compositions for all three codon positions overlap significantly. The fact that the codon positions are no longer clearly defined in their spatial representation on this graph is consistent with widespread frameshift mutations in the DNA, causing bases that would normally be in position one (or two or three) to be in some other position, randomly.

Hence we can say, with some confidence, on the basis of these graphs, that many (if not most) of the "junk genes" in M. leprae harbor frameshift mutations. The question of which came first—frameshift mutations, or silencing of genes (followed by frameshifts)—is still open. But we know for certain frameshifts are indeed rampant in the M. leprae pseudogenome.

Exactly how or why M. leprae accumulated so many frameshift mutations (and then kept hoarding the mutated genes) is unknown. As I said earlier, much work remains to be done.

Note: Graphs were produced using the excellent service at ZunZun.com. Hand-editing of SVG graphs (before conversion to PNG) enabled easy modification of the data-point colors in a text editor. Data points were plotted with opacity = 0.30 so that areas of high overlap are more apparent visually (with the piling of data points on top of data points).

Bioinformaticists (and others!), feel free to leave a comment below.