Showing posts with label Mycobacterium leprae. Show all posts

Wednesday, April 09, 2014

Are dead genes still alive in the leprosy bacillus?

The genome of the leprosy bacterium (Mycobacterium leprae) stands as a remarkable example of DNA in an apparent state of massive, wholesale breakdown. Of the organism's 2720 genes, only 1604 appear to be functional, while 1116 are pseudogenes, which is to say genes that have been "turned off" and left for dead.

Genes can become pseudogenes in any number of ways, including loss of a start codon, loss of promoter regions (or degraded Shine-Dalgarno signals), random insertions and deletions, mutations that cause spurious stop codons, and so on. Once a gene gets "turned off," assuming loss of the gene in question isn't fatal, the gene typically undergoes a period of degradation (leading to its eventual loss from the genome), but that's not exactly what we see in the leprosy bacterium. When leprosy germs from medieval skeletons were sampled and their genomes sequenced, researchers found that pseudogenes in M. leprae haven't changed very much in the past thousand years or so. Not only does M. leprae tend to hold on to its pseudogenes, it actively transcribes upwards of 40% of them. Probably not all of the transcripts result in expressed proteins (many lack a start codon!), but some no doubt do get translated into proteins. Let's put it this way: It would be extremely unusual for an organism to conserve this many pseudogenes if none of them was doing anything useful.

This view of a segment of the two genomes shows how a region of around 80,000 base pairs in M. tuberculosis maps to a similar 68,000-base-pair region of M. leprae. Notice that in the lowermost panel (representing M. leprae), many genes are shown as shrunken silver segments instead of fat green cylinders. The smaller grey/silver segments are pseudogenes.

To get a better idea of what's going on here, I downloaded the DNA sequences of M. leprae's 1604 "normal" genes as well as the 1116 pseudogenes. In analyzing the codons for these genes, I looked for signs of genes that were still in the normal reading frame. One way to detect this is by measuring the purine content at the various base positions in a gene's codons. In a typical protein-coding gene, around 60% of codons begin with A or G (adenine or guanine). This positional bias will, of course, be lost in a gene that has undergone frameshift mutations. Among M. leprae's 1116 pseudogenes, I found 269 in which codons showed an average AG1 percentage (A+G content, codon base one) of 55% or more. These are pseudogenes that appear to still be mostly "in frame."
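The AG1 measurement described above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the original script: the function name `purineByCodonPosition` is hypothetical, the toy sequence is not real M. leprae data, and the sketch assumes each sequence begins in frame at its first base.

```javascript
// Fraction of purines (A or G) at codon positions one, two, and three.
// Assumes the coding-strand sequence starts in frame at index 0;
// a trailing partial codon is ignored.
function purineByCodonPosition(seq) {
  const s = seq.toUpperCase();
  const nCodons = Math.floor(s.length / 3);
  const counts = [0, 0, 0];
  for (let i = 0; i < nCodons * 3; i++) {
    if (s[i] === "A" || s[i] === "G") counts[i % 3]++;
  }
  return counts.map((c) => (nCodons ? c / nCodons : 0));
}

// Toy sequence (codons ATG GCC GAA ACT), for illustration only.
const [ag1, ag2, ag3] = purineByCodonPosition("ATGGCCGAAACT");
console.log(ag1, ag2, ag3); // 1 0.25 0.5
```

A sequence whose first value comes out at 0.55 or higher would count as "mostly in frame" under the cutoff used in this post.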

Things get a lot more interesting where putative membrane proteins are concerned. In a previous post, I showed that in some genes, the second codon base is pyrimidine-rich (i.e., predominantly C or T: cytosine or thymine); these genes encode proteins with a high percentage of nonpolar amino acids. Bottom line, if a gene's codons are mostly T or C in the second position, that gene most likely encodes a membrane-associated protein. (See my previous post for some data.) This is true for all organisms (viruses, cells) and organellar genes, too, by the way, not just M. leprae. It's a generic feature of the genetic code.

When I segregated M. leprae pseudogenes according to whether or not the second codon base was (on average) less than, or more than, 40% purines, I stumbled onto something quite interesting. I found 51 pseudogenes with AG2 less than 40% (meaning, these are probably membrane-associated proteins). Of those, 32 (or 62%) are still "in frame," with AG1 > 55%. By contrast, the majority (78%) of non-membrane pseudogenes (AG2 > 40%) appear to be turned off, with an average AG1 of 51%.
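The segregation step can be sketched as follows. The thresholds (AG1 of 55% or more, AG2 below 40%) come from the text; everything else, including the names `classify` and `tally` and the two toy sequences, is an assumption made for illustration.

```javascript
// Purine (A/G) fraction at each codon position; assumes frame 0.
function purineByCodonPosition(seq) {
  const s = seq.toUpperCase();
  const nCodons = Math.floor(s.length / 3);
  const counts = [0, 0, 0];
  for (let i = 0; i < nCodons * 3; i++) {
    if (s[i] === "A" || s[i] === "G") counts[i % 3]++;
  }
  return counts.map((c) => (nCodons ? c / nCodons : 0));
}

// Thresholds from the text: AG1 >= 0.55 means "still in frame",
// AG2 < 0.40 means "putative membrane-associated protein".
function classify(seq, ag1Min = 0.55, ag2Max = 0.4) {
  const [ag1, ag2] = purineByCodonPosition(seq);
  return { inFrame: ag1 >= ag1Min, membraneLike: ag2 < ag2Max };
}

// Count how many sequences fall into each of the four bins.
function tally(seqs) {
  const t = { membrane: 0, membraneInFrame: 0, other: 0, otherInFrame: 0 };
  for (const seq of seqs) {
    const { inFrame, membraneLike } = classify(seq);
    if (membraneLike) {
      t.membrane++;
      if (inFrame) t.membraneInFrame++;
    } else {
      t.other++;
      if (inFrame) t.otherInFrame++;
    }
  }
  return t;
}

// Toy sequences, not real pseudogenes.
console.log(tally(["GTTACTGCTATC", "TGATCAGGACAA"]));
```

With the real pseudogene set, `membraneInFrame / membrane` would give the in-frame fraction quoted above for the putative membrane group.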

Long story short: Most non-membrane-associated pseudogenes are out-of-frame (and likely dead), whereas 62% of putative membrane-associated pseudogenes appear to be in-frame, and therefore could still be functional (or at least, undead).

In looking at stop codons, I found that of the pseudogenes that still had stop codons, the average distance to the first stop codon is only 149 bases (whereas the average pseudogene length is 795 bases). Pseudogenes for putative membrane-associated proteins were shorter overall (as membrane proteins often are; 495 bases instead of 795), but the average distance to the first stop codon was 190 bases, significantly longer than for the other pseudogenes. This suggests some of them are still alive.
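One plausible way to measure "distance to the first stop codon" is to scan each sequence codon by codon from its first base. The sketch below assumes exactly that (frame 0, standard stop codons TAA, TAG, TGA); the function names are hypothetical and the original measurement may have been defined differently.

```javascript
const STOP_CODONS = new Set(["TAA", "TAG", "TGA"]);

// Offset in bases (0-based) of the first in-frame stop codon, or -1 if none.
function firstStopOffset(seq) {
  const s = seq.toUpperCase();
  for (let i = 0; i + 3 <= s.length; i += 3) {
    if (STOP_CODONS.has(s.slice(i, i + 3))) return i;
  }
  return -1;
}

// Mean offset over only those sequences that contain an in-frame stop.
function meanFirstStop(seqs) {
  const offsets = seqs.map(firstStopOffset).filter((d) => d >= 0);
  if (offsets.length === 0) return NaN;
  return offsets.reduce((a, b) => a + b, 0) / offsets.length;
}

// Toy sequences: stops at offsets 6 and 3, and one with no stop at all.
console.log(meanFirstStop(["ATGAAATGACCC", "ATGTAGAAA", "ATGAAACCC"])); // 4.5
```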

By now you're probably wondering how the heck a pseudogene can be of any possible use whatsoever when it contains a premature stop codon. The thing we need to ask, though, is why M. leprae tolerates (indeed conserves) so many pseudogenes in the first place. Could it be that the organism has adapted a frameshift-tolerant translation apparatus? Maybe some of the stop codons aren't really stop codons.

We know that a wide variety of organisms (not just viruses, where this phenomenon was first discovered, but bacteria and eukaryotes) have evolved special signals to tell ribosomes to shift in and out of frame by plus or minus one. (See "A Gripping Tale of Ribosomal Frameshifting: Extragenic Suppressors of Frameshift Mutations Spotlight P-Site Realignment," Atkins and Björk, Microbiol. Mol. Biol. Rev. 2009.) Certain tRNAs participate in "quadruplet codon" decoding, making it possible for special frameshift signals to work. The signals usually involve 7-base-long "slippery heptamer" sequences, such as CCCTGAC, right where a stop codon (TGA) appears. In other words, when a stop codon appears inside a slippery heptamer, it's not really a stop codon. Depending on the kinds (and amounts) of tRNAs "on duty," it can be a frameshift signal.

When I looked for CCCTGAC in M. leprae's pseudogenes, I found 16 in-frame occurrences of the sequence in 1116 pseudogenes. (Only 7 occurrences of the hexamer CCCTGA were found, in frame, in M. leprae's "normal" genes.) While this doesn't prove that M. leprae is up to any unusual translation tricks, it's a tantalizing result. Also bear in mind, if M. leprae is indeed up to some unusual tricks, it may very well be using frameshift signals other than (or in addition to) CCCTGAC. The fact that Mycobacterium species lack a MutS/MutL mismatch repair system means M. leprae may have adapted different ways of coping with "slippery repeats."
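A search like this one can be sketched as below. The text does not spell out what "in frame" means for the heptamer, so this sketch assumes it means the embedded TGA sits on a codon boundary when the gene is read from its first base. The function name and the `stopOffset` parameter are hypothetical.

```javascript
// Count motif occurrences whose embedded stop codon lands on a codon boundary.
// stopOffset is where the stop codon begins inside the motif
// (3 for CCCTGAC, because TGA starts at the motif's fourth base).
function countInFrameMotif(seq, motif = "CCCTGAC", stopOffset = 3) {
  const s = seq.toUpperCase();
  let count = 0;
  let idx = s.indexOf(motif);
  while (idx !== -1) {
    if ((idx + stopOffset) % 3 === 0) count++;
    idx = s.indexOf(motif, idx + 1);
  }
  return count;
}

// Toy sequences: in the first, TGA is read as a codon (ATG CCC TGA CAA);
// in the second, the same heptamer is present but out of frame.
console.log(countInFrameMotif("ATGCCCTGACAA")); // 1
console.log(countInFrameMotif("ATGACCCTGACA")); // 0
```

Passing `"CCCTGA"` as the motif would give the corresponding hexamer count.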

Further work will be needed to confirm whether M. leprae indeed translates some of its pseudogenes into proteins. The 32 "high likelihood" pseudogenes that, according to my analysis, might still encode functional (or at least expressed) membrane-associated proteins are shown in the table below. Leave a comment if you have additional thoughts.

M. leprae pseudogenes that have codons with overall AG1 > 55% and AG2 < 40%:

Pseudogene Possible product
MLBr00146 hypothetical protein
MLBr00189 hypothetical protein
MLBr00278 conserved hypothetical protein
MLBr00341 hypothetical protein
MLBr00460 hypothetical protein
MLBr00478 hypothetical protein
MLBr00738 PstA component of phosphate uptake
MLBr00836 hypothetical protein
MLBr00846 ABC transporter
MLBr01054 possible PPE-family protein
MLBr01156 hypothetical protein
MLBr01237 possible cytochrome P450
MLBr01238 probable cytochrome P450
MLBr01400 possible membrane protein
MLBr01414 PGRS-family protein
MLBr01474 hypothetical protein
MLBr01527 dihydrodipicolinate reductase
MLBr01673 conserved hypothetical protein
MLBr01792 probable Na+/H+ exchanger
MLBr01968 PE family protein
MLBr02003 probable ketoacyl reductase
MLBr02101 conserved hypothetical protein
MLBr02150 molybdopterin converting factor subunit 1
MLBr02190 PstA component of phosphate uptake
MLBr02216 dihydrolipoamide dehydrogenase
MLBr02363 19 kDa antigenic lipoprotein
MLBr02477 PE protein
MLBr02484 transcriptional regulator (LysR family)
MLBr02533 PE-family protein
MLBr02656 conserved hypothetical protein
MLBr02674 possible membrane protein

Wednesday, April 02, 2014

Frameshift errors in leprosy bacterium DNA

Shocking as it might sound, leprosy continues to strike over 200,000 persons per year worldwide, making it as much of a health problem as cholera or yellow fever. One of the oldest known infectious diseases, leprosy became the first disease to be causally linked to bacteria when Hansen made his famous discovery of the connection to Mycobacterium leprae in 1873. Ever since then, scientists have been trying to grow M. leprae in the lab, to no avail. Like most environmental isolates, M. leprae defies attempts at pure culture. The only way to grow it in the lab is to infect mice or armadillos, where it has a doubling time of 14 days, the longest known generation time of any bacterium.

Traditionally, it has been assumed that the difficulty in growing M. leprae in pure culture is due to the organism's complex nutritional requirements. (In humans, the organism is an obligate intracellular parasite that takes up residency in the Schwann cells of the peripheral nervous system.) There is no doubt considerable truth to this assumption, but the reason for the organism's fastidious nutritional requirements wasn't fully known until Cole et al. (2001) showed that half the bacterium's genome is inoperative and undergoing decay. Genomic sequencing revealed that M. leprae has only three quarters the DNA content of its (quite robust) cousin, M. tuberculosis, and of M. leprae's 3,000-or-so remaining genes, only 1,600 are fully functional. The rest are pseudogenes.

Pseudogenes are genes that have become inactivated through loss of start codons, loss of promoter regions, introduction of spurious stop codons, introduction of frameshift errors, or through other causes. Almost all organisms contain pseudogenes in their DNA. (Human DNA reportedly contains over 12,000 pseudogenes.) The leprosy bacterium, however, is unique in having approximately half its genome tied up in pseudogenes. Once a gene becomes a pseudogene, it is effectively useless baggage ("junk DNA") and continues on a long path of deterioration. Evolutionary theory predicts that such genes will eventually be lost from the genome, since the carrying cost of keeping them puts the organism at a disadvantage, energetically. But the curious thing about M. leprae is that it's a hoarder: It not only holds onto its useless genes, it actually transcribes upwards of 40% of them. In fact, a recent study of 1000-year-old M. leprae DNA (recovered from medieval skeletons), comparing the medieval version of the organism's genome with the genome of today's M. leprae, found that pseudogenes are highly conserved in the bacterium.

The fact that the bacterium actually transcribes many of its pseudogenes (and doesn't lose them over time) is striking, to say the least, and suggests that the transcription of certain genes or pseudogenes is resulting in mRNAs that silence other, more deleterious genes. It could be that M. leprae can't be grown in culture because when certain combinations of nutrients are presented to it, the nutrients up-regulate deleterious nonsense genes in otherwise-normal operons (or down-regulate important silencers), directly or indirectly. (Williams et al. found that many M. leprae pseudogenes are located in the middle of operons and are transcribed via fortuitous read-through.) Various scenarios are possible. Much work remains to be done.

In the meantime, I couldn't help doing a little desktop science to characterize M. leprae's "defective genes" problem further. I went to http://genomevolution.org/CoGe/OrganismView.pl and entered "Mycobacterium leprae Br4923" in the Organism Name field. In the Genome Information box, if you click the "Click for Features" link, you can see that 1604 genes are labeled "CDS" (meaning, these are the operative, non-defective genes) while a separate line item shows an utterly astounding 2233 genes as pseudogenes. (Addendum: The FASTA file at genomevolution.org contains duplicates. The actual pseudogene count, it turns out, is 1116, not 2233. But still, 1116 is a huge number of pseudogenes.) The "DNA Seqs" links on the right side of that page allow you to download the FASTA sequences for the respective gene groupings. These are simple text files containing the base sequences (A, T, G, and C) for the coding strands of the genes.
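Reading such a FASTA file and dropping duplicate entries can be sketched as follows. How the duplicates in the downloaded file are actually structured is not stated, so this sketch simply assumes that two records with identical sequences are duplicates. The names `parseFasta` and `dedupeBySequence` and the sample records are hypothetical.

```javascript
// Parse FASTA text into { header, seq } records.
function parseFasta(text) {
  const records = [];
  let current = null;
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (!line) continue;
    if (line.startsWith(">")) {
      current = { header: line.slice(1), seq: "" };
      records.push(current);
    } else if (current) {
      current.seq += line.toUpperCase();
    }
  }
  return records;
}

// Keep only the first record for each distinct sequence
// (assumption: identical sequence means duplicate entry).
function dedupeBySequence(records) {
  const seen = new Set();
  return records.filter((r) => {
    if (seen.has(r.seq)) return false;
    seen.add(r.seq);
    return true;
  });
}

// Made-up sample: the third record repeats the first one's sequence.
const sample = ">pg1\nATGC\nGGTA\n>pg2\nTTTT\n>pg1_copy\nATGCGGTA\n";
const all = parseFasta(sample);
console.log(all.length, dedupeBySequence(all).length); // 3 2
```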

I wrote a few lines of JavaScript to analyze the base compositions of the genes (and pseudogenes), and what I noticed immediately is that the base composition differs for the two groups:

Base   Content (Genes)   Content (Pseudogenes)
A      0.1938            0.2119
G      0.3116            0.2867
C      0.2890            0.2778
T      0.2046            0.2223

The G+C content for the "normal" genes averages 60.6%, whereas for the pseudogenes it's 55.4%. A typical G+C value for other members of the genus Mycobacterium is 65%. Thus, it's clear that not only the pseudogenes but the "normal" genes of M. leprae have drifted in the direction of more A+T. This has been noted before (by Cole et al. and others). What's perhaps less obvious is that purine content (A+G) has shifted from 50.5% in the normal genes to 49.8% in the pseudogenes. Bear in mind we're looking at data for one strand of DNA: the so-called coding or "message" strand.
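A base-composition tally of this kind takes only a few lines. The sketch below is not the original script: the function name is hypothetical, the input is toy data, and it pools all bases across the supplied sequences, which can differ slightly from averaging per-gene values.

```javascript
// Pooled base frequencies over a list of coding-strand sequences,
// plus the derived G+C and purine (A+G) fractions.
// Characters other than A, C, G, T (for example N) are skipped.
function baseComposition(seqs) {
  const counts = { A: 0, C: 0, G: 0, T: 0 };
  let total = 0;
  for (const seq of seqs) {
    for (const b of seq.toUpperCase()) {
      if (counts[b] !== undefined) {
        counts[b]++;
        total++;
      }
    }
  }
  const freq = {};
  for (const b of Object.keys(counts)) {
    freq[b] = total ? counts[b] / total : 0;
  }
  return { ...freq, gc: freq.G + freq.C, purine: freq.A + freq.G };
}

// Toy input: 3 A, 4 G, 3 C, 2 T out of 12 bases.
console.log(baseComposition(["AAGG", "CCTT", "GGCA"]));
```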

Clearly, there is a tendency for pseudogenes to "regress to the mean." But the shift in purine concentration is particularly interesting, because it indicates that purine usage in normal-gene coding regions is perhaps non-randomly elevated. The shift from 50.5% to 49.8% in A+G content may not seem particularly striking on its own, but the difference, it turns out, is highly significant. You can see why in the following graph.

Base composition of "normal" genes in M. leprae (total purines vs. G+C) by codon base position. (n=1604) Red dots are for base one, gold dots are for base two, blue dots are for base three (the "wobble" base). See text for discussion.

To make this graph, I looked at the DNA of the coding regions of "normal" genes and determined the average purine content as well as the G+C content for positions one, two, and three of all codons. As you can see, the purine content (relative to the G+C content) segregates non-randomly according to codon base position. The red dots represent base one, the gold (or brown) dots represent base two, and the blue dots represent base three (often called the "wobble" base, for historical reasons). Not unexpectedly, the greatest G+C shift occurs in base three (as is usually the case). What's perhaps more surprising is the clear preference for purines in base one. The red cluster centers at y = 0.6051 plus or minus 0.0467 (standard deviation). This means that on average, position one of a codon is occupied by a purine (A or G) over 60% of the time. This is actually quite typical of codons in most organisms. I've looked at over 1,300 bacterial species so far, and in all of them, purines accumulate at codon base one. (Maybe in a future post, I'll present more data to this effect.)

Base two segregates out as having a G+C content significantly below the organism's total-genome G+C content and centers on y = 0.4434 (median) plus or minus 0.0547 (SD).
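The data behind a graph like this can be sketched as follows: for every gene, compute purine content and G+C content at each codon position (one plotted point per gene per position), then summarize a cluster by its mean and standard deviation. The function names are hypothetical, the genes are toy data, each sequence is assumed to contain at least one full codon, and the sketch uses the sample standard deviation (the text does not say which form was used).

```javascript
// For one gene, return [{purine, gc}, {purine, gc}, {purine, gc}]
// for codon positions one, two, and three (frame 0 assumed,
// sequence assumed to hold at least one full codon).
function codonPositionStats(seq) {
  const s = seq.toUpperCase();
  const n = Math.floor(s.length / 3);
  const purine = [0, 0, 0];
  const gc = [0, 0, 0];
  for (let i = 0; i < n * 3; i++) {
    const b = s[i];
    if (b === "A" || b === "G") purine[i % 3]++;
    if (b === "G" || b === "C") gc[i % 3]++;
  }
  return [0, 1, 2].map((p) => ({ purine: purine[p] / n, gc: gc[p] / n }));
}

// Mean and sample standard deviation (needs at least two values).
function meanAndSd(values) {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((a, v) => a + (v - mean) ** 2, 0) / (values.length - 1);
  return { mean, sd: Math.sqrt(variance) };
}

// Toy genes; the "red cluster" is the base-one purine value of each gene.
const genes = ["ATGGCCGAA", "ATGCCCTTT"];
const baseOnePurines = genes.map((g) => codonPositionStats(g)[0].purine);
console.log(meanAndSd(baseOnePurines));
```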

Now compare the above graph with a similar graph for M. leprae's pseudogenes:

Base composition of M. leprae pseudogenes by codon position. (n=2233) Again, red dots are for base one, gold are for base two, blue are for base three. See text for discussion.

Here, it's evident that base compositions for all three codon positions overlap significantly. The fact that the codon positions are no longer clearly defined in their spatial representation on this graph is consistent with widespread frameshift mutations in the DNA, causing bases that would normally be in position one (or two or three) to be in some other position, randomly.
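The smearing effect of a frameshift is easy to reproduce on a toy sequence. The sketch below builds an artificial "gene" with an extreme positional bias (the codon GCT repeated), deletes a single base halfway along, and recomputes the purine fraction at each codon position. The sequence is contrived purely for illustration and the function name is hypothetical.

```javascript
// Purine (A/G) fraction at each codon position; assumes frame 0.
function purineByCodonPosition(seq) {
  const s = seq.toUpperCase();
  const nCodons = Math.floor(s.length / 3);
  const counts = [0, 0, 0];
  for (let i = 0; i < nCodons * 3; i++) {
    if (s[i] === "A" || s[i] === "G") counts[i % 3]++;
  }
  return counts.map((c) => (nCodons ? c / nCodons : 0));
}

// Artificial gene: every codon is GCT, so all purines sit at base one.
const intact = "GCT".repeat(10);

// Delete one base (index 15) to simulate a frameshift halfway along.
const shifted = intact.slice(0, 15) + intact.slice(16);

console.log(purineByCodonPosition(intact));  // [1, 0, 0]
console.log(purineByCodonPosition(shifted)); // roughly [0.56, 0, 0.44]
```

Downstream of the deletion, the purines that used to sit at base one are now read as base three, so the sharp separation between positions is lost.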

Hence we can say, with some confidence, on the basis of these graphs, that many (if not most) of the "junk genes" in M. leprae harbor frameshift mutations. The question of which came first (frameshift mutations, or silencing of genes followed by frameshifts) is still open. Either way, the graphs indicate that frameshifts are rampant in the M. leprae pseudogenome.

Exactly how or why M. leprae accumulated so many frameshift mutations (and then kept hoarding the mutated genes) is unknown. As I said earlier, much work remains to be done.

Note: Graphs were produced using the excellent service at ZunZun.com. Hand-editing of SVG graphs (before conversion to PNG) enabled easy modification of the data-point colors in a text editor. Data points were plotted with opacity = 0.30 so that areas of high overlap are more apparent visually (with the piling of data points on top of data points).

Bioinformaticists (and others!), feel free to leave a comment below.