Showing posts with label quadruplet decoding. Show all posts
Showing posts with label quadruplet decoding. Show all posts

Wednesday, April 09, 2014

Are dead genes still alive in the leprosy bacillus?

The genome of the leprosy bacterium (Mycobacterium leprae) stands as a remarkable example of DNA in an apparent state of massive, wholesale breakdown. Of the organism's 2720 genes, only 1604 appear to be functional, while 1116 are pseudogenes, which is to say genes that have been "turned off" and left for dead.

Genes can become pseudogenes in any number of ways, including loss of a start codon, loss of promoter regions (or degraded Shine Dalgarno signals), random insertions and deletions, mutations that cause spurious stop codons, and so on. Once a gene gets "turned off," assuming loss of the gene in question isn't fatal, the gene typically undergoes a period of degradation (leading to its eventual loss from the genome), but that's not exactly what we see in the leprosy bacterium. When leprosy germs from medieval skeletons were sampled and their genomes sequenced, researchers found that pseudogenes in M. leprae haven't changed very much in the past thousand years or so. Not only does M. leprae tend to hold onto its pseudogenes, it actively transcribes upwards of 40% of them. Probably not all of the transcripts result in expressed proteins (many lack a start codon!), but some no doubt do get translated into proteins. Let's put it this way: It would be extremely unusual for an organism to conserve this many pseudogenes if none of them was doing anything useful.

This view of a segment of the two genomes shows how a region of around 80,000 base pairs in M. tuberculosis maps to a similar 68,000-base-pair region of M. leprae. Notice that in the lowermost panel (representing M. leprae), many genes are shown as shrunken silver segments instead of fat green cylinders. The smaller grey/silver segments are pseudogenes. Click to enlarge.

To get a better idea of what's going on here, I downloaded the DNA sequences of M. leprae's 1604 "normal" genes as well as the 1116 pseudogenes. In analyzing the codons for these genes, I looked for signs of genes that were still in the normal reading frame. One way to detect this is by measuring the purine content at the various base positions in a gene's codons. In a typical protein-coding gene, around 60% of codons begin with A or G (adenine or guanine). This positional bias will, of course, be lost in a gene that has undergone frameshift mutations. Among M. leprae's 1116 pseudogenes, I found 269 in which codons showed an average AG1 percentage (A+G content, codon base one) of 55% or more. These are pseudogenes that appear to still be mostly "in frame."

Things get a lot more interesting where putative membrane proteins are concerned. In a previous post, I showed that in some genes, the second codon base is pyrimidine-rich (i.e., predominantly C or T: cytosine or thymine); these genes encode proteins with a high percentage of nonpolar amino acids. Bottom line, if a gene's codons are mostly T or C in the second position, that gene most likely encodes a membrane-associated protein. (See my previous post for some data.) This is true for all organisms (viruses, cells) and organellar genes, too, by the way, not just M. leprae. It's a generic feature of the genetic code.

When I segregated M. leprae pseudogenes according to whether or not the second codon base was (on average) less than, or more than, 40% purines, I stumbled onto something quite interesting. I found 51 pseudogenes with AG2 less than 40% (meaning, these are probably membrane-associated proteins). Of those, 32 (or 62%) are still "in frame," with AG1 > 55%. By contrast, the majority (78%) of non-membrane pseudogenes (AG2 > 40%) appear to be turned off, with an average AG1 of 51%.

Long story short: Most non-membrane-associated pseudogenes are out-of-frame (and likely dead), whereas 62% of putative membrane-associated pseudogenes appear to be in-frame, and therefore could still be functional (or at least, undead).

In looking at stop codons, I found that of the pseudogenes that still had stop codons, the average distance to the first stop codon is only 149 bases (whereas the average pseudogene length is 795 bases). Pseudogenes for putative membrane-associated proteins were shorter overall (as membrane proteins often are; 495 bases instead of 795), but the average distance to the first stop codon was 190 bases, significantly longer than for the other pseudogenes. This suggests some of them are still alive.

By now you're probably wondering how the heck a pseudogene can be of any possible use whatsoever when it contains a premature stop codon. The thing we need to ask, though, is why M. leprae tolerates (indeed conserves) so many pseudogenes in the first place. Could it be that the organism has adapted a frameshift-tolerant translation apparatus? Maybe some of the stop codons aren't really stop codons.

We know that a wide variety of organisms (not just viruses, where this phenomenon was first discovered, but bacteria and eukaryotes) have evolved special signals to tell ribosomes to shift in and out of frame by plus or minus one. (See "A Gripping Tale of Ribosomal Frameshifting: Extragenic Suppressors of Frameshift Mutations Spotlight P-Site Realignment," Atkins and Björk, Microbiol. Mol. Biol. Rev. 2009.) Certain tRNAs participate in "quadruplet codon" decoding, making it possible for special frameshift signals to work. The signals usually involve 7-base-long "slippery heptamer" sequences, such as CCCTGAC, right where a stop codon (TGA) appears. In other words, when a stop codon appears inside a slippery heptamer, it's not really a stop codon. Depending on the kinds (and amounts) of tRNAs "on duty," it can be a frameshift signal.

When I looked for CCCTGAC in M. leprae's pseudogenes, I found 16 in-frame occurrences of the sequence in 1116 pseudogenes. (Only 7 occurrences of the hexamer CCCTGA were found, in frame, in M. leprae's "normal" genes.) While this doesn't prove that M. leprae is up to any unusual translation tricks, it's a tantalizing result. Also bear in mind, if M. leprae is indeed up to some unusual tricks, it may very well be using frameshift signals other than (or in addition to) CCCTGAC. The fact that Mycobacterium species lack a MutS/MutL mismatch repair system means M. leprae may have adapted different ways of coping with "slippery repeats."

Further work will be needed to confirm whether M. leprae indeed translates some of its pseudogenes into proteins. The 32 "high likelihood" pseudogenes that, according to my analysis, might still encode functional (or at least expressed) membrane-associated proteins are shown in the table below. Leave a comment if you have additional thoughts.

M. leprae pseudogenes that have codons with overall AG1 > 55% and AG2 < 40%:

Pseudogene Possible product
MLBr00146 hypothetical protein
MLBr00189 hypothetical protein
MLBr00278 conserved hypothetical protein
MLBr00341 hypothetical protein
MLBr00460 hypothetical protein
MLBr00478 hypothetical protein
MLBr00738 PstA component of phosphate uptake
MLBr00836 hypothetical protein
MLBr00846 ABC transporter
MLBr01054 possible PPE-family protein
MLBr01156 hypothetical protein
MLBr01237 possible cytochrome P450
MLBr01238 probable cytochrome P450
MLBr01400 possible membrane protein
MLBr01414 PGRS-family protein
MLBr01474 hypothetical protein
MLBr01527 dihydrodipicolinate reductase
MLBr01673 conserved hypothetical protein
MLBr01792 probable Na+/H+ exchanger
MLBr01968 PE family protein
MLBr02003 probable ketoacyl reductase
MLBr02101 conserved hypothetical protein
MLBr02150 molybdopterin converting factor subunit 1
MLBr02190 PstA component of phosphate uptake
MLBr02216 dihydrolipoamide dehydrogenase
MLBr02363 19 kDa antigenic lipoprotein
MLBr02477 PE protein
MLBr02484 transcriptional regulator (LysR family)
MLBr02533 PE-family protein
MLBr02656 conserved hypothetical protein
MLBr02674 possible membrane protein

Tuesday, April 08, 2014

Why do so many codons begin with a purine?

With the advent of sites like genomevolution.org (where you can download genomes, create synteny graphs, run BLAST searches, and do all sorts of desktop bioinformatics), it's ridiculously easy for someone interested in comparative genomics to . . . well, compare genomes, for one thing. And if you look at enough gene sequences, a couple of things pop out.

One thing that pops out is that most codons, in most genes, begin with a purine (namely A or G: adenine or guanine). Also, codons typically show the greatest GC swing in base number three. These trends can be seen in the chart below, where I show average base composition (by codon position) for three well-studied organisms. For clarity, base-one purines are shown in bold and base-three G and C are shown highlighted in yellow.

Organism
Codon base
A
G
C
T
S. griseus
1
0.166 0.434 0.287 0.112
2
0.224 0.219 0.295 0.261
3
0.037 0.394 0.530 0.038
E. coli
1
0.256 0.343 0.238 0.161
2
0.291 0.181 0.222 0.304
3
0.186 0.285 0.261 0.265
C. botulinum
1
0.395 0.299 0.094 0.210
2
0.374 0.136 0.161 0.328
3
0.442 0.108 0.064 0.383

Streptomyces griseus is a common soil bacterium that happens to have very high genomic G+C content (72.1% overall, although you can see that in base three of codons the G+C content is more like 92%).

E. coli represents a middle-of-the-road organism in terms of G+C content (50.8% overall), while our ugly friend Clostridium botulinum (the soil organism that can ruin your whole day if it finds its way into a can of tomatoes) has very low genomic G+C content (around 28%).

Even though these organisms differ greatly in G+C content, they all illustrate the (universal) trend toward usage of purines (A or G) in the first position of a codon. Something like 59% to 69% of the time (depending on the organism), codons look like Pnn, where P is a purine base and 'n' represents any base. This is true for viruses as well as cellular genomes.

This pattern is so universal, one wonders why it exists. I think a credible, parsimonious explanation is that when protein-coding genes look like PnnPnnPnn... (etc.) it makes for a crisp reading frame. It's easy to see that a +1 frameshift results in a repeating nPn pattern and a +2 frameshift results in repeats of nnP. These are easily distinguished from Pnn.

There are benefits for a PnnPnnPnn... reading frame. In a previous post, I showed that when most of a gene's codons have a pyrimidine in base two, the resulting protein gets shipped to the cell membrane. (This is a simple consequence of the fact that codons with a pyrimidine in position two tend to code for hydrophobic, lipid-soluble amino acids.) Because a +1 reading-frame shift produces repeats of nPn, the Pnn "default" pattern means that +1 frameshifted gene products, if they occur, won't get shipped to the cell membrane. This is an extremely important outcome, because membrane proteins are, in general, highly transcribed and under strong selective pressure. In addition to specifying antigenic properties and determining phage resistance, membrane proteins make up proton pumps, secretion systems, symporters, kinases, flagellar components, and many other kinds of proteins. They determine the cell's "interface" to the world. They also maintain cell osmolarity and membrane redox potential. Messing with membrane proteins is bound to be risky. Much better to keep frameshifted nonsense proteins away from the membrane.

Fairly strong support for this notion (that Pnn codons provide a crisp reading frame) comes from studies of naturally occurring frameshift signals in DNA. We now know that in many organisms, certain "slippery" DNA signals (usually heptamers, like CCCTGAC) instruct the ribosome to change reading frames. (See, for example, "A Gripping Tale of Ribosomal Frameshifting: Extragenic Suppressors of Frameshift Mutations Spotlight P-Site Realignment," Atkins and Björk, Microbiol. Mol. Biol. Rev. 2009. Also, for fun, be sure to check out some of the papers on quadruplet decoding, which leaves room for alien life forms with 200 amino acids instead of 20.) The "slippery heptamer" frameshift signals that have thus far been identified tend to contain runs of pyrimidines.

Also tending to support the "Pnn = crisp reading frame" notion is the fact that stop codons (TGA, TAA, TAG) look like pPP (where 'p' is a pyrimidine and 'P' is a purine). Again, a crisp distinction.

As for why purines were chosen (and not pyrimidines) to begin the Xxx pattern, again I think a fairly parsimonious answer is available: ATP and GTP are the most abundant nucleoside-triphosphates in vivo. These are the energy sources for nucleic-acid and protein synthesis, respectively.

A prediction: If we run into an alien life form (in the oceans of Europa, say) and it turns out to be the case that UTP (instead of ATP) is the "universal energy molecule" in that life form's cells, then that life form's codons will probably begin with U and form Unn triplets (or Unnn quadruplets, perhaps) a higher-than-average percentage of the time.