Showing posts with label base composition. Show all posts
Showing posts with label base composition. Show all posts

Sunday, April 13, 2014

Whooping Cough Genomics

Pertussis, also known as whooping cough, is a highly contagious respiratory infection caused by Bordetella pertussis, a small aerobic bacterium that secretes numerous toxins capable of disrupting a normal immune response. The disease is rarely fatal but leaves victims with a nasty cough that can last weeks. In 2012, in the U.S., some 48,277 cases of pertussis were reported to the CDC. Of those cases, only 20 were fatal. By contrast, 28 Americans were killed by lightning the same year.

Bordetella pertussis
Unlike tuberculosis (which has been with us for 3 million years), Bordetella shows evidence of being a fairly new (and still rapidly evolving) pathogen, although in this case "fairly new" could still mean 700,000 years.

The complete DNA sequence of B. pertussis has been available for several years. It shows a moderate-size genome (of 4 million base pairs) encoding 3,447 genes, with a substantial number (360) of pseudogenes. The latter represent genes that have (by one means or another) been inactivated, whether through the appearance of premature stop codons in the gene, loss of a promoter region, random deletions, or what have you.

What makes Bordetella's pseudogenes interesting is that they're in remarkably good shape, as pseudogenes go. Usually, once a gene gets inactivated (goes pseudo), it begins to accumulate random point mutations, deletions, insertions, etc. at a substantial rate. In other words it deteriorates, since (supposedly) it's no longer under selection pressure. But when Australian researchers looked at 358 pseudogenes in B. pertussis Tohama I strain, they were shocked to find that the rate of nucleotide polymorphisms (i.e., changes to individual base-pairs in the DNA) was actually lower in pseudogenes than in regular genes (4.7E-5 per site versus 5.1). That's exactly the opposite of what's expected. The researchers commented, somewhat laconically: "This suggests that most pseudogenes in B. pertussis were formed in the recent past and are yet to accumulate more mutations than functional genes."

What other explanation is there? Well, the most obvious alternative explanation is that the genes are still under selection pressure, even though they're turned off. How can that be? I can think of any number of scenarios; perhaps that'll be a future blog post. Suffice it for now to say, ribosomes are not totally unforgiving of missing stop codons (read up on tmRNA) nor are they unforgiving, in all cases, of frameshifts (read about programmed frameshifts), and if an open reading frame should appear on a pseudogene's antisense strand, you now have an RNA silencer (potentially) for the remaining good copy or copies of the gene, with attendant gene-modulation possibilities.

It's worth pointing out that pseudogenes in M. leprae (the leprosy bacterium) are not only conserved and ancient but continue to show strong homology to working orthologues in M. tuberculosis (and even more distantly related organisms such as Gordonia, Corynebacterium, and Nocardia) after millions of years. More of which, in a later post.

For now, I thought it might be worth looking at the base composition of B. pertussis pseudogenes to see if they're riddled with frameshift errors (as is the case with M. leprae's pseudogenes). When I analyzed all 1,125,521 codons for all normal (not pseudo) genes in B. pertussis Tohama I strain, the resulting "paintball diagram" of base composition came up looking like this:
Paintball diagram for normal genes in B. pertussis Tohama I (click to enlarge). Red dots are for codon base one, gold represents the composition at codon base two, blue is "wobble" (third) base composition. Every dot represents statistics for one gene (n=3447). See text for discussion.

Here, we're looking at purine (A+G) content versus G+C content for each base position in the codons. Every dot represents a gene's worth of data. Not unexpectedly, the most extreme G+C values occur in the third ("wobble") base. Codon base one (red dots) is purine-rich, centering on y=0.58. This is typical of most codons in most genes, in most organisms. Notice the "breakaway cloud" of gold points underneath the main gold cloud (at y<0.4). These points represent genes in which the second codon base is mostly a pyrimidine (C or T). Codons with a pyrimidine in base two tend to code for nonpolar amino acids. Thus, the breakaway cloud of gold points represents membrane-associated proteins. In this case, we're looking at about 558 genes falling in that category.

Now look at the paintball diagram for the organism's 360 pseudogenes:

Base composition for "codons" in 360 pseudogenes of B. pertussis Tohama I. (Click to enlarge.) In this graph, as in the one above, dots are rendered with an opacity of 60% (so that overlapping points are less likely to obscure each other). See text for discussion.

In this case, there's a considerable amount of random statistical splay, but some of that is due simply to the fact that pseudogenes are a good deal shorter than normal genes, giving rise to more noise in the signal. (In this case, the average length of a pseudogene is 482 bases, vs. 982 for the 3,447 "normal" genes.) Even with considerable noise, though, it's apparent that the dot clusters tend to center on different parts of the graph, corresponding to the expected locations for normal genes. (Contrast this with the situation in M. leprae, where pseudogenes are riddled with frameshifts, rendering the concept of "codon base position" moot. Refer to the second paintball graph on this page.) Thus, we can say with some confidence that frameshifts are not so rampant in B. pertussis pseudogenes as to have rendered the concept of codons irrelevant. In fact, compared to M. leprae, pseudogenes in B. pertussis are comparatively unaffected by frameshifts. This tends to support the view of the Australian researchers (mentioned earlier) that pseudogenes in B. pertussis have not had enough time to accumulate very many mutations. But it can also be hypothesized that B. pertussis has had plenty of time (700,000 years, in fact) in which to accumulate mutations in its pseudogenes, yet has not done so. The evidence suggests that if anything, Bordetella repairs pseudogenes even more faithfully than regular genes.

At this point it might be relevant to interject that while M. leprae (like other members of the Mycobacteria) lacks the MutS/MutL mismatch repair system, Bordetella does, in fact, have a MutS/MutL mismatch repair system, and this may explain the relative paucity of frameshift errors in Bordetella pseudogenes. But it also implies (rather queerly) that Bordetella goes out of its way to repair its pseudogenes.

Interestingly, 234 out of 360 pseudogenes have a AG1 (purine, base one) content greater than 55%, which means they're probably still "in frame." Of these 234, some 69 (30%) have AG2 less than 40%, meaning they're most likely genes for membrane-associated proteins. If we look at the 2,456 normal genes that have AG1 greater than 55%, only 398 (16%) are putative membrane-associated proteins (with AG2 less than 40%). Bottom line: Pseudogenes for putative membrane-associated proteins are twice as likely to still be in-frame. While this could be a statistical fluke, it could also be that membrane proteins are somehow "spared" preferentially when it comes to leaving pseudogenes translatable. To put it differently: Pseudogenes for non-membrane-associated proteins are less likely to remain in-frame. This makes sense, in that much of Bordetella's pathogenicity can be ascribed to proteins that make up cell-surface antigens or that transport toxins to the outside world. Some of the toxic surface proteins may, in fact, be nonsense (or partial-nonsense) proteins—products of pseudogenes.

Sunday, April 06, 2014

A binary signal in the second codon base

In looking at base composition statistics of codons, an amazing fact jumps out.

If you look carefully at the following graph, you can see that the cloud of gold-colored data points (representing the compositional stats for the second base in codons of Clostridium botulinum) has a second, "breakaway" cloud underneath it. (See arrow.)

Codon base composition statistics for Clostridium botulinum. Notice the breakaway cloud of gold points under the main cloud (arrow). These points represent genes in which most codons have a pyrimidine in the middle base of each codon.

To review: I made this plot by going through each DNA sequence for each coding ("CDS") gene of C. botulinum, and for each gene, I went through all codons and calculated the average purine content (as well as the average G+C content) of the bases at a positions one, two, and three in the codons. Thus, every dot represents the stats for one gene's worth of data.

After looking at graphs of this sort, three key facts about codon bases leap out:
  • Most codons, for most genes, have a purine as the first base (notice how the red cloud of points is higher than the others, centering on y=0.7).
  • The third base (often called the "wobble" base; shown blue here) has the most extreme G+C value. (This is well known.)
  • The middle base falls in one of two positions (high or low) on the purine (y) axis. There's a primary cloud of data points and a secondary cloud in a distinct region below the main cloud. The secondary cloud of gold points is centered at about 0.3 on the y-axis, meaning these are genes in which the second codon base tends to be a pyrimidine
The question is: What does it mean when you look at a gene with 200 or 300 or 400 codons, and the majority of codons have a pyrimidine in the second base?

If you examine the standard codon translation table (below), you can see that codons with a pyrimidine in the second position (represented by the first two columns of the table) code primarily for nonpolar amino acids. When a pyrimidine is in the second base, the possible amino acids are phenylalanine, serine, leucine, proline, isoleucine, methionine, threonine, valine, and alanine. Of these, all but serine and threonine are nonpolar. Therefore, a pyrimidine in position two of a codon means there's at least a 75% chance that the amino acid will have a nonpolar, hydrophobic side group.


Virtually all proteins contain some nonpolar amino acids, but when a protein contains mostly nonpolar amino acids, that protein is destined to wind up in the cell membrane (which is largely made up of lipids). Thus, we can expect to find that genes in the "breakaway" cloud of gold points in the graph further above represent membrane-associated proteins.

To check this hypothesis, I wrote a script that harvested the gene names, and protein-product names, of all the "breakaway cloud" data points. After purging genes annotated (unhelpfully) as "hypothetical protein," I was left with 37 "known" genes. They're shown in the following table.

Gene Product
CLC_0571 arsenical pump family protein
CLC_1058 L-lactate permease
CLC_1550 carbohydrate ABC transporter permease
CLC_3115 xanthine/uracil permease family protein
CLC_0813 arsenical-resistance protein
CLC_1018 putative anion ABC transporter, permease protein
CLC_1687 xanthine/uracil permease family protein
CLC_3633 sporulation integral membrane protein YtvI
CLC_2382 phosphate ABC transporter, permease protein PstA
CLC_0189 ZIP transporter family protein
CLC_3351 sodium:dicarboxylate symporter family protein
CLC_0971 cobalt transport protein CbiM
CLC_1534 methionine ABC transporter permease
CLC_2798 xanthine/uracil permease family protein
CLC_1397 manganese/zinc/iron chelate ABC transporter permease
CLC_1836 stage III sporulation protein AD
CLC_0528 high-affinity branched-chain amino acid ABC transporter, permease protein
CLC_0430 electron transport complex, RnfABCDGE type, A subunit
CLC_2523 flagellar biosynthesis protein FliP
CLC_0401 amino acid permease family protein
CLC_0383 lrgB-like family protein
CLC_0457 chromate transporter protein
CLC_0291 sodium:dicarboxylate symporter family protein
CLC_0427 electron transport complex, RnfABCDGE type, D subunit
CLC_1281 putative transcriptional regulator
CLC_2008 ABC transporter, permease protein
CLC_0868 branched-chain amino acid transport system II carrier protein
CLC_1237 monovalent cation:proton antiporter-2 (CPA2) family protein
CLC_1137 methionine ABC transporter permease
CLC_0764 putative drug resistance ABC-2 type transporter, permease protein
CLC_1953 xanthine/uracil permease family protein
CLC_2444 auxin efflux carrier family protein
CLC_0897 putative ABC transporter, permease protein
CLC_1555 C4-dicarboxylate transporter/malic acid transport protein
CLC_0374 xanthine/uracil permease family protein
CLC_0470 undecaprenyl pyrophosphate phosphatase
CLC_2648 monovalent cation:proton antiporter-2 (CPA2) family protein

Notice that with the exception of CLC_1281, a "putative transcriptional regulator," every gene product represents a membrane-associated protein: transporters, carrier proteins, permeases, etc.

I ran the same experiment on genes from Streptomyces griseus (strain XylbKG-1) and came up with 222 genes having high pyrimidine content in base two. All 222 genes specify membrane-associated proteins. (The full list is in a table below.)

The bottom line: Base two of codons acts as a binary switch. If the base is a pyrimidine, the associated amino acid will most likely (75% chance) be nonpolar. If the base is a purine, the codon will either be a stop codon (3 out of 32 codons) or the amino acid will be polar (26 out of 29 codons).

Here's the list of 222 genes from S. griseus in which the middle codon base is predominantly a pyrimidine:

SACT1_0608 ABC-type transporter, integral membrane subunit
SACT1_3730 major facilitator superfamily MFS_1
SACT1_4066 cation efflux protein
SACT1_5911 ABC-2 type transporter
SACT1_6577 SNARE associated protein
SACT1_6966 ABC-type transporter, integral membrane subunit
SACT1_7160 major facilitator superfamily MFS_1
SACT1_3151 NADH-ubiquinone/plastoquinone oxidoreductase chain 6
SACT1_3682 drug resistance transporter, EmrB/QacA subfamily
SACT1_5431 Citrate transporter
SACT1_3199 proton-translocating NADH-quinone oxidoreductase, chain M
SACT1_7301 putative integral membrane protein
SACT1_3198 NAD(P)H-quinone oxidoreductase subunit 2
SACT1_2008 arsenical-resistance protein
SACT1_3149 proton-translocating NADH-quinone oxidoreductase, chain L
SACT1_3967 MATE efflux family protein
SACT1_3148 proton-translocating NADH-quinone oxidoreductase, chain M
SACT1_5571 major facilitator superfamily MFS_1
SACT1_0651 major facilitator superfamily MFS_1
SACT1_2669 ABC-type transporter, integral membrane subunit
SACT1_1805 NADH dehydrogenase (quinone)
SACT1_3147 NAD(P)H-quinone oxidoreductase subunit 2
SACT1_6961 major facilitator superfamily MFS_1
SACT1_0992 ABC-2 type transporter
SACT1_2619 major facilitator superfamily MFS_1
SACT1_0507 major facilitator superfamily MFS_1
SACT1_0649 ABC-type transporter, integral membrane subunit
SACT1_0800 glycosyl transferase family 4
SACT1_1659 ABC-type transporter, integral membrane subunit
SACT1_1803 multiple resistance and pH regulation protein F
SACT1_4190 putative ABC transporter permease protein
SACT1_5522 drug resistance transporter, EmrB/QacA subfamily
SACT1_5568 drug resistance transporter, EmrB/QacA subfamily
SACT1_7248 Lysine exporter protein (LYSE/YGGA)
SACT1_0266 ABC-2 type transporter
SACT1_0847 Na+/solute symporter
SACT1_4378 ABC-type transporter, integral membrane subunit
SACT1_6766 ABC-type transporter, integral membrane subunit
SACT1_2522 putative integral membrane protein
SACT1_4762 amino acid permease-associated region
SACT1_4901 major facilitator superfamily MFS_1
SACT1_2616 multiple antibiotic resistance (MarC)-related protein
SACT1_3961 major facilitator superfamily MFS_1
SACT1_6236 MIP family channel protein
SACT1_1319 protein of unknown function UPF0016
SACT1_2332 copper resistance D domain protein
SACT1_5327 ABC-type transporter, integral membrane subunit
SACT1_5759 ABC-2 type transporter
SACT1_1133 ABC-type transporter, integral membrane subunit
SACT1_1562 ABC-type transporter, integral membrane subunit
SACT1_5518 major facilitator superfamily MFS_1
SACT1_3430 ABC-type transporter, integral membrane subunit
SACT1_4517 major facilitator superfamily MFS_1
SACT1_4565 drug resistance transporter, EmrB/QacA subfamily
SACT1_4994 ABC-type transporter, integral membrane subunit
SACT1_7197 major facilitator superfamily MFS_1
SACT1_5949 major facilitator superfamily MFS_1
SACT1_6233 protein of unknown function DUF6 transmembrane
SACT1_0936 ABC-type transporter, integral membrane subunit
SACT1_4993 2-aminoethylphosphonate ABC transporter, permease protein
SACT1_6954 ABC-type transporter, integral membrane subunit
SACT1_1846 Lysine exporter protein (LYSE/YGGA)
SACT1_3429 ABC-type transporter, integral membrane subunit
SACT1_3957 membrane protein of unknown function
SACT1_4612 ABC-2 type transporter
SACT1_1998 polar amino acid ABC transporter, inner membrane subunit
SACT1_2093 ABC-type transporter, integral membrane subunit
SACT1_4420 ABC-2 type transporter
SACT1_6613 putative ABC transporter permease protein
SACT1_2564 small multidrug resistance protein
SACT1_3669 major facilitator superfamily MFS_1
SACT1_4186 major facilitator superfamily MFS_1
SACT1_4850 ABC-type transporter, integral membrane subunit
SACT1_0206 major facilitator superfamily MFS_1
SACT1_5418 Lysine exporter protein (LYSE/YGGA)
SACT1_0548 ABC-type transporter, integral membrane subunit
SACT1_3332 protein of unknown function DUF6 transmembrane
SACT1_3764 sodium/hydrogen exchanger
SACT1_6278 protein of unknown function DUF6 transmembrane
SACT1_4143 ABC-2 type transporter
SACT1_3232 ABC-type transporter, integral membrane subunit
SACT1_0256 ABC-type transporter, integral membrane subunit
SACT1_2898 major facilitator superfamily MFS_1
SACT1_6510 protein of unknown function DUF6 transmembrane
SACT1_0980 CrcB-like protein
SACT1_1650 ABC-type transporter, integral membrane subunit
SACT1_4658 acyltransferase 3
SACT1_0306 protein of unknown function DUF6 transmembrane
SACT1_2372 major facilitator superfamily MFS_1
SACT1_7238 ABC-type transporter, integral membrane subunit
SACT1_0202 major facilitator superfamily MFS_1
SACT1_0591 ABC-type transporter, integral membrane subunit
SACT1_5369 protein of unknown function DUF81
SACT1_6227 C4-dicarboxylate transporter/malic acid transport protein
SACT1_6755 ABC-type transporter, integral membrane subunit
SACT1_0978 Urea transporter
SACT1_2418 ATP synthase subunit a
SACT1_5604 major facilitator superfamily MFS_1
SACT1_4891 ABC-type transporter, integral membrane subunit
SACT1_6507 protein of unknown function DUF6 transmembrane
SACT1_6754 ABC-type transporter, integral membrane subunit
SACT1_0787 ABC-2 type transporter
SACT1_2848 sodium/hydrogen exchanger
SACT1_0636 ABC-type transporter, integral membrane subunit
SACT1_1891 putative integral membrane protein
SACT1_1552 xanthine permease
SACT1_2894 putative secreted protein
SACT1_4508 sodium:dicarboxylate symporter
SACT1_7091 drug resistance transporter, EmrB/QacA subfamily
SACT1_2652 major facilitator superfamily MFS_1
SACT1_1741 amino acid permease-associated region
SACT1_1838 ABC-type transporter, integral membrane subunit
SACT1_2796 gluconate transporter
SACT1_5220 sodium/hydrogen exchanger
SACT1_6991 ABC-type transporter, integral membrane subunit
SACT1_3273 protein of unknown function DUF894 DitE
SACT1_7089 ABC-type transporter, integral membrane subunit
SACT1_7280 major facilitator superfamily MFS_1
SACT1_3467 major facilitator superfamily MFS_1
SACT1_1304 ABC-type transporter, integral membrane subunit
SACT1_6032 protein of unknown function DUF81
SACT1_4312 major facilitator superfamily MFS_1
SACT1_0876 ABC-type transporter, integral membrane subunit
SACT1_6123 citrate/H+ symporter, CitMHS family
SACT1_4359 Cl- channel voltage-gated family protein
SACT1_7325 branched-chain amino acid transport
SACT1_1160 protein of unknown function DUF140
SACT1_2265 Arsenical pump membrane protein
SACT1_3512 ABC-2 type transporter
SACT1_1018 major facilitator superfamily MFS_1
SACT1_3415 Xanthine/uracil/vitamin C permease
SACT1_5214 BioY protein
SACT1_3656 small multidrug resistance protein
SACT1_3895 SpdD2 protein
SACT1_4929 ABC-type transporter, integral membrane subunit
SACT1_3029 major facilitator superfamily MFS_1
SACT1_6312 ABC-type transporter, integral membrane subunit
SACT1_0919 L-lactate transport
SACT1_4356 ABC-2 type transporter
SACT1_0532 ABC-type transporter, integral membrane subunit
SACT1_6693 secretion protein snm4
SACT1_0967 ABC-type transporter, integral membrane subunit
SACT1_6496 major facilitator superfamily MFS_1
SACT1_6983 major facilitator superfamily permease
SACT1_0917 NADH-ubiquinone/plastoquinone oxidoreductase chain 3
SACT1_6887 major facilitator superfamily MFS_1
SACT1_2835 major facilitator superfamily MFS_1
SACT1_5544 drug resistance transporter, Bcr/CflA subfamily
SACT1_5591 ABC-type transporter, integral membrane subunit
SACT1_1201 ABC-type transporter, integral membrane subunit
SACT1_2404 ABC-2 type transporter
SACT1_0870 protein of unknown function DUF803
SACT1_6933 ABC-type transporter, integral membrane subunit
SACT1_1776 ABC-type transporter, integral membrane subunit
SACT1_3213 major facilitator superfamily MFS_1
SACT1_4210 phosphate ABC transporter, inner membrane subunit PstC
SACT1_5398 protein of unknown function DUF81
SACT1_0914 NADH-ubiquinone/plastoquinone oxidoreductase chain 6
SACT1_3261 major facilitator superfamily MFS_1
SACT1_6932 ABC-type transporter, integral membrane subunit
SACT1_0669 major facilitator superfamily MFS_1
SACT1_4255 ABC-2 type transporter
SACT1_4541 major facilitator superfamily MFS_1
SACT1_4638 major facilitator superfamily MFS_1
SACT1_0913 NADH-ubiquinone oxidoreductase chain 4L
SACT1_4443 protein of unknown function DUF81
SACT1_5396 Xanthine/uracil/vitamin C permease
SACT1_0912 NADH dehydrogenase (quinone)
SACT1_2924 Lysine exporter protein (LYSE/YGGA)
SACT1_5922 putative integral membrane protein
SACT1_1243 Bile acid:sodium symporter
SACT1_6967 ABC-type transporter, integral membrane subunit
SACT1_0911 proton-translocating NADH-quinone oxidoreductase, chain M
SACT1_3931 putative ABC transporter permease protein
SACT1_2820 ABC-2 type transporter
SACT1_3298 putative integral membrane transport protein
SACT1_4871 major facilitator superfamily MFS_1
SACT1_5873 major facilitator superfamily MFS_1
SACT1_6636 putative integral membrane protein
SACT1_0905 AbgT transporter
SACT1_5532 2-dehydro-3-deoxyphosphogluconate aldolase/4-hydroxy-2-oxoglutarate aldolase
SACT1_0910 proton-translocating NADH-quinone oxidoreductase, chain N
SACT1_2115 ABC-type transporter, integral membrane subunit
SACT1_4635 protein of unknown function DUF6 transmembrane
SACT1_5777 major facilitator superfamily MFS_1
SACT1_3979 major facilitator superfamily MFS_1
SACT1_5536 major facilitator superfamily MFS_1
SACT1_6782 major facilitator superfamily MFS_1
SACT1_0616 virulence factor MVIN family protein
SACT1_4869 ABC-type transporter, integral membrane subunit
SACT1_1581 putative integral membrane protein
SACT1_4585 major facilitator superfamily MFS_1
SACT1_6536 small multidrug resistance protein
SACT1_4024 cell cycle protein
SACT1_5296 major facilitator superfamily MFS_1
SACT1_1865 major facilitator superfamily MFS_1
SACT1_4868 ABC-type transporter, integral membrane subunit
SACT1_0955 protein of unknown function DUF803
SACT1_4296 major facilitator superfamily MFS_1
SACT1_5104 major facilitator superfamily MFS_1
SACT1_0519 Bile acid:sodium symporter
SACT1_2394 putative ABC transporter permease protein
SACT1_0661 major facilitator superfamily MFS_1
SACT1_2062 major facilitator superfamily MFS_1
SACT1_4295 ABC-type transporter, integral membrane subunit
SACT1_6828 peptidase M48 Ste24p
SACT1_3446 major facilitator superfamily MFS_1
SACT1_6631 Lysine exporter protein (LYSE/YGGA)
SACT1_1048 ABC-type transporter, integral membrane subunit
SACT1_1528 protein of unknown function DUF6 transmembrane
SACT1_2016 branched-chain amino acid transport
SACT1_6154 gluconate transporter
SACT1_5051 major facilitator superfamily MFS_1
SACT1_5531 protein of unknown function DUF81
SACT1_6480 protein of unknown function DUF1290
SACT1_0373 ABC-type transporter, integral membrane subunit
SACT1_2392 putative ABC transporter permease protein
SACT1_2724 protein of unknown function DUF107
SACT1_7257 protein of unknown function UPF0118
SACT1_2772 putative integral membrane protein
SACT1_3201 NAD(P)H-quinone oxidoreductase subunit 4L
SACT1_7066 ABC-type transporter, integral membrane subunit

Some of these genes are labeled "protein of unknown function," but I think we can predict with high confidence, based on what we know about these proteins (namely, that they're hydrophobic) that the gene products in question involve membrane-associated functions.

Bioinformatics geeks, leave a comment below.

Wednesday, April 02, 2014

Frameshift errors in leprosy bacterium DNA

Shocking as it might sound, leprosy continues to strike over 200,000 persons per year worldwide, making it as much of a health problem as cholera or yellow fever. One of the oldest known infectious diseases, leprosy became the first disease to be causally linked to bacteria when Hansen made his famous discovery of the connection to Mycobacterium leprae in 1873. Ever since then, scientists have been trying to grow M leprae in the lab, to no avail. Like most environmental isolates, M. leprae defies attempts at pure culture. The only way to grow it in the lab is to infect mice or armadillos, where it has a doubling time of 14 days, the longest known generation time of any bacterium.

Traditionally, it has been assumed that the difficulty in growing M. leprae in pure culture is due to the organism's complex nutritional requirements. (In humans, the organism is an obligate intracellular parasite that takes up residency in the Schwann cells of the peripheral nervous system.) There is no doubt considerable truth to this assumption, but the reason for the organism's fastidious nutritional requirements wasn't fully known until Cole et al. (2001) showed that half the bacterium's genome is inoperative and undergoing decay. Genomic sequencing revealed that M. leprae has only three quarters the DNA content of its (quite robust) cousin, M. tuberculosis, and of M. leprae's 3,000-or-so remaining genes, only 1,600 are fully functional. The rest are pseudogenes.

Pseudogenes are genes that have become inactivated through loss of start codons, loss of promoter regions, introduction of spurious stop codons, introduction of frameshift errors, or through other causes. Almost all organisms contain pseudogenes in their DNA. (Human DNA reportedly contains over 12,000 pseudogenes.) The leprosy bacterium, however, is unique in having approximately half its genome tied up in pseudogenes. Once a gene becomes a pseudogene, it is effectively useless baggage ("junk DNA") and continues on a long path of deterioration. Evolutionary theory predicts that such genes will eventually be lost from the genome, since the carrying cost of keeping them puts the organism at a disadvantage, energetically. But the curious thing about M. leprae is that it's a hoarder: It not only holds onto its useless genes, it actually transcribes upwards of 40% of them. In fact, a recent study of 1000-year-old M. leprae DNA (recovered from medieval skeletons), comparing the medieval version of the organism's genome with the genome of today's M. leprae, found that pseudogenes are highly conserved in the bacterium.

The fact that the bacterium actually transcribes many of its pseudogenes (and doesn't lose them over time) is striking, to say the least, and suggests that the transcription of certain genes or pseudogenes is resulting in mRNAs that silence other, more deleterious genes.  It could be that M. leprae can't be grown in culture because when certain combinations of nutrients are presented to it, the nutrients up-regulate deleterious nonsense genes in otherwise-normal operons (or down-regulate important silencers), directly or indirectly. (Williams et al. found that many M. leprae pseudogenes are located in the middle of operons and are transcribed via fortuitous read-through.) Various scenarios are possible. Much work remains to be done.

In the meantime, I couldn't help doing a little desktop science to characterize M. leprae's "defective genes" problem further. I went to http://genomevolution.org/CoGe/OrganismView.pl and entered "Mycobacterium leprae Br4923" in the Organism Name field. In the Genome Information box, if you click the "Click for Features" link, you can see that 1604 genes are labeled "CDS" (meaning, these are the operative, non-defective genes) while a separate line item shows an utterly astounding 2233 genes as pseudogenes. (Addendum: The FASTA file at genomevolution.org contains duplicates. The actual pseudogene count, it turns out, is 1116, not 2233. But still, 1116 is a huge number of pseudogenes.) The "DNA Seqs" links on the right side of that page allow you to download the FASTA sequences for the respective gene groupings. These are simple text files containing the base sequences (A, T, G, and C) for the coding strands of the genes.

I wrote a few lines of JavaScript to analyze the base compositions of the genes (and pseudogenes), and what I noticed immediately is that the base composition differs for the two groups:

Base Content (Genes) Content (Pseudogenes)
A
0.1938
0.2119
G
0.3116
0.2867
C
0.2890
0.2778
T
0.2046
0.2223

The G+C content for the "normal" genes averages 60.6%, whereas for the pseudogenes it's 55.4%. A typical G+C value for other members of the genus Mycobacterium is 65%. Thus, it's clear that not only the pseudogenes but the "normal" genes of M. leprae have drifted in the direction of more A+T. This has been noted before (by Cole et al. and others). What's perhaps less obvious is that purine content (A+G) has shifted from 50.5% in the normal genes to 49.8% in the pseudogenes. Bear in mind we're looking at data for one strand of DNA: the so-called coding or "message" strand.

Clearly, there is a tendency for pseudogenes to "regress to the mean." But the shift in purine concentration is particularly interesting, because it indicates that purine usage in normal-gene coding regions is perhaps non-randomly elevated. The shift from 50.5% to 49.8% in A+G content may not seem particularly striking on its own, but the difference, it turns out, is highly significant. You can see why in the following graph.

Base composition of "normal" genes in M. leprae (total purines vs. G+C) by codon base position. (n=1604) Red dots are for base one, gold dots are for base two, blue dots are for base 3 (the "wobble" base). Click to enlarge. See text for discussion.

To make this graph, I looked at the DNA of the coding regions of "normal" genes and determined the average purine content as well as the G+C content for positions one, two, and three of all codons. As you can see, the purine content (relative to the G+C content) segregates non-randomly according to codon base position. The red dots represent base one, the gold (or brown) dots represent base two, and the blue dots represent base three (often called the "wobble" base, for historical reasons). Not unexpectedly, the greatest G+C shift occurs in base three (as is usually the case). What's perhaps more surprising is the clear preference for purines in base one. The red cluster centers at y = 0.6051 plus or minus 0.0467 (standard deviation). This means that on average, position one of a codon is occupied by a purine (A or G) over 60% of the time. This is actually quite typical of codons in most organisms. I've looked at over 1,300 bacterial species so far, and in all of them, purines accumulate at codon base one. (Maybe in a future post, I'll present more data to this effect.)

Base two segregates out as having a G+C content significantly below the organism's total-genome G+C content and centers on y = 0.4434 (median) plus or minus 0.0547 (SD).

Now compare the above graph with a similar graph for M. leprae's pseudogenes:

Base composition of M. leprae pseudogenes by codon position. (n=2233) Again, red dots are for base one, gold are for base two, blue are for base three. Click to enlarge. See text for discussion.

Here, it's evident that base compositions for all three codon positions overlap significantly. The fact that the codon positions are no longer clearly defined in their spatial representation on this graph is consistent with widespread frameshift mutations in the DNA, causing bases that would normally be in position one (or two or three) to be in some other position, randomly.

Hence we can say, with some confidence, on the basis of these graphs, that many (if not most) of the "junk genes" in M. leprae harbor frameshift mutations. The question of which came first—frameshift mutations, or silencing of genes (followed by frameshifts)—is still open. But we know for certain frameshifts are indeed rampant in the M. leprae pseudogenome.

Exactly how or why M. leprae accumulated so many frameshift mutations (and then kept hoarding the mutated genes) is unknown. As I said earlier, much work remains to be done.

Note: Graphs were produced using the excellent service at ZunZun.com. Hand-editing of SVG graphs (before conversion to PNG) enabled easy modification of the data-point colors in a text editor. Data points were plotted with opacity = 0.30 so that areas of high overlap are more apparent visually (with the piling of data points on top of data points).

Bioinformaticists (and others!), feel free to leave a comment below.