Tuesday, April 22, 2014

How Antisense Genes Are Discovered

In the past ten years or so, a great deal of research has focused on antisense transcription of genes. Normally, RNA gets transcribed from one strand of DNA only. But it turns out, in many cases RNA also gets transcribed off the opposite strand of DNA (an antisense copy), either at the original gene (so-called cis transcription) or at a copy of the gene some distance away (trans transcription). The latter can be a pseudogene, or a normal copy of the gene.

Antisense transcripts occur very widely, not only in human DNA but in bacteria, yeast, and (in fact) every place where scientists have looked. Some of the most interesting discoveries have happened when researchers weren't specifically looking for antisense transcripts but found them by accident. How does that happen? It happens in experiments involving IVET (in vivo expression technology), an important experimental technique for uncovering new genes.

IVET is a powerful gene manipulation strategy for discovering which genes in an organism (a pathogen, usually) are up-regulated or turned on during host infection. Let's say you're studying a new pathogen and you want to get an idea of which genes, in the pathogen, are turned on during the infection process. First, you need a strain of the organism that's disabled by virtue of lacking a working copy of a particular metabolic enzyme, say an enzyme needed for purine metabolism, e.g. purA. Secondly, you need a vector for inserting a promoterless copy of the working gene into the bacterium. What this usually means is, you need a plasmid (a small extra chromosome; many bacteria have them, and they can often be manipulated in the lab) on which to place a functional purA gene.

The gene won't be expressed, however, if it lacks a suitable promoter region on the DNA upstream of the gene. That's good; that's what you want. You want to put a promoterless copy of the good gene on the plasmid, along with (this is crucial) a random chunk of DNA from the pathogen, inserted ahead of purA on the plasmid. In practice, it's easy to create a bunch of plasmids with this arrangement: a working copy of purA, and ahead of it, a random chunk of pathogen DNA.

The idea is that you now attempt to infect a lab animal with the bacterium containing the plasmid. If the bacterium establishes infection in the animal, presumably it's because the random chunk of DNA happened to contain a promoter region (and associated downstream genes) that gets turned on during infection. If you now isolate the bacterium from the sick animal, you can look to see which chunk of pathogen DNA sits ahead of purA on its plasmid, and therefore which gene(s) that promoter normally drives.

IVET is a promoter trap technology for selecting bacterial genes that are specifically induced when bacteria infect a host organism. A plasmid vector contains a random fragment of the chromosome of the pathogen (red) and a promoterless gene (selective marker, burgundy) that encodes an enzyme required for survival. Pooled plasmid-containing clones are inoculated into the mouse. Only those bacteria that contain the selective marker fused to a random gene that is transcriptionally active in the host are able to survive. After a suitable infection period, bacteria that express the marker are isolated from the spleen or other organs. The inclusion of a lacZY mutant gene (blue) allows post-selection screening for promoters that are active only in vivo. What you want are bacteria that are lac-positive only in the host environment, not "constitutive" (always-on).
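The two-step logic of IVET (survive in the host, then screen out always-on fusions) can be sketched as a toy filter. This is only an illustration of the selection logic, not a model of any real experiment: the `Clone` class, its fields, and the fragment names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clone:
    """One plasmid-bearing clone in a hypothetical IVET library."""
    fragment_id: str        # label for the random chunk of pathogen DNA
    active_in_host: bool    # does the fragment drive transcription in vivo?
    active_in_vitro: bool   # does it also drive transcription on lab media?


def select_in_host(library):
    """Step 1: only clones that express the promoterless marker in the host survive."""
    return [c for c in library if c.active_in_host]


def screen_in_vivo_only(survivors):
    """Step 2: discard constitutive fusions (those that are also lac-positive on lab media)."""
    return [c for c in survivors if not c.active_in_vitro]


# A made-up four-clone library covering every on/off combination.
library = [
    Clone("frag_A", active_in_host=True, active_in_vitro=False),   # the kind you want
    Clone("frag_B", active_in_host=True, active_in_vitro=True),    # constitutive
    Clone("frag_C", active_in_host=False, active_in_vitro=False),  # no promoter at all
    Clone("frag_D", active_in_host=False, active_in_vitro=True),   # on in the lab only
]

survivors = select_in_host(library)
hits = screen_in_vivo_only(survivors)
print([c.fragment_id for c in hits])
```

Note that nothing in either filter asks which strand the trapped promoter reads from, which is why antisense fusions can turn up without anyone looking for them.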
Exactly this sort of technique was used by Silby, Rainey, and Levy to determine which genes were activated in Pseudomonas during colonization of soil. (The IVET technique can be adapted to any scenario in which an organism differentially expresses genes in its adaptation to a "host" environment, even if the environment is, in fact, a plant, or soil in this case, rather than a mouse.) They were looking to see which genes in Pseudomonas play an essential role in that organism's ability to thrive in soil, and they successfully identified more than 50 promoters (and associated fusions) that come alive during soil colonization. When they looked at 22 "soil genes" that got turned on, they found ten previously undescribed genes that were transcribed in the antisense direction from regions overlapping known genes. They called these ten genes "cryptic fusions" because of their un-annotated existence on the supposedly silent, antisense side of known genes.

Cryptic fusions discovered by Silby et al. are shown in grey, in their antisense orientation to known genes (darker grey).

It's not unusual to find that antisense transcripts are playing a regulatory role. When a gene gets transcribed in both directions, the resulting sense and antisense RNAs can combine (by Watson-Crick pairing) to form a double-stranded RNA product, preventing translation of the RNA into protein. But incredibly, sometimes an antisense RNA transcript encodes a legitimate protein (a protein that gets made off the antisense copy). Silby and Levy documented this for the previously unknown cosA gene in Pseudomonas. It seems likely additional antisense proteins await discovery. (Most studies stop at the level of identifying RNA products.)

The finding of antisense transcripts in IVET experiments is common. One of the authors of the Pseudomonas study (Rainey) had previously published a study of rhizosphere-induced genes in Pseudomonas but had not published the fact that 20% of genes found this way were in an antisense orientation to normal genes. Likewise, a 1996 study of Pseudomonas aeruginosa infection in the mouse (Pseudomonas is an opportunistic pathogen) found antisense activity. In fact, the first-ever paper on IVET (by Mahan et al., 1993) described finding antisense transcripts.

IVET has uncovered a previously unknown "antitranscriptome" world hidden inside living cells. Until we explore this world fully, we won't know how much undiscovered biology we've left on the table.

Sunday, April 20, 2014

Bidirectionally Overlapping Genes

The occurrence of bidirectionally overlapping genes in bacteria is rare, and most such examples are dismissed as chimeric or representative of simple genome mis-annotation. After all, how can a gene make sense in one direction, but also make sense on the reverse-reading complementary strand of DNA? Such a situation is more than a mere palindrome. It's akin to the phrase:
Warsaw won, eh?
He now was raw.
The phrase has a sensical message in each direction, yet is not a mere bidi-symmetry of the "A man, a plan, a canal, Panama" kind. It defies credulity to believe a stretch of DNA spanning several hundred bases (several hundred "letters") could evolve to give a useful message in both directions. And yet, what is life itself, if not credulity-defying? Somehow, life began from primordial chemistry and evolved toward DNA genes coding for proteins. Is it so hard to believe that early replicant molecules (probably RNA) were transcribed and translated in both directions, and that some of the happy accidents survived? Is it so hard to believe that some proteins began life as reverse transcripts ("nonsense" proteins) that then evolved toward specialized functionality?

A bona fide example of a bidirectionally transcribed and translated gene was verified experimentally in 2008 by Silby and Levy, who were investigating the soil bacterium Pseudomonas fluorescens PF0-1. They found that the hitherto unknown cosA gene, which overlaps (on the opposite DNA strand) a gene for a fusaric acid resistance protein, is not only expressed as a protein but is required for soil colonization.

A section of P. fluorescens PF0-1 genome showing the existence of overlapping genes (note the yellow-colored segment, representing the cosA gene; the larger green gene above it, on the opposite strand, encodes a fusaric acid resistance protein). The overlapping genes have been shown experimentally to be expressed as protein.
Ironically, a month after Silby and Levy published their results, BMC Genetics published a study by Pallejà et al. looking at large gene overlaps in bacterial genomes. The Pallejà study concluded:
Among the 968 overlaps larger than 60 bps which we analysed, we did not find a single real one among the co-directional and divergent orientations and concluded that there had been an excessive number of misannotations. Only convergent orientation seems to permit some long overlaps, although convergent overlaps are also hampered by misannotations. We propose a simple rule to flag these erroneous gene length predictions to facilitate automatic annotation.
Silby and Levy argue that, to the contrary, current genome annotations are obscuring potentially important discoveries:
[Our] findings suggest that current genome annotations provide an incomplete view of the genetic potential of a given organism . . . In eukaryotes, the concept that genomes include numerous sense/antisense gene pairs is becoming increasingly obvious with genome-wide transcriptional studies in yeast [8] and Arabidopsis [10]. Antisense transcripts have been implicated in eye development [20] and control of entry into meiosis in yeast [21]. However, discussion of antisense transcription is limited to possible regulatory roles for antisense RNA [e.g. 8], without consideration of the possibility that they may specify proteins. Genome annotations do not routinely predict the existence of two protein-coding genes on opposite DNA strands, and in fact normally deliberately eliminate predicted overlaps. Moreover, small protein-coding genes can be missed by predictive algorithms. For example, the blr gene in E. coli specifies a 41 residue protein, and was discovered in a sequence believed to be intergenic [22]. The fact that antisense genes have been implicated in important biological functions indicates that more attention should be given to this emerging class of genes.
I happen to agree with Silby and Levy. It would be a shame if bidirectional overlaps in genomes are not investigated. The notion (furthered by Pallejà) that annotation software should suppress such findings automatically is repulsive. It's the kind of intolerant, rigid, dogmatic thinking science, quite frankly, doesn't need more of.
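If you want to see what kinds of overlaps your favorite annotation contains (rather than have software throw them away), the three orientations named in the Pallejà quote are easy to classify from gene coordinates. What follows is a minimal sketch, not the method used in either paper. The `Gene` class and the gene names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the "nested" category is my own addition for a gene lying wholly inside another on the opposite strand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int    # leftmost coordinate (1-based, inclusive)
    end: int      # rightmost coordinate (inclusive)
    strand: str   # "+" or "-"


def overlap_length(a, b):
    """Number of base pairs shared by the two genes (0 if they don't touch)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def classify_overlap(a, b):
    """Return 'none', 'co-directional', 'nested', 'convergent' or 'divergent'."""
    if overlap_length(a, b) == 0:
        return "none"
    if a.strand == b.strand:
        return "co-directional"
    left, right = sorted((a, b), key=lambda g: (g.start, g.end))
    contains = (left.end >= right.end) or (left.start == right.start)
    if contains:
        return "nested"  # one gene lies entirely inside the other, opposite strand
    # Left gene on "+" and right gene on "-": the 3' ends overlap (-> <-).
    # Left gene on "-" and right gene on "+": the 5' ends overlap (<- ->).
    return "convergent" if left.strand == "+" else "divergent"


def is_long_overlap(a, b, threshold_bp=60):
    """Flag overlaps longer than the 60 bp cutoff mentioned in the quote."""
    return overlap_length(a, b) > threshold_bp


g1 = Gene("geneA", 10, 400, "+")
g2 = Gene("geneB", 300, 900, "-")
print(classify_overlap(g1, g2), overlap_length(g1, g2), is_long_overlap(g1, g2))
```

Flagging a long overlap is one thing; deleting it from the annotation is another. A script like this only tells you where to look.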

Saturday, April 19, 2014

Codons and Reverse Complement Codons

A very unusual and surprising property of protein-coding genes is that if a codon A appears with a certain frequency in genes, the reverse-complement codon of A will also have a similar frequency of occurrence. For example: If CTT (leucine) appears at a frequency of 1%, the reverse complement codon AAG (lysine) will also appear at roughly 1%. If CGT (arginine) appears at 0.2%, ACG (threonine) will appear at around 0.2%. (These are whole-genome frequencies.)
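For anyone who wants to check the pairings, here's a minimal sketch of how a reverse-complement codon is obtained: complement each base, then read the result backwards. The function name is just my choice for illustration.

```python
# Map each DNA base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


# The two examples from the text.
for codon in ("CTT", "CGT"):
    print(codon, "->", reverse_complement(codon))
```

No codon can be its own reverse complement (the middle base would have to pair with itself), so the 64 codons always sort into 32 distinct pairs.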

This correlation is strongest (r=0.75) for organisms with a high genomic G+C content, such as Streptomyces griseus, and lowest (r=0.28) in low-GC organisms like Clostridium botulinum.

This is a very peculiar property, when you think about it. We don't usually imagine an organism being constrained in its choice of codons for a particular protein. If a particular protein calls for a huge amount of leucines (CTTCTTCTT) we don't imagine that there's a requirement for an equivalent quantity of AAG to be used somewhere else. And yet, the correlation between frequency-of-occurrence of a codon and its antisymmetric twin is, as I say, surprisingly high in many organisms.

This sort of thing is very hard to explain without invoking a theory of proteogenesis that involves antisense proteins. Imagine a poly-lysine gene of AAA repeated 100 times. The gene gets duplicated on the opposite strand. Now the original strand has 100 AAAs and a run of 100 TTTs. If a reading frame opens up on the TTT stretch (and the protein is beneficial to the organism; it survives), there is now codon/anticodon parity of the kind I'm describing, between codons in the poly-lysine gene and the poly-phenylalanine (TTT) gene.

Why does this relationship hold for high-GC organisms but not as much for low-GC organisms? Probably because antisense genes in high-AT organisms contain a lot of stop codons (TAA, TGA, TAG, which by the way occur at about the same frequencies as TTA, TCA, and CTA, respectively). The presence of few stop codons in high-GC antisense genes gives those genes a chance to be expressed and evolve further. Of course, if you buy this theory, it tends to argue for a "GC World" scenario in which the early proteome evolved from GC-rich double-stranded genomes.
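To get a rough feel for how strongly G+C content affects the stop-codon burden, here's a back-of-the-envelope sketch. It assumes (and this is a big simplification of my own, not something measured) that bases are drawn independently, with A and T equally likely and G and C equally likely. The G+C values in the loop are illustrative round numbers, not figures for any particular genome.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def base_probabilities(gc_fraction):
    """Per-base probabilities under the independent-base assumption."""
    at = (1.0 - gc_fraction) / 2.0
    gc = gc_fraction / 2.0
    return {"A": at, "T": at, "G": gc, "C": gc}


def stop_codon_probability(gc_fraction):
    """Chance that a random codon in a random frame is TAA, TAG or TGA."""
    p = base_probabilities(gc_fraction)
    return sum(p[c[0]] * p[c[1]] * p[c[2]] for c in STOP_CODONS)


def mean_codons_to_first_stop(gc_fraction):
    """Average number of codons read before hitting a stop (geometric mean length)."""
    p_stop = stop_codon_probability(gc_fraction)
    return float("inf") if p_stop == 0 else 1.0 / p_stop


for gc in (0.30, 0.50, 0.70):
    print(f"G+C {gc:.2f}: P(stop) = {stop_codon_probability(gc):.4f}, "
          f"mean run = {mean_codons_to_first_stop(gc):.1f} codons")
```

Because all three stop codons are AT-rich, the chance of hitting one by accident falls steeply as G+C rises, so an unselected antisense frame stays open much longer in a high-GC genome.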

To illustrate the unusual correlation I'm talking about, I took the codon frequencies of Pseudomonas fluorescens PF0-1 (genome-wide) and made a graph that plots the frequency of occurrence of each codon on the x-axis, versus the frequency of occurrence of the corresponding reverse-complement codon on the y-axis. (So if CTA occurs at 0.3% and TAG occurs at 0.2%, I plot a point at [0.3, 0.2].) The SVG graph (below) is interactive: You should be able to hover over a point and see a tooltip that shows the identity of the corresponding codon, and its reverse twin, and their respective frequencies.


The symmetry pattern is expected: For every codon/anticodon there's a corresponding anticodon/codon pair with frequencies swapped. What's more important than the symmetry pattern is the fact that frequency values in Y tend to increase with X and vice versa, with a correlation coefficient in this case of r=0.63 (F-statistic 41, p < .001). This means that codons tend to occur at about the same frequencies as their reverse complement codons. There are outliers, to be sure, but the overall trend is statistically solid.
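If you'd like to try this on a genome of your own, here's a minimal sketch of the calculation: tally codons across a set of coding sequences, pair each codon's frequency with that of its reverse complement, and compute Pearson's r over all 64 points. It's a sketch under assumptions, not the exact script behind the graph: the function names are mine, the input is assumed to be a list of in-frame coding sequences as plain strings, and codons containing anything other than A, C, G, T are simply ignored.

```python
from collections import Counter
from itertools import product
from math import sqrt

COMPLEMENT = str.maketrans("ACGT", "TGCA")
ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def codon_frequencies(coding_sequences):
    """Fraction of all (unambiguous) codons accounted for by each of the 64 codons."""
    counts = Counter()
    for cds in coding_sequences:
        cds = cds.upper()
        for i in range(0, len(cds) - len(cds) % 3, 3):
            counts[cds[i:i + 3]] += 1
    total = sum(counts[c] for c in ALL_CODONS)
    if total == 0:
        raise ValueError("no usable codons found")
    return {c: counts[c] / total for c in ALL_CODONS}


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def revcomp_correlation(freqs):
    """Correlate each codon's frequency (x) with its reverse complement's frequency (y)."""
    xs = [freqs[c] for c in ALL_CODONS]
    ys = [freqs[reverse_complement(c)] for c in ALL_CODONS]
    return pearson_r(xs, ys)


# Tiny made-up input, just to show the call pattern.
toy_genes = ["ATGCTTAAGCATTAA", "ATGAAGCTTTGA"]
print(round(revcomp_correlation(codon_frequencies(toy_genes)), 3))
```

Each pair shows up twice in the 64 points (once as [x, y] and once as [y, x]), which is where the mirror symmetry of the plot comes from.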

Leave a comment if you have any thoughts on what's going on here.

Thursday, April 17, 2014

The Pathogen's Playbook

When comparing pathogenic bacteria with non-pathogenic species of the same genus or family, we often find a common pattern. In the pathogen:
  • The genome is often reduced in size (particularly in endosymbionts, but also in others).
  • The genome is often shifted in the direction of higher A+T content (lower G+C content).
  • Many pseudogenes are present.
  • Often, the pathogen is a slow-grower in pure culture (if it can be cultured at all).
  • The pathogen has special nutritional needs.
An extreme case that illustrates all of these points is Mycobacterium leprae, the leprosy bacterium. It has fewer genes than its cousin, M. tuberculosis (which in turn has fewer genes than non-pathogenic Mycobacteria); its genomic G+C content is 8% lower than most other Mycobacteria; it contains over 1100 pseudogenes; it has a doubling time of two weeks; and it cannot be grown in pure culture (presumably because of fastidious nutritional requirements).

M. tuberculosis can be grown in the laboratory, but it, and its M. avium-group cousins, are very slow growers, taking anywhere from four days to two weeks to develop colonies on solid media.

It seems likely that some pathogens (certainly members of the Mycobacteria, but also the tiny Tenericutes, e.g. Mycoplasma, among many others) have evolved slow growth as a survival strategy. Certainly, organisms that have evolved an intracellular parasitic lifestyle need to be careful not to out-grow the host, if the relationship is to be a long one.

All of the factors listed above suggest a certain scenario, a "pathogen's playbook," if you will, which can be summarized as follows:
  1. The organism invades a warm-blooded host.
  2. Phagocytes (white blood cells) ingest the organism.
  3. The phagocytes undergo a respiratory burst, flooding the microbe(s) with peroxides, hypochlorites, nitric oxide, and other noxious oxidants.
  4. The flood of reactive oxygen species triggers an SOS response in the microbe.
  5. The microbe's DNA undergoes massive damage.
  6. Any surviving microbial cells are now pathogenic.
The SOS response is known to trigger mutagenicity. In Mycobacterium, for example, peroxides (as well as UV light) can induce up-regulation of dnaE, an error-prone polymerase. Since Mycobacteria are known to lack a MutS mismatch repair system, SOS-induced errors in DNA replication will almost certainly include uncorrected frameshift errors leading to the creation of pseudogenes. But that's a good thing, if you're a Mycobacterium interested in forming a long-term relationship with a host cell. The loss of certain genes (as long as they're not essential!) will likely slow your metabolism and make you dependent on host nutrients. Truly non-essential pseudogenes will simply be jettisoned over time, reducing the footprint of the remaining genome. Any pseudogenes that survive will likely have done so because they're now playing an essential gene-silencing role.

Let's expand on that last part. Take the dnaE gene, for example. M. leprae has two copies of this gene, only one of which is functional. Suppose both copies were functional at the time of the massive pseudogenization event that converted so many of M. leprae's genes to pseudogenes 9 to 20 million years ago. After the pseudogenization event (probably a phagocytic respiratory burst), one copy of dnaE became a pseudogene. But continued transcription of the pseudogene in the forward direction means the pseudo-mRNA competes with the "normal" dnaE transcript for ribosomal attention. Transcription of the antisense strand of the disabled gene would, of course, create a messenger RNA product that could silence the normal transcript by double-stranded interaction. Either way, once the pseudogenization event is over, dnaE expression is attenuated, as it should be once pathogenicity has been established.

Is it realistic to think M. leprae transcribes antisense strands of its pseudogenes? Given that E. coli has been found to contain ~1000 antisense transcripts, and given that we know M. leprae transcribes many of its pseudogenes, I think the answer has to be yes.

So the pattern is: infection, respiratory burst, massive mutation, silencing of many genes, and (oh by the way) creation of many brand-new gene products, some of them no doubt quite toxic to the host, as the result of gene truncation and pseudogene expression.

Tuesday, April 15, 2014

Coming to Grips with Pseudogenes

The term pseudogene was coined in 1977, when Jacq et al. discovered a version of the gene coding for 5S rRNA in the African clawed frog (Xenopus laevis) that was truncated yet retained homology with the active gene. Subsequent work has shown that in higher life forms, pseudogenes (genes that have been inactivated through one event or another) are almost as numerous as coding genes, with (for example) the human genome containing 10,000 or more pseudogenes. (A more recent estimate puts the number at 20,000.) Many of these pseudogenes are highly conserved. Looking at pseudogenes in the mouse and human, Svensson et al. found that of a group of 74 such genes that occur in both species, 30 appear to have been conserved since before the evolutionary divergence of mice and humans.

In higher organisms, pseudogenes are sometimes transcribed into RNA, with the RNA filling a regulatory function. For example, Korneev et al. found that simultaneous transcription of neural nitric oxide synthase (nNOS) and the antisense strand of a homologous pseudogene in the same neurons of Lymnaea stagnalis (a snail) leads to the formation of a duplex between the two strands and a reduction in nNOS translation. Further examples can be found in Pink et al. (2007), "Pseudogenes: Pseudo-functional or key regulators in health and disease?"

In bacteria, pseudogenes are somewhat rarer than in eukaryotes, but exist in significant numbers in many pathogens (including many species of Mycobacterium, Shigella, Brucella, Bordetella, and others). A study by Kuo and Ochman (2010) found that pseudogenes are swiftly eliminated from Salmonella. They describe "evidence of a strong deletional bias in Salmonella, such that genes that are not maintained by selection are rapidly inactivated and eliminated by mutational events." In fact, Kuo and Ochman found that pseudogenes are eliminated more rapidly than could be explained by the so-called neutral theory of evolution, indicating that the continued presence of pseudogenes exacts a high cost to the cell.

And yet, many bacteria with slow-evolving genomes (such as Mycobacterium species) retain their pseudogenes with high fidelity across evolutionary timespans. The most celebrated "pseudogene hoarder" of all time, M. leprae (the leprosy bacterium) appears to have acquired its 1000+ pseudogenes 9 to 20 million years ago. Meanwhile, the half-life of pseudogenes in Buchnera aphidicola was measured at 23.9 million years—a staggering number.

So on the one hand, we have work by Kuo and Ochman showing that pseudogenes in bacteria are rapidly eliminated, and on the other hand we have some bacterial lineages in which it seems pseudogenes are not only conserved but actively repaired over periods of tens of millions of years!

In Chapter 5 of Brucella: Molecular Microbiology and Genomics (2012, Caister Academic Press), Garcia-Lobo et al. describe their work with RNA sequence data from the bacterium Brucella abortus:
Twenty-four of the genes selected from the RNAseq data were annotated as pseudogenes in the B. abortus 2308 genome, which was considered a rather unexpected finding. By comparison with other Brucella genomes we can reduce the list of highly expressed pseudogenes to 16 (often, truncated parts of a gene are annotated as different pseudogenes especially in B. abortus 2308). This seems contradictory since high transcription of these genes, which should be not able to translate into functional proteins, will be contrary to biological economy. The high levels of transcription observed for these genes strongly suggest that they could be active genes and their products may perform functions unreported in metabolic reconstructions. High pseudogene expression may also indicate that these are very recently produced pseudogenes that did not turned down transcription yet by accumulation of mutations in their promoter or control regions. It is also possible that these pseudogenes may contain sequencing errors and they are indeed active genes.
It's almost comically obvious from this passage that the authors are troubled by their own finding that some pseudogenes in Brucella are highly transcribed. They try explaining it away by saying it could all be "sequencing errors."

A more parsimonious view is that pseudogenes that haven't been eliminated from a genome are, in fact, permanent, legitimate fixtures of the landscape, in microbes just as in higher life forms. And as in higher life forms, pseudogenes in microbes are probably serving perfectly understandable regulatory functions (when they're not actually translated into protein products).

Kuo and Ochman have convincingly shown that useless pseudogenes are quickly eliminated. It follows that any pseudogenes that aren't swiftly eliminated are, in fact, serving a biological purpose, or else they wouldn't be there. This line of reasoning is already well accepted by researchers who study eukaryotic life forms. Those who study bacteria need to take a hint from their up-the-food-chain colleagues.

What could the hundreds of pseudogenes in Bordetella pertussis (or the 1000+ pseudogenes in M. leprae) be doing? First we need to get used to the idea that in bacteria, virtually all genes are transcribed, in both directions. It's been four years since Dornenburg et al. reported finding ~1000 antisense transcripts in E. coli, but no one seems to have gotten the memo.

A section of Rothia mucilaginosa genome (top) and a corresponding portion of Mycobacterium leprae (bottom). The yellow gene, in each case, is DnaE (error-prone polymerase). Pink bands indicate areas of 65% or more homology between the two organisms. The small-diameter silver genes in the lower panel are M. leprae pseudogenes. "Normal genes" are shown in green. Notice that R. mucilaginosa has open reading frames on both strands of DNA, with many bidirectionally overlapping genes.

A look at the genome of the bacterium Rothia mucilaginosa DY18 shows that a very large proportion of "normal genes" have open reading frames on the opposite strand (see illustration). Bidirectional overlapping genes run throughout the Rothia genome. A massive annotation error? Maybe. Or maybe both strands are transcribed.
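Checking whether a gene has an open reading frame on its opposite strand takes only a few lines. The sketch below is deliberately simplified and is not how genome annotation pipelines actually call genes: it treats ATG as the only start codon (bacteria also use GTG and TTG), it needs an in-frame stop to close an ORF, and the function names and the `min_codons` cutoff are my own choices.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons=1):
    """Find ATG...stop ORFs in the three reading frames of one strand.

    Returns (start, end, n_codons) tuples: 0-based start, end exclusive
    (stop codon included), n_codons not counting the stop.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open an ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    found.append((start, i + 3, n_codons))
                start = None                   # close it and keep scanning
    return found


def antisense_orfs(gene_seq, min_codons=1):
    """ORFs on the strand opposite the one the gene is written on.

    Coordinates refer to the reverse-complemented sequence.
    """
    return orfs_in_strand(reverse_complement(gene_seq), min_codons)


# Made-up 12-base example: no ORF on the written strand, one on the opposite strand.
sense = "TTAGGGTTTCAT"
print(orfs_in_strand(sense), antisense_orfs(sense))
```

For a real genome you'd raise `min_codons` to something like 100 to weed out chance ORFs. An open frame only shows that translation is possible, not that it happens.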

If massive wholesale transcription of antisense strands occurs in E. coli, as we know it does, certainly it's no stretch to imagine it occurring in Rothia mucilaginosa. And if it is occurring in Rothia, which is (incidentally) an opportunistic pathogen, how much harder can it be to imagine it occurring in another well-known pathogenic member of the order Actinomycetales, Mycobacterium leprae? We know already that upwards of 40% of M. leprae pseudogenes are transcribed. Antisense transcripts could well be playing a role in silencing certain essential genes when attempts are made to grow the organism in defined media. Forward transcripts could be producing nonsense or partial-nonsense/truncated proteins that are excreted as toxins or find their way to the cell wall as surface antigens. Any number of scenarios might be possible.

Some very low-hanging fruit is available to microbiologists who are willing to accept the obvious. Instead of wishing away pseudogenes or imagining them to be useless baggage, we should be looking at them as potential determinants of pathogenicity. We should consider their possible roles in modulating protein expression patterns. We should attempt to learn why they're conserved; what role(s) they're playing in cell physiology. The last thing in the world we should be doing is calling them "junk DNA."

Monday, April 14, 2014

Pseudogenes Are Not Junk DNA

In 2007, a PLoS ONE paper by Ahmed et al. proposed a phylogeny for Mycobacteria in which M. leprae (the leprosy organism) is shown as a relatively recent branch off a very long tree, with M. tuberculosis depicted (in a decidedly fanciful schematic) as being of relatively recent provenance (35,000 years), diverging from M. canettii (a recently discovered cousin of tuberculosis) 3 million years ago.

The rather fanciful phylogenetic picture of Mycobacterium evolution presented by Ahmed et al. (2007).

The only trouble with this picture is that we know it's wrong. More exacting work has shown that M. tuberculosis is at least 3 million years old, and one paper estimates that the common ancestor of TB and leprosy may go back 66 million years. If the latter figure sounds dubious, consider that until recently, M. leprae wasn't thought to have any sister strains that could aid with dating the organism phylogenetically. But in 2008, the situation changed dramatically when it was realized that in Mexico, a distinct form of leprosy known as "diffuse lepromatous leprosy" (DLL) was actually due to a genetically distinct variant of Mycobacterium known as M. lepromatosis. When the genome for the latter organism was analyzed, it was found to contain the same stupendous assortment of pseudogenes contained in M. leprae, but detailed analysis of polymorphisms in the genomes of the two strains led to a surprising finding: Divergence of the strains appears to have occurred around 10 million years ago.

Another team found that the massive "pseudogenization event" that caused M. leprae (and its cousin, M. lepromatosis) to become saddled with a record number (1,116) of pseudogenes probably occurred on the order of 20 million years ago.

The age and stability of the pseudogenes in M. leprae can only be described as stunning. Conventional evolutionary dogma says that pseudogenes will inevitably be degraded and lost over time. Surely M. leprae can't be conserving and repairing pseudogenes over 10-million-year-long timespans? Pseudogenes are discardable junk.

Or are they?

An analysis of Buchnera aphidicola (the tiny Enterobacterial endosymbiont of the pea aphid) put the half-life of pseudogenes in that organism at 23.9 million years.
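To put that half-life in perspective, here's a quick calculation of how many pseudogenes you'd expect to still be around after a given stretch of time. It assumes simple first-order (exponential) loss, which is my reading of what "half-life" implies and not necessarily the exact model used in the Buchnera analysis. The function name and the time points are chosen purely for illustration.

```python
def fraction_remaining(elapsed_my, half_life_my=23.9):
    """Fraction of pseudogenes expected to survive after `elapsed_my` million years,
    assuming exponential loss with the given half-life (also in million years)."""
    return 0.5 ** (elapsed_my / half_life_my)


for t in (10, 20, 23.9, 50):
    print(f"after {t} million years: {fraction_remaining(t):.1%} remaining")
```

At that rate of loss, most of a pseudogene complement would still be there after 10 million years, which is the timescale under discussion for M. leprae.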

Human DNA reportedly contains over 12,000 pseudogenes. Some of these pseudogenes are quite old. Parallel nonsense mutations caused a pseudogenization of the uricase gene in apes during the early Miocene era (17 million years ago). We still carry the pseudogene in question—and it gets transcribed. According to a report by James T. Kratzer and colleagues at the University of Texas, Austin:
Despite being nonfunctional, cDNA sequencing confirmed that uricase mRNA is present in human liver cells and that these transcripts have two premature stop codons.
The inevitable conclusion is that pseudogenes are not, and should not be considered by default, "junk DNA." To the contrary, the default assumption should be that pseudogenes are ancient and conserved—because in most cases, that's exactly what they are.

What causes genes to "go pseudo"? Why are they conserved? What are they really doing? I'll tackle some of those questions in a followup post. Stay tuned.


Sunday, April 13, 2014

Whooping Cough Genomics

Pertussis, also known as whooping cough, is a highly contagious respiratory infection caused by Bordetella pertussis, a small aerobic bacterium that secretes numerous toxins capable of disrupting a normal immune response. The disease is rarely fatal but leaves victims with a nasty cough that can last weeks. In 2012, in the U.S., some 48,277 cases of pertussis were reported to the CDC. Of those cases, only 20 were fatal. By contrast, 28 Americans were killed by lightning the same year.

Bordetella pertussis
Unlike tuberculosis (which has been with us for 3 million years), Bordetella shows evidence of being a fairly new (and still rapidly evolving) pathogen, although in this case "fairly new" could still mean 700,000 years.

The complete DNA sequence of B. pertussis has been available for several years. It shows a moderate-size genome (of 4 million base pairs) encoding 3,447 genes, with a substantial number (360) of pseudogenes. The latter represent genes that have (by one means or another) been inactivated, whether through the appearance of premature stop codons in the gene, loss of a promoter region, random deletions, or what have you.

What makes Bordetella's pseudogenes interesting is that they're in remarkably good shape, as pseudogenes go. Usually, once a gene gets inactivated (goes pseudo), it begins to accumulate random point mutations, deletions, insertions, etc. at a substantial rate. In other words it deteriorates, since (supposedly) it's no longer under selection pressure. But when Australian researchers looked at 358 pseudogenes in B. pertussis Tohama I strain, they were shocked to find that the rate of nucleotide polymorphisms (i.e., changes to individual base-pairs in the DNA) was actually lower in pseudogenes than in regular genes (4.7E-5 per site versus 5.1E-5). That's exactly the opposite of what's expected. The researchers commented, somewhat laconically: "This suggests that most pseudogenes in B. pertussis were formed in the recent past and are yet to accumulate more mutations than functional genes."

What other explanation is there? Well, the most obvious alternative explanation is that the genes are still under selection pressure, even though they're turned off. How can that be? I can think of any number of scenarios; perhaps that'll be a future blog post. Suffice it for now to say, ribosomes are not totally unforgiving of missing stop codons (read up on tmRNA) nor are they unforgiving, in all cases, of frameshifts (read about programmed frameshifts), and if an open reading frame should appear on a pseudogene's antisense strand, you now have an RNA silencer (potentially) for the remaining good copy or copies of the gene, with attendant gene-modulation possibilities.

It's worth pointing out that pseudogenes in M. leprae (the leprosy bacterium) are not only conserved and ancient but continue to show strong homology to working orthologues in M. tuberculosis (and even more distantly related organisms such as Gordonia, Corynebacterium, and Nocardia) after millions of years. More on that in a later post.

For now, I thought it might be worth looking at the base composition of B. pertussis pseudogenes to see if they're riddled with frameshift errors (as is the case with M. leprae's pseudogenes). When I analyzed all 1,125,521 codons for all normal (not pseudo) genes in B. pertussis Tohama I strain, the resulting "paintball diagram" of base composition came up looking like this:
Paintball diagram for normal genes in B. pertussis Tohama I. Red dots are for codon base one, gold represents the composition at codon base two, blue is "wobble" (third) base composition. Every dot represents statistics for one gene (n=3447). See text for discussion.

Here, we're looking at purine (A+G) content versus G+C content for each base position in the codons. Every dot represents a gene's worth of data. Not unexpectedly, the most extreme G+C values occur in the third ("wobble") base. Codon base one (red dots) is purine-rich, centering on y=0.58. This is typical of most codons in most genes, in most organisms. Notice the "breakaway cloud" of gold points underneath the main gold cloud (at y<0.4). These points represent genes in which the second codon base is mostly a pyrimidine (C or T). Codons with a pyrimidine in base two tend to code for nonpolar amino acids. Thus, the breakaway cloud of gold points most likely represents genes for membrane-associated proteins. In this case, we're looking at about 558 genes falling in that category.
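For anyone who wants to try this kind of analysis, here is a minimal sketch of how the per-position numbers behind one dot triplet might be computed. It is not the script used to make the graphs; the function name codon_position_composition is hypothetical, and skipping ambiguous bases (anything other than A, C, G, T) is an assumption on my part.

```python
def codon_position_composition(cds):
    """Return [(purine_fraction, gc_fraction), ...] for codon bases 1, 2 and 3.

    `cds` is a coding sequence read in frame from its first base.
    Hypothetical helper; assumes ambiguous bases should simply be ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    results = []
    for pos in range(3):
        bases = seq[pos:usable:3]  # every third base, starting at this codon position
        n = sum(bases.count(b) for b in "ACGT")  # count unambiguous bases only
        if n == 0:
            raise ValueError("sequence contains no complete codons")
        purine = (bases.count("A") + bases.count("G")) / n  # y-axis value (A+G)
        gc = (bases.count("G") + bases.count("C")) / n      # x-axis value (G+C)
        results.append((purine, gc))
    return results


# Toy input: codons ATG GCT CTG GTT
for position, (purine, gc) in enumerate(codon_position_composition("ATGGCTCTGGTT"), start=1):
    print(f"codon base {position}: A+G = {purine:.2f}, G+C = {gc:.2f}")
```

Each gene yields three (G+C, A+G) points, one per codon position, which is why every gene shows up as a red, a gold, and a blue dot.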

Now look at the paintball diagram for the organism's 360 pseudogenes:

Base composition for "codons" in 360 pseudogenes of B. pertussis Tohama I. In this graph, as in the one above, dots are rendered with an opacity of 60% (so that overlapping points are less likely to obscure each other). See text for discussion.

In this case, there's a considerable amount of random statistical splay, but some of that is due simply to the fact that pseudogenes are a good deal shorter than normal genes, giving rise to more noise in the signal. (In this case, the average length of a pseudogene is 482 bases, vs. 982 for the 3,447 "normal" genes.) Even with considerable noise, though, it's apparent that the dot clusters tend to center on different parts of the graph, corresponding to the expected locations for normal genes. (Contrast this with the situation in M. leprae, where pseudogenes are riddled with frameshifts, rendering the concept of "codon base position" moot. Refer to the second paintball graph on this page.) Thus, we can say with some confidence that frameshifts are not so rampant in B. pertussis pseudogenes as to have rendered the concept of codons irrelevant. In fact, compared to M. leprae, pseudogenes in B. pertussis are comparatively unaffected by frameshifts. This tends to support the view of the Australian researchers (mentioned earlier) that pseudogenes in B. pertussis have not had enough time to accumulate very many mutations. But it can also be hypothesized that B. pertussis has had plenty of time (700,000 years, in fact) in which to accumulate mutations in its pseudogenes, yet has not done so. The evidence suggests that if anything, Bordetella repairs pseudogenes even more faithfully than regular genes.
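To get a rough feel for how much extra scatter the shorter pseudogenes should produce, here is a back-of-the-envelope sketch. It assumes (and this is only an assumption) that the base at a given codon position behaves like an independent coin flip with a purine probability of 0.58, the codon-base-one center mentioned earlier, so that the standard error of the observed fraction is sqrt(p(1-p)/n). The helper name proportion_se is hypothetical.

```python
import math


def proportion_se(p, n_codons):
    """Binomial standard error of an observed fraction over n_codons sites."""
    return math.sqrt(p * (1.0 - p) / n_codons)


p = 0.58                 # assumed purine fraction at codon base one
normal_codons = 982 / 3  # average normal gene length in codons
pseudo_codons = 482 / 3  # average pseudogene length in codons

se_normal = proportion_se(p, normal_codons)
se_pseudo = proportion_se(p, pseudo_codons)

print(f"expected scatter, normal genes: +/- {se_normal:.3f}")
print(f"expected scatter, pseudogenes:  +/- {se_pseudo:.3f}")
print(f"ratio: {se_pseudo / se_normal:.2f}")
```

Under this simple model the scatter grows as the square root of the length ratio, so halving the gene length widens the cloud by a factor of about 1.4 rather than 2.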

At this point it might be relevant to interject that while M. leprae (like other members of the Mycobacteria) lacks the MutS/MutL mismatch repair system, Bordetella does, in fact, have a MutS/MutL mismatch repair system, and this may explain the relative paucity of frameshift errors in Bordetella pseudogenes. But it also implies (rather oddly) that Bordetella goes out of its way to repair its pseudogenes.

Interestingly, 234 out of 360 pseudogenes have an AG1 (purine, base one) content greater than 55%, which means they're probably still "in frame." Of these 234, some 69 (30%) have AG2 less than 40%, meaning they're most likely genes for membrane-associated proteins. If we look at the 2,456 normal genes that have AG1 greater than 55%, only 398 (16%) are putative membrane-associated proteins (with AG2 less than 40%). Bottom line: Putative membrane-associated proteins are almost twice as common among in-frame pseudogenes as among comparable normal genes (30% versus 16%), which suggests that pseudogenes for such proteins are more likely to still be in-frame. While this could be a statistical fluke, it could also be that membrane proteins are somehow "spared" preferentially when it comes to leaving pseudogenes translatable. To put it differently: Pseudogenes for non-membrane-associated proteins are less likely to remain in-frame. This makes sense, in that much of Bordetella's pathogenicity can be ascribed to proteins that make up cell-surface antigens or that transport toxins to the outside world. Some of the toxic surface proteins may, in fact, be nonsense (or partial-nonsense) proteins, that is, products of pseudogenes.
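The "statistical fluke" question can be checked, at least crudely, with the counts given above. The sketch below runs a pooled two-proportion z-test on 69 of 234 versus 398 of 2,456. This is my own choice of test, not something from the original analysis, and it assumes that genes are independent samples and that the AG1 and AG2 cutoffs classify genes reliably, neither of which is guaranteed. The function name two_proportion_z is hypothetical.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)  # shared proportion under the null
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return z, p_value


pseudo_frac = 69 / 234     # in-frame pseudogenes with AG2 < 40%
normal_frac = 398 / 2456   # in-frame normal genes with AG2 < 40%
ratio = pseudo_frac / normal_frac
z, p_value = two_proportion_z(69, 234, 398, 2456)

print(f"pseudogenes: {pseudo_frac:.1%}, normal genes: {normal_frac:.1%}")
print(f"ratio: {ratio:.2f}, z = {z:.2f}, two-sided p = {p_value:.2g}")
```

Under those assumptions the gap comes out at roughly five standard errors, which would be hard to explain by sampling noise alone. It says nothing, of course, about whether the 55% and 40% cutoffs themselves are the right way to call a gene "in frame" or "membrane-associated."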