Saturday, June 22, 2013

A Simple Method for Estimating the Rate of Transition vs. Transversion Mutations

Point mutations in DNA fall into two types: transition mutations, and transversion mutations. (See graphic below.)

In a transition mutation, a purine is swapped for a different purine (for example, adenine is swapped with guanine, or vice versa), or a pyrimidine is swapped with another pyrimidine (C for T or T for C); and usually, if a purine is swapped on one strand, the corresponding pyrimidine gets swapped on the other. Thus, a GC pair gets changed out for an AT pair, or vice versa.

A transversion, on the other hand, occurs when a purine is swapped for a pyrimidine, or vice versa. In a pairwise sense, this means a GC pair becomes a TA pair (for example) or an AT pair gets changed out for CG, or possibly AT for TA, or GC for CG.
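The two definitions above can be written down as a small classifier. This is only an illustrative sketch; the function name `classify_substitution` is my own and is not taken from any library.

```python
# Purines and pyrimidines among the four DNA bases.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    # Reject identical bases and anything that is not A, C, G or T.
    if ref == alt or {ref, alt} - (PURINES | PYRIMIDINES):
        raise ValueError("need two different bases from A, C, G, T")
    # Same chemical class (purine->purine or pyrimidine->pyrimidine) is a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # purine for purine
print(classify_substitution("G", "T"))  # purine for pyrimidine
```

Of the twelve possible single-base changes, four are transitions and eight are transversions, which is part of why a transition/transversion ratio above 1 is considered notable.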

Of the two types of mutation, transitions are more common. We also know that, in particular, GC-to-AT transitions are much more common than AT-to-GC transitions, for reasons that are well understood but that I won't discuss here. If you're curious to know what the experimental evidence is for the greater rate of GC-to-AT transitions, see Hall's 1991 Genetica paper (paywall protected, unfortunately) or the non-paywall-protected Y2K J. Bact. paper by Zhao. The latter paper is interesting because it shows that GC-to-AT transitions are more common in stationary-phase cells than exponentially-growing cells, and also, transitions in stationary E. coli are repaired by MutS and MutL gene products. (Overexpression of those two genes results in fewer transitions. Mutation of those two genes results in more transitions.)

An open question in molecular genetics is: What are the relative rates of transitions versus transversions, in natural populations? We know transitions are more common, but by what factor? Questions like this are tricky to answer, for a variety of reasons, and the answers obtained tend to vary quite a bit depending on the organism and methodology used. Van Bers et al. found a transition/transversion ratio (usually symbolized as κ) of 1.7 in Parus major (a bird species). Zhang and Gerstein looked at human DNA pseudogenes and found transitions outnumber transversions "by roughly a factor of two." Setti et al. looked at a variety of bacteria and found that the transition/transversion rate ratio for mutations affecting purines was 2.1 whereas the rate ratio for pyrimidines was 6.6. Tamura and Nei looked at nucleotide substitutions in the control region of mitochondrial DNA in chimps and humans (a region known to evolve rapidly) and found κ to be approximately 15. Yang and Yoder looked at mitochondrial cytochrome b in 28 primate species and found an average κ of 6.4. (In general, κ values tend to be considerably higher for mitochondrial DNA than other types of DNA.)

It's important to note that in all likelihood, no single value of κ will be universally applicable to all genes in all lineages, because evolutionary pressures vary from gene to gene and the rates of transition and transversion are different for different nucleotides (and so codon usage biases come into play). For an introduction to the various considerations involved in trying to estimate κ, I recommend Yang and Nielsen's 2000 paper as well as their 1998 and 1999 papers.

The reason I bring all this up is that I want to offer yet another possible way of estimating the transition/transversion rate ratio κ, using DNA composition statistics. Earlier, I presented data showing that the purine (A+G) content of coding regions of DNA correlates directly with genome A+T content. Analyzing the genomes of representatives of 260 bacterial genera, I came up with the following graph of purine mole-percent versus A+T mole-percent:

The correlation between genome A+T content and mRNA purine content is strong and positive (r=0.852). Szybalski's Rule says that message regions tend to be purine-rich, but that's not exactly accurate. When genome A+T content is below approximately 35%, coding regions are richer in pyrimidines than purines. Above 35%, purines predominate. The concentration of purines in the mRNA-synonymous strand of DNA rises steadily with genome A+T content, with a slope of 0.13013.
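For readers who want to reproduce a plot of this kind, here is a rough sketch of the two ingredients: base-composition fractions for a sequence, and an ordinary least-squares slope through a set of (A+T, A+G) points. The function names are hypothetical, and I am assuming plain least squares here; the x value would come from the whole genome and the y value from the coding (mRNA-synonymous) strand only.

```python
def composition(seq: str) -> tuple[float, float]:
    """Return (A+T fraction, A+G fraction) of a DNA sequence."""
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())  # ambiguous bases such as N are ignored
    if total == 0:
        raise ValueError("no A, C, G or T found")
    at = (counts["A"] + counts["T"]) / total
    ag = (counts["A"] + counts["G"]) / total
    return at, ag


def least_squares_slope(points: list[tuple[float, float]]) -> float:
    """Ordinary least-squares slope of y regressed on x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx
```

With one (genome A+T, coding-strand A+G) point per organism, `least_squares_slope` gives the rise rate discussed in the text.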

If you try to envision evolution taking an organism from one location on this graph to another, you can imagine that GC-to-AT transitions will move an organism to the right, whereas AT-to-GC transitions will move it to the left. To a first approximation (only!) we can say that horizontal movement on this graph essentially represents the net effect of transitions.

Vertical movement on this graph clearly involves transversions, because a net change in relative A+G content implies nothing less. To a very good first approximation, vertical movement in the graph corresponds to transversions.

Therefore, a good approximation of the relative rate of transitions versus transversions is given by the inverse of the slope. The value comes to 1.0/0.13013, or κ = 7.6846.

In an earlier post, I presented a graph like the one above applicable to mitochondrial DNA (N=203 mitochondrial genomes), which had a slope of 0.06702. Taking the inverse of that slope, we get a value of κ =14.92, which is in excellent agreement with Tamura and Nei's estimate of 15 for mitochondrial κ.

When I made a purine plot using plant and animal virus genomes (N=536), the rise rate (slope) was 0.23707, suggesting a κ value of 4.218. This agrees well with the transition/transversion rate for hepatitis C virus (as measured by Machida et al.) of 1.5 to 7.0 depending on the gene.

In short, we get very reasonable estimates of κ from calculations involving the slope of the A+G vs. A+T graph, across multiple domains.
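The arithmetic behind the three estimates is just a reciprocal. A short sketch, using the slopes quoted above (the function name `kappa_from_slope` is only for illustration):

```python
# Slopes of the A+G vs. A+T regression lines quoted in the text.
SLOPES = {
    "bacteria (N=260)": 0.13013,
    "mitochondria (N=203)": 0.06702,
    "plant and animal viruses (N=536)": 0.23707,
}


def kappa_from_slope(slope: float) -> float:
    """Estimate the transition/transversion ratio as the inverse of the slope."""
    if slope <= 0:
        raise ValueError("slope must be positive")
    return 1.0 / slope


for label, slope in SLOPES.items():
    print(f"{label}: slope = {slope:.5f}, kappa = {kappa_from_slope(slope):.3f}")
```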

The main methodological proviso that applies here has to do with the fact that technically, some horizontal movement on the graph can be accomplished with transversions (AT-to-CG, for example). We made a simplifying assumption that all horizontal movement was due to transitions. That assumption is not strictly true (although it is approximately true, since transitions do outnumber transversions; and some transversions, such as AT<-->TA and GC<-->CG, have no effect on genome A+T content). Bottom line, my method of estimating κ probably overestimates κ somewhat, by including a small proportion of AT<-->CG transversions in the numerator. Even so, the estimates agree well with other estimates, tending to validate the general approach.

I invite comments from knowledgeable specialists.

Friday, June 21, 2013

RNA Folding and Purine Loading

The other day I learned that an acquaintance of mine had done graduate work in a famous molecular genetics lab. We started "talking shop," and I happened to mention some of my recent bioinformatics forays, in particular my recent unexpected finding that the purine content of mRNA can be predicted from the G+C (guanine plus cytosine) content of the genome.

The purine (A+G) content of protein-coding regions of DNA correlates with the overall A+T content of the genome. The higher the A+T content of the double-stranded DNA, the higher the purine content of the single-stranded mRNA. A total of 260 bacterial genomes were analyzed for this graph. Organisms with very high A+T content tend to have relatively small genomes, which is one reason there is more scatter toward the right side of the graph. Correlation: r=0.852.

My friend asked what the implications of this might be. I offered a couple of thoughts. First, I said that just as differences in G+C content between genes in a given organism can sometimes be used to detect foreign genes (e.g., embedded phage/virus genes, horizontal gene transfers, etc.), variations in the purine to pyrimidine ratio of gene coding strands might also be a way to detect foreign genes. For example, in an organism like Clostridium botulinum, where the genome's coding regions have an average purine content of 58.5%, finding a gene with purine content below 46% (two standard deviations away from the mean) might be a tipoff that the gene came from a different organism. This is a useful new technique, because genes with high-purine-content coding regions don't always have high A+T content (thus, detection of horizontal gene transfers via purine loading will expose genes that would otherwise be missed on the basis of G+C content). In other words, two genes might have exactly the same G+C (or A+T) characteristics but differ in purine content. The difference in purine content would be the tipoff to a possible horizontal-gene-transfer event.
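A minimal sketch of the screening idea, assuming you already have the coding-strand sequence of every gene in a dictionary keyed by gene name. The function names and the toy sequences in the usage example are hypothetical; the two-standard-deviation cutoff is the one mentioned above.

```python
from statistics import mean, stdev


def purine_fraction(seq: str) -> float:
    """Fraction of A and G among the A, C, G, T bases of a sequence."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("no A, C, G or T found")
    return (seq.count("A") + seq.count("G")) / total


def flag_purine_outliers(genes: dict[str, str], n_sd: float = 2.0) -> list[str]:
    """Names of genes whose purine content is more than n_sd standard
    deviations from the mean purine content of all genes supplied."""
    fractions = {name: purine_fraction(seq) for name, seq in genes.items()}
    values = list(fractions.values())
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        return []
    return [name for name, f in fractions.items() if abs(f - mu) > n_sd * sd]


# Toy data: nine purine-rich "genes" and one pyrimidine-rich one.
toy = {f"gene{i}": "AAGGAGCT" for i in range(9)}
toy["suspect"] = "CCTTCTCT"
print(flag_purine_outliers(toy))
```

In real use the sequences would be full coding regions, and a flagged gene is only a candidate for horizontal transfer, not proof of it.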

Another implication of the A+G versus A+T relationship involves foreign RNA detection. Bacteria need to be able to detect self versus non-self nucleic acids. (Incoming phage nucleic acids need to be detected and destroyed; and in fact, they are. This is how restriction enzymes were discovered.) Messenger RNA has secondary structure: it undergoes folding, based on intrastrand regions of complementarity. The amount of complementarity depends on the relative abundances of purine and pyrimidines that can pair with one another. If a strand of RNA is mostly purines (or mostly pyrimidines, for that matter), there will be less opportunity to self-anneal than if purines and pyrimidines are equally abundant. Thus, the folding of RNA will be different in an organism with high genome A+T content (low G+C content) than in an organism with low A+T.

An example of how purine loading can affect folding is shown below. The graphic shows the minimum-free-energy folding of the mRNA for catalase in Staphylococcus epidermidis strain RP62A (left) and Pseudomonas putida strain GB-1 (on the right). The Staph version of this messenger RNA has a 1.28 ratio of purines to pyrimidines, whereas the Pseudomonas version has a 0.98 purine-pyrimidine ratio. As a result, the potential for purine-pyrimidine hydrogen bonding is considerably less in the Staph version of the mRNA than in the Pseudomonas version, and you can easily see this by comparing the two RNAs shown below. The one on the left has far more loops (areas where bases are not complementary) and complex branching structures. In the mRNA on the right, long sections of the molecule are able to line up to form double-stranded structures; loops are few in number, and small.

The minimum-free-energy folding for two catalase mRNAs, one with high purine content (Staphylococcus, left) and one with lower purine content (Pseudomonas, right). Foldings were generated by

This kind of difference can explain the ability of various strains of bacteria to reject infectious RNA from another strain's viruses (phage). Foreign RNA entering a cell will "look" foreign to the cell's endogenous complement of RNA nucleases, and based on this, host nucleases will quickly destroy the intruder RNA. This mechanism provides a primitive kind of immune system for bacteria.

There is one other important implication of the purine-loading curve. The curve resolves one long-standing open question in molecular biology, having to do with mutation rates. I'll talk about it in tomorrow's post. Please join me then—and bring a biologist-friend!

Monday, June 17, 2013

An Example of Antisense Proteogenesis?

The question of how organisms develop entirely new genes is one of the most important open questions in biology. One possibility is that new genes often develop through accidental translation of antisense strands of DNA.

An example of this can be seen with the S1 protein of the 30S bacterial ribosome. If you take the amino-acid sequence for an S1 gene and use it as the query sequence in a blast-p (protein blast), you'll mostly get back hits on other S1 proteins, but you'll also get minor (low-fidelity) hits on polynucleotide phosphorylase. Why? When you do a blast search, the search engine, by default, looks at both DNA strands of target genes (sense and antisense strands) to see if there's a potential sequence match with the query. If there's a match on the antisense strand, it will be reported along with "sense" matches. In the case of the S1 protein, blast-p searches often report weak antisense hits on polynucleotide phosphorylase in addition to strong sense hits on ribosomal S1.

Ribosomal proteins are, of course, among the most highly conserved proteins in nature. It turns out that polynucleotide phosphorylase (PNPase) is very highly conserved as well. It's an enzyme that occurs in every life form (bacteria, fungi, plants, animals), absent only in a scant handful of microbial endosymbionts that have lost the majority of their genes through deletions. While the chemical function of PNPase is well understood (it catalyzes the interconversion of nucleoside diphosphates to RNA), its physiologic purpose is not well understood, although recent research shows that PNPase-knockout mutants of E. coli exhibit lower mutation rates. (Hence, PNPase may actually be involved in generating mutations.)

The bacterium Rothia mucilaginosa, strain DY18, has a (putative) PNPase gene at a genome offset of 1277514. When this gene is used as the query for a blast-p search, the hits that come back include many strong matches for the S1 ribosomal proteins of various organisms. By "strong match," I mean better than 80% sequence identity coupled with an E-value (expectation value) of zero. (Recall that the E-value represents the approximate odds of the match in question happening due to random chance.)

If we use the Genome Viewer to look at the PNPase gene of Rothia mucilaginosa, we see something extraordinarily peculiar (look carefully at the graphic below).

Notice the presence of overlapping sense and antisense open reading frames on a portion of DNA from Rothia mucilaginosa. The top reading frame contains the gene for polynucleotide phosphorylase. The lower (-1 strand) reading frame contains ribosomal S1.

Notice that there are overlapping genes. On the top strand is the gene for PNPase; on the bottom strand, in the same location, is a gene for ribosomal S1. These are bidirectionally overlapping open reading frames, something occasionally encountered in virus nucleic acids but rarely seen in bacterial or other genomes.

How do we explain this anomaly? It could be just that: an anomaly, two open reading frames that happen to overlap (but that aren't necessarily translated in vivo). Or it could be that at some point, many millions of years ago, the ribosomal S1 gene of a Rothia ancestor was erroneously translated via the antisense strand, producing a protein with PNPase characteristics. We don't know why PNPase confers survival value (its physiologic purpose is not fully understood), but we do know, with a fair degree of certainty, that PNPase does, in fact, confer survival value—because every organism, at every level of the tree of life, has at least one copy of PNPase. Once Rothia's ancestor, through whatever process, opened up a reading frame on the antisense strand of ribosomal S1, the reading frame stayed open, because it conferred survival value. In this way, the first Rothia PNPase was born. (Arguably.)

At some point in its history, Rothia duplicated its PNPase gene and placed a new copy at genome offset 1650959. Over time, this second copy diverged from the original copy, becoming more like E. coli PNPase (which is also to say, less S1-like). Rothia's second PNPase shows a blast-p similarity of 45% (in terms of AA identities) to E. coli PNPase, with E-value 4.0e-147. It shows a blast-p similarity of 26% (AA identities) with E. coli ribosomal S1 (E-value: 4.0e-17). Neither E. coli PNPase nor Rothia PNPase-2 overlaps an S1 gene. However, both are colocated with the ribosomal S15 protein gene. And you'll find (if you look at lots of bacterial genomes) that PNPase is almost always located immediately next to an S15 ribosomal gene.

Rothia PNPase is an example of an enzyme that may very well have started out as an antisense copy of another protein (the S1 ribosomal protein). Of course, the mere presence of bidirectionally overlapping open reading frames doesn't prove that both frames are actually transcribed and translated in vivo. But the fact that blast-p searches using PNPase as the query almost always turn up faint S1 echoes (in a wide variety of organisms) is highly suggestive of an ancestral relationship between the two proteins.

Sunday, June 16, 2013

Evolution and Antisense Translation of DNA

Yesterday I offered a theory for new gene creation which might be called the Erroneous Translation Theory. Basically, I proposed that new proteins arise through frameshifted and/or reversed translation of nucleic acids (translation of antisense strands of DNA).

Erroneous translation of DNA offers interesting possibilities for gain of function. (Recall that most point mutations result in loss of function, and one of the major criticisms of Darwinian theory is that evolution based on accumulation of point mutations cannot account for gain-of-function events.) Wholesale mistranslation via frameshift errors and/or wrong-strand transcription allow for the sudden emergence of entirely new classes of proteins. The unit of change is no longer the single base-pair polymorphism but the functional domain or motif.

An important aspect of antisense-strand translation has to do with stop codons. In DNA, the sequences TCA, TTA, and CTA specify amino acids serine, leucine, and leucine, respectively. But when these three codons are complemented, then read in 5'-to-3' direction—in other words, when they're antisense-translated—they form the stop codons TGA, TAA, and TAG, which tell the cell's protein-making machinery to terminate the production of the current polypeptide. Thus, if a typical gene containing codons TCA, TTA, and CTA is translated "backwards," translation will end prematurely: It will end as soon as a stop codon is encountered.
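The complement-then-reverse step is easy to check by hand, but here is a short sketch that does it mechanically (the helper name `reverse_complement` is my own):

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Complement a DNA sequence, then reverse it so it reads 5'-to-3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


for codon in ("TCA", "TTA", "CTA"):
    antisense = reverse_complement(codon)
    label = "stop" if antisense in STOP_CODONS else "not a stop"
    print(f"{codon} -> {antisense} ({label})")
```

The three sense codons named above are the only ones whose antisense reading is a stop codon, since there are only three stop codons in the standard code.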

How important a consideration is this in the real world? Consider the following DNA sequence, which represents the gene for the cytidine deaminase enzyme of Clostridium botulinum:


The above sequence is the "sense" strand of the DNA, in 5'-to-3' direction. The sequence below is the corresponding 3'-to-5' complementary sequence (in other words, what's on the antisense strand of DNA):


When the antisense sequence is translated in the normal 5'-to-3' direction, the following amino acid sequence results:


This sequence of 146 amino acids (shown here using standard one-letter amino-acid abbreviations) contains 10 stop codons (depicted as asterisks). Any attempt to translate the antisense strand of the C. botulinum cytidine deaminase gene will result in (at best) a series of short oligopeptides.

It's tempting to conclude that this is nature's ingenious way of preventing the occurrence of nonsense proteins. Translate the wrong strand of DNA by mistake, and translation quickly terminates. (In the above example, a stop codon occurs every 14 amino acids, on average.) But before you jump to that conclusion, consider the cytidine deaminase gene of Anaeromyxobacter dehalogenans strain 2CP-C:


The translation of the antisense version of this gene is:


Which contains no stop codons! Why does one version of the gene give ten stop codons when anti-translated, whereas the other version gives zero stop codons? Clostridium botulinum has a genome G+C content of 28% whereas the DNA of Anaeromyxobacter dehalogenans has a G+C content of 74%. The two organisms favor entirely different codons. Anaeromyxobacter uses codons TCA, TTA, and CTA only 0.03%, 0%, and 0.02% of the time, respectively. Clostridium uses the same codons 1.72%, 5.62%, and 4.67% of the time—over 200 times more often than Anaeromyxobacter.

Bottom line: Almost any gene in Anaeromyxobacter (or any high-GC organism, it turns out) can be antisense-translated without generating stop codons. Stop codons occur in antisense genes in inverse proportion to the amount of G+C in the gene.  
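To try this on any gene of interest, something like the following sketch would do. It counts the stop codons met when the antisense strand of a coding sequence is read 5'-to-3' in the frame that lines up with the sense codons. The function names are hypothetical, and the two short sequences are made-up toys, not the real cytidine deaminase genes.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Complement a DNA sequence, then reverse it so it reads 5'-to-3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def antisense_stop_count(sense_cds: str) -> int:
    """Number of stop codons in the antisense strand, read 5'-to-3' in the
    frame aligned with the sense codons (assumes len(sense_cds) % 3 == 0)."""
    anti = reverse_complement(sense_cds)
    codons = [anti[i:i + 3] for i in range(0, len(anti) - 2, 3)]
    return sum(codon in STOP_CODONS for codon in codons)


# Hypothetical toy sequences: same amino acids (M-S-L-L-G), different codon choices.
low_gc_toy = "ATGTCATTACTAGGC"   # uses TCA, TTA, CTA
high_gc_toy = "ATGTCGCTGCTCGGC"  # uses TCG, CTG, CTC instead
print(antisense_stop_count(low_gc_toy), antisense_stop_count(high_gc_toy))
```

Both toy sequences encode the same peptide, so the difference in antisense stops comes entirely from synonymous codon choice, which is the point made above.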

If it's true that antisense-strand translation is (or has been) an important source of new proteins in nature, the foregoing observation is tremendously relevant, because it means successful reverse translation has likely occurred far more often in high-GC organisms than in low-GC organisms. It suggests that bacteria with high G+C content in their genomes may, in fact, have been the incubators of early proteins. It implies a "GC Eden" scenario in which early life forms had predominantly high-GC genomes. Low-GC organisms then arose through continuous "AT pressure," from large numbers of accumulated GC-to-AT transition mutations. (We know that GC-to-AT transition mutations occur at a much higher rate than AT-to-GC transitions; this fact is not in dispute.)

Even so, we have to ask: What is the evidence for reverse (antisense-strand) translation having occurred in nature? Is there any such evidence?

More on this subject tomorrow.

Saturday, June 15, 2013

Thoughts on New Gene Origination

The other day, I wrote a damning critique of Darwin's theory and offered nothing in the way of a positive alternative to the traditional view of accumulated-point-mutations as a driving force for evolution. It's easy to take potshots at someone else's theory and walk away. As a rule, I don't like naysayers who criticize something, then offer nothing in return. So I'd like to take a moment to try to offer a different perspective on evolution. In particular, I'd like to offer my own theory as to how new genes arise.

The question of where new genes come from is, of course, one of the foremost open problems in biology. Current theory revolves mostly around gene duplication followed by modification of the duplicated gene (via mutations and deletions) under survival pressure [reference 4 below]. Gene fusion and fission have also been proposed as mechanisms for gene origination [3]. In addition, genes derived from noncoding DNA have recently been described in Drosophila [1]. Likewise, transposons (genes that jump from one location to another) have been implicated in gene biogenesis [2].

The problem with these theories is that various enzymes are required in order for duplication, transposition, fusion, fission, etc., to occur (to say nothing of transcription, translation initiation, translation elongation, and so on), and existing theories don't explain how these participating enzymes appeared, themselves, in the first place. A fully general theory has to start from the assumption that in pre-cellular, pre-chromosomal, pre-organismic times, genes (if they existed) may have occurred singly, with multiple copies arising through non-enzymatic replication. Likewise, we should assume that early protein-making machinery was probably non-enzymatic, which is to say entirely RNA-based (i.e., ribozymal). If the idea of catalytic RNA is new to you or sounds unreasonably farfetched, please review the 1989 Nobel Prize research by Altman and Cech.

The fundamental mechanisms of de novo gene creation available in pre-enzymatic times might well have been nothing more than ribozymal duplication of nucleic acid sequences followed by erroneous translation. "Erroneous translation" can be of two fundamental types: frameshifted translation, and reverse translation. (Reverse translation here means transcription of the antisense strand of DNA and subsequent translation to a polypeptide.)

DNA is parsed 3 bases at a time (the 3-base combinations are called codons; each codon corresponds to an amino acid). If a single base is spuriously added to, or deleted from, a gene, the reading frame is disrupted and a hugely different amino-acid sequence results. This is called a frameshift error or frameshift mutation.

Spurious addition or deletion of a single base to a free-floating piece of single-stranded genetic material (RNA or DNA) is all that's needed in order to cause frameshifted translation. The protein that results from a frameshift error is, of course, in general, vastly different from the original protein.
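To see how drastic a one-base frameshift is, here is a small sketch that translates a toy sequence before and after inserting a single base. It uses the standard genetic code; the sequence and the insertion point are hypothetical, chosen only for illustration.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq: str) -> str:
    """Translate a DNA sequence codon by codon ('*' marks a stop codon).
    Trailing bases that do not fill a codon are ignored."""
    seq = seq.upper()
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


original = "ATGGCCAAAGGGTTTCAC"              # hypothetical toy gene
shifted = original[:3] + "C" + original[3:]  # one base inserted after the first codon

print(translate(original))
print(translate(shifted))
```

Everything downstream of the inserted base is read in a new frame, so only the first amino acid is shared between the two products.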

If pre-organismic nucleic acids were single-stranded, then reverse translation would require 3'-to-5' reading of the nucleic acid as well as 5'-to-3' reading. If, on the other hand, early nucleic acids were double-stranded, then 5'-to-3' (normal direction) translation of each strand would suffice to give one normal and one reverse translation product. (Note for non-biologists: In all known current organisms, reading of DNA and RNA takes place in the 5'-to-3' direction only.)

Nucleic acids (RNA and DNA) have directionality, defined by the orientation of sugar backbone molecules in terms of their 5' and 3' carbons.

It's interesting to speculate on the role of reverse translation in production of novel proteins, especially as it applies to early biological systems. We don't know if early systems relied on triplet codons (or even if all four bases—guanine, cytosine, adenine, thymine—existed from the beginning). We also don't know if there were 20 amino acids in the beginning. There may have been fewer (or more).

A novel possibility is that early triplet codons were palindromic (giving identical semantics when read in either direction). There are 16 palindromic codons in the codon lexicon (AGA, GAG, CAC, ACA, ATA, TAT, AAA, and so on) which today encode 15 amino acids out of the 20 commonly used. In a palindromic-codon world, the distinction between "sense" and "antisense" nucleic acid sequences vanishes, because a single-stranded gene made up of palindromic codons could be translated in either direction to give a polypeptide with the same sequence, the only chirality arising from N- to C-terminal polarity. For example, the sequence GGG-CAC-GCG-AAA would give a polypeptide of glycine-histidine-alanine-lysine whether translated forward or backward, the only difference being that the forward version would have glycine at the N-terminus whereas the reverse version would have glycine at the C-terminus. The secondary and tertiary structures of the two versions would be the same. As long as catalytic function didn't directly depend on an amino or carboxy terminus of an end-acid, the two proteins would also be functionally indistinguishable.
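The counts above are easy to verify. The sketch below enumerates the palindromic codons in the standard genetic code, counts the distinct amino acids they encode, and translates the example sequence in both directions (reading the same strand backward, which is the hypothetical 3'-to-5' translation, not reverse-complement translation).

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq: str) -> str:
    """Translate a DNA sequence codon by codon ('*' marks a stop codon)."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


# A codon is palindromic if it reads the same in both directions.
palindromic = sorted(codon for codon in CODON_TABLE if codon == codon[::-1])
encoded = {CODON_TABLE[codon] for codon in palindromic}
print(len(palindromic), "palindromic codons encoding", len(encoded), "amino acids")

gene = "GGGCACGCGAAA"
print(translate(gene))        # forward reading
print(translate(gene[::-1]))  # same strand read backward
```

The count comes to 15 rather than 16 amino acids because AGA and CGC both encode arginine; none of the palindromic codons is a stop codon.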

Codon palindromicity is potentially important in any system in which single-stranded genes are bidirectionally translated, because in the case where a gene does happen to rely heavily on palindromic codons, the reverse-translated product will (for the reasons just explained) have the potential to be functionally paralogous to the forward-translated product (to an extent matching the extent of palindromic-codon usage). But this assumes that in early organisms (or pre-organismic soups), single-stranded genes could be translated in the 5'-to-3' direction or the 3'-to-5' direction.

It turns out modern organisms differ markedly in the degree to which they use palindromic codons, and there are (remarkably) some prokaryotes whose genes use an average of ~40% palindromic codons. The complementary strand of DNA would, of course, contain palindromic complements: AGA opposite TCT, CCC opposite GGG, etc.
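Palindromic-codon usage for a given gene can be measured with a few lines of code. This is a sketch with a hypothetical function name; it assumes the input is an in-frame coding sequence.

```python
def palindromic_codon_fraction(cds: str) -> float:
    """Fraction of codons in an in-frame coding sequence that read the
    same forward and backward (e.g. AGA, CCC, TAT)."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    if not codons:
        raise ValueError("sequence is shorter than one codon")
    return sum(codon == codon[::-1] for codon in codons) / len(codons)


print(palindromic_codon_fraction("ATGGGGTTTCAC"))  # ATG is the only non-palindromic codon
```

Averaging this value over all the genes of an organism would give the kind of genome-wide percentage quoted above.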

All of this makes for interesting conjecture, but does any of it really apply to the natural world? For example: Do organisms actually employ strategies of "erroneous translation" in creating new proteins? Did today's microbial meta-proteome arise through mechanisms involving frameshifted and/or reverse translation? Is there any evidence of such processes, one way or the other? Tomorrow I want to continue on this theme, presenting a little data to back up some of these strange ideas. Please join me; and bring a biologist-friend with you!

1. Begun, D., et al. Evidence for de novo evolution of testis-expressed genes in the Drosophila yakuba/Drosophila erecta clade. Genetics 176, 1131–1137 (2007).
2. Feschotte, C., & Pritham, E. DNA transposons and the evolution of eukaryotic genomes. Annual Review of Genetics 41, 331–368 (2007).
3. Jones, C. D., & Begun, D. J. Parallel evolution of chimeric fusion genes. Proceedings of the National Academy of Sciences 102, 11373–11378 (2005).
4. Ohno, S. Evolution by Gene Duplication (Springer-Verlag, Berlin, 1970).

Wednesday, June 12, 2013

The Trouble with Darwin

As a biologist, I find Darwin's theory hugely disappointing. It's better than the alternative (which is to believe in magic, basically), but not by much, sadly.

Charles Darwin died before Mendel's work on inheritance was rediscovered.

As scientific theories go, the theory of evolution is easily the weakest of all major scientific theories. It's a commendable piece of work in its ability to stir discussion, but terrible in most other ways.

To be useful, a scientific theory has to do a minimum of two things: explain what can be observed, and provide testable predictions. Darwin's theory is weak on the first count and useless on the second.

Evolutionary theory explains practically nothing, because every explanation of the theory is rooted in "survival of the fittest," which is a circular notion, utterly content-free. "Fittest" means most able to survive. Survival of the fittest means survival of those who survive.

Ironically, Darwin's landmark work was called On the Origin of Species. Yet it doesn't actually explain speciation, except in the most vacuous and speculative of terms. Of course, we can't set too high an expectation for Darwin, since he didn't live to see the rediscovery of Mendel's work (the word "genetics" wouldn't exist until more than 20 years after Darwin's death), but still. Speciation is portrayed by Darwin as the outcome of the accumulation of small, gradual changes. That's all the explanation he offers.

But the explanation is wrong. Or at least it doesn't accord well with the facts. It doesn't explain the Cambrian Explosion, for example, or the sudden appearance of intelligence in hominids, or the rapid recovery (and net expansion!) of the biosphere in the wake of at least five super-massive extinction events in the most recent 15% of Earth's existence.

One of the most frustrating aspects of evolutionary theory (this is no fault of the theory's, though) is that it is so hard to test in the laboratory. The fact is, no one has ever seen speciation happen in the laboratory, under repeatable conditions, and until that happens we're at a distinct disadvantage for understanding speciation. (Incidentally, I don't count plant hybridization or breeding anomalies in fruit flies whose sexuality is under the control of microbial endosymbionts as examples of speciation.)

When I was in school, we were taught that mutations in DNA were the driving force behind evolution, an idea that is now thoroughly discredited. The overwhelming majority of non-neutral mutations are deleterious (they reduce, not increase, survival). Most mutations lead to loss of function (this is easily demonstrated in the lab), not gain of function. Evolutionary theory is great at explaining things like the loss of eyesight by cave-dwelling creatures (e.g., cave fish). It's terrible at explaining gain of function.

Even if mutations were capable of driving evolution, they simply don't happen fast enough to account for observed rates of speciation. In bacteria, the measured rate of 16S rRNA divergence due to point mutations is only 1% per 50 million years. And yet, there were no flowering plants on earth as recently as 150 million years ago! Does it take a biologist to see the disconnect?

I bring all this up because I've spent some time recently doing genomics research aimed at exploring mechanisms for new-protein creation/differentiation (mechanisms not relying wholly nor even mainly on point mutations), and I wanted to set the stage for discussing that research here. Over the next week or so, I'll be presenting some new ideas and findings. Hopefully, we can put some much-needed flesh on Darwin by exploring testable notions of how new protein motifs can arise quickly (without reliance on magic).

Monday, June 10, 2013

A Catalase Conundrum

When I was in grad school (U.C. Davis) in the late 1970s, the bacterial world was simply the prokaryotic world, and vice versa. There hadn't yet come a distinction between eubacteria and Archaea. But now we know, or think we know, that prokaryota come in two fundamental flavors: the true bacteria (eubacteria), and the Archaea (primitive extremophiles). If you were to want to count organelles (mitochondria, chloroplasts, others) as a third fundamental grouping, I suppose you could, with some justification.

At this writing, about 400 distinct Archaeal isolates, belonging to around 75 genera, have been DNA-sequenced. You can see a list of them by going to and looking in the Organisms box. You'll see over 200 organisms listed, but bear in mind they belong to only about 75 genera. (Most genera are represented by more than one species and/or more than one isolate per species, in other words.)

Salt-loving Archaea species have been found growing in borax-saturated desert ponds. The species growing in this small lake produce a carotenoid pigment that gives the water a pink appearance.
The Archaea were once thought to be exclusively anaerobes, but it turns out there are a couple dozen aerobic (or facultatively anaerobic) genera in the group. In my own spare-time research, I've found that about 20% of the 75 sequenced Archaeons (all of them obligate anaerobes) have a catalase gene. (Catalase is the enzyme that breaks hydrogen peroxide down to water and oxygen.) Oddly, very few of the aerobic Archaea (except for the Halobacteriaceae group) show any evidence of having catalase. This is exactly the reverse of what's expected. In the rest of the living kingdom (from bacteria to higher plants and animals), aerobes universally have catalase; strict anaerobes don't have catalase (or at least, they aren't supposed to; but see this post for some surprising exceptions).

This is a hugely unexpected finding: Many anaerobic Archaeons have catalase, but not all aerobic ones do. Some enterprising grad student should tackle this and make a thesis project out of it.

In case you're that student, here are some additional clues.

Let's back up for a second and look at the Big Picture. No matter where on the Tree of Life you go, catalases come in only a few major types. (See the excellent 2003 review paper by Chelikani, Fita, and Loewen for details.) For example, there are heme-containing and non-heme catalases. Most of the time, what we think of as "catalase" is heme-containing catalase (and yes, that means it contains iron). In the heme-containing group, you have monofunctional catalase as well as bifunctional catalase-peroxidases or hydroperoxidases (katG). The monofunctionals come in big- and small-subunit varieties. (The biggies have subunits of 75 kDa or more and comprise just over 2100 base-pairs of DNA. The smalls have subunits under 60 kDa and typically top out at 1500 base-pairs.)
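As a rough cross-check on those sizes, subunit mass can be estimated from gene length. The sketch below assumes an average residue mass of about 110 Da (a common rule of thumb, not a figure taken from the review) and ignores heme and any post-translational processing. The function name is hypothetical.

```python
AVG_RESIDUE_MASS_DA = 110.0  # assumed rule-of-thumb average mass per amino acid


def estimate_subunit_kda(gene_length_bp: int) -> float:
    """Estimate protein mass in kDa from coding length in bp (stop codon included)."""
    residues = gene_length_bp // 3 - 1  # drop the stop codon
    return residues * AVG_RESIDUE_MASS_DA / 1000.0


for bp in (1500, 2100):
    print(f"{bp} bp -> roughly {estimate_subunit_kda(bp):.0f} kDa")
```

Under that assumption, a 1500 bp gene lands below the 60 kDa small-subunit ceiling and a 2100 bp gene lands above the 75 kDa large-subunit floor, which is consistent with the sizes quoted above.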

Here's what you really need to know: Within the monofunctionals, there are three clades (major subgroupings) of catalase. Clades 1 and 3 are small-subunit enzymes. Clade 1 is primarily of plant origin and is relatively rare in bacteria (the best-known examples probably being katX of Bacillus subtilis and catF of Pseudomonas syringae). Clade 3 takes in a huge number of catalases from bacteria, fungi, and various eukaryotes. (For Clade 3, think Staphylococcus catalase.) Clade 2 is the large-subunit enzyme (think E. coli katE catalase).

The multifunctionals tend to be large (over 2100 base-pairs of DNA).

The non-heme catalases contain manganese instead of iron and are not your typical catalases. Let's leave it at that.

What do the Archaeons produce? From what little probing I've done, it seems the anaerobic Archaeons that have catalase use a modified Clade 3 type of enzyme that has little in common with other Clade 3 catalases. A few of the methane producers show good sequence agreement with Bacteroides fragilis catalase, but most anaerobic Archaeal catalases do not show good sequence concordance with any known eubacterial catalases. So it's entirely possible that a fourth clade of purely Archaeal small-subunit catalases (unlike anything else in the plant or animal worlds) awaits characterization.

The aerobic Archaeons that have catalase are all halophiles (members of the Halobacteriaceae), and all have large-subunit multifunctional peroxidases similar to those of the Cyanobacteria.

Mysteries waiting to be solved:
  • Why is it the aerobes Sulfolobus, Pyrobaculum, and Aeropyrum do not appear to have catalase? Is it that they don't have catalase, or do they have some as-yet-undiscovered new type of catalase?
  • Why is it that certain methane-generating anaerobes (e.g., Methanosarcina) have Clade 3 catalases but the rest of the methane-producing Archaea have catalases that don't match anything else in the living world? Did the former group get their catalase(s) by way of horizontal gene transfer from anaerobic eubacteria?
  • Did the multifunctional catalases of the Halobacteriaceae originally come from cyanobacteria (perhaps by way of plasmids)?
  • What overlap, if any, exists between Archaeal catalases and the catalases of algal chloroplasts?
If you find the answers to any of these, let me know!

Saturday, June 08, 2013

Strict Anaerobes that Produce Catalase

One thing every new bacteriology student learns on Day One is that some microbes are strict anaerobes (completely unable to use oxygen), and a universal characteristic of strict anaerobes is that they lack an important enzyme called catalase that breaks down hydrogen peroxide to oxygen and water. The idea is that anaerobes don't need to have catalase, because they don't live in the kind of highly oxidized environments where hydrogen peroxide forms. Lack of catalase is supposedly why many anaerobes are killed upon exposure to air. According to legend, once oxygen gets into the cells, hydrogen peroxide starts to build up, and with no catalase to break it down, anaerobes choke on toxic peroxides.

I'll let you in on a little secret, though. This nice-sounding story (about peroxide buildup killing anaerobes upon exposure to air) turns out to be mostly conjecture, not well supported by science. Even the bit about anaerobes lacking catalase isn't completely true. Many anaerobes do make catalase.

For today's post, I did a protein-sequence BLAST search against several families of obligate anaerobes using the katA gene of Proteus mirabilis as a reference, and I was quickly able to identify two dozen strict anaerobes that do, in fact, have a catalase gene (see table below).

Table 1: Strict Anaerobes that Produce Catalase
(tblastn query: Proteus mirabilis katA gene)

Length (AA)
Percent identities
Alkaliphilus metalliredigens strain QYMF
Anaerococcus prevotii strain DSM 20548
Anaerococcus vaginalis strain ATCC 51170
Bacteroides coprocola strain DSM 17136
Bacteroides coprophilus strain DSM 18228
Bacteroides eggerthii strain 1_2_48FAA
Bacteroides intestinalis strain DSM 17393
Bacteroides ovatus strain 3_8_47FAA
Bacteroides plebeius strain DSM 17135
Bacteroides thetaiotaomicron strain VPI-5482
Clostridium botulinum A3 strain Loch Maree
Clostridium botulinum B1 strain Okra
Clostridium hathewayi strain WAL-18680
Clostridium lentocellum strain DSM 5427
Clostridium phytofermentans strain ISDg
Desulfitobacterium dichloroeliminans strain LMG P-21439
Desulfitobacterium hafniense DCB-2
Desulfosporosinus youngiae strain DSM 17734
Desulfotomaculum ruminis strain DSM 2154
Dethiobacter alkaliphilus strain AHT 1
Lachnospiraceae bacterium strain 3_1_57FAA_CT1
Propionibacterium acnes strain 266
Syntrophobotulus glycolicus strain DSM 8271
Veillonella sp. strain 3_1_44

Each entry in this table represents a protein-sequence (not DNA sequence) match between a gene in the organism listed and the catalase gene of Proteus mirabilis. (Proteus is a facultative anaerobe related to E. coli and Salmonella.) The length of each organism's catalase enzyme, in amino acids, is shown under Length. (By way of reference, the Proteus catalase is 484 amino acids long.) E-value is the so-called expectation value, a measure of how likely the sequence match would be by chance. All of the values shown are extraordinarily low. "Percent identities" is the percentage of amino-acid matches between the Proteus enzyme and the target organism's enzyme. Values in the 30% to 40% range are not unusual for functionally related enzymes in otherwise distantly related organisms. Values above 60% tend to suggest a phylogenetic relationship, whereas in two organisms that are known to be unrelated, a value above 70% would (in many cases) be considered evidence of possible horizontal gene transfer. 
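To make "percent identities" concrete, here is a minimal sketch of how such a figure can be computed from two protein sequences that have already been aligned. The sequences are made-up toy strings, not real catalase fragments, and the function name is hypothetical. The denominator used here (all alignment columns, gaps included) is one common convention, so check it against the BLAST report before comparing numbers.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns where both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("empty alignment")
    # A gap ('-') never counts as a match, even if both rows have one.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


query = "MKT-HLTRDNG"    # toy aligned query (hypothetical)
subject = "MRTAHLT-DNG"  # toy aligned subject (hypothetical)
print(f"{percent_identity(query, subject):.1f}% identity")
```

The alignment itself (where to place the gaps) is the hard part and is what BLAST does. This only shows the final counting step.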

Here's the protein-blast query sequence I used, in case you want to verify these results (or go looking for more catalase-producing anaerobes):

>Proteus mirabilis strain HI4320(v1, unmasked), Name: PMI1740, YP_002151471.1, katA, Type: CDS, Feature Location: (Chr: 1, 1861974..1863428) Genomic Location: 1861974-1863428

ADDENDUM: After writing this post, I found that catalase also occurs in Archaeons. See this post for details.

Saturday, June 01, 2013

A New Biological Constant?

Earlier, I gave evidence for a surprising relationship between the amount of G+C (guanine plus cytosine) in DNA and the amount of "purine loading" on the message strand in coding regions. The fact that message strands are often purine-rich is not new, of course; it's called Szybalski's Rule. What's new and unexpected is that the amount of G+C in the genome lets you predict the amount of purine loading. Also, Szybalski's rule is not always right.

Genome A+T content versus message-strand purine content (A+G) for 260 bacterial genera. Chargaff's second parity rule predicts a horizontal line at Y = 0.50. (Szybalski's rule says that all points should lie at or above 0.50.) Surprisingly, as A+T approaches 1.0, A/T approaches the Golden Ratio.
When you look at coding regions from many different bacterial species, you find that if a species has DNA with a G+C content below about 68%, it tends to have more purines than pyrimidines on the message strand (thus purine-rich mRNA). On the other hand, if an organism has extremely GC-rich DNA (G+C > 68%), a gene's message strand tends to have more pyrimidines than purines. What it means is that Szybalski's Rule is correct only for organisms with genome G+C content less than 68%. And Chargaff's second parity rule (which says that A=T and G=C even within a single strand of DNA) is flat-out wrong all the time, except at the 68% G+C point, where Chargaff is right now and then by chance.

Since the last time I wrote on this subject, I've had the chance to look at more than 1,000 additional genomes. What I've found is that the relationship between purine loading and G+C content applies not only to bacteria (and archaea) and eukaryotes, but to mitochondrial DNA, chloroplast DNA, and virus genomes (plant, animal, phage), as well.

The accompanying graphs tell the story, but I should explain a change in the way these graphs are prepared versus the graphs in my earlier posts. Earlier, I plotted G+C along the X-axis and purine/pyrimidine ratio on the Y-axis. I now plot A+T on the X-axis instead of G+C, in order to convert an inverse relationship to a direct relationship. Also, I now plot A+G (purines, as a mole fraction) on the Y-axis. Thus, X- and Y-axes are now both expressed in mole fractions, hence both are normalized to the unit interval (i.e., all values range from 0 to 1).
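For readers who want to compute the two plotted quantities themselves, here is a minimal sketch that derives the A+T and A+G mole fractions from a message-strand sequence. It works from raw sequence rather than from codon-usage tables, and the function name and the toy sequence are hypothetical.

```python
from collections import Counter


def mole_fractions(message_strand: str) -> tuple:
    """Return (A+T, A+G) as mole fractions of the unambiguous bases."""
    counts = Counter(message_strand.upper())
    total = sum(counts[base] for base in "ACGT")  # ignore N and other codes
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    at = (counts["A"] + counts["T"]) / total  # X-axis value
    ag = (counts["A"] + counts["G"]) / total  # Y-axis value (purines)
    return at, ag


x, y = mole_fractions("ATGGCGAAAGGTACCGAATAA")  # toy coding sequence, not real data
print(f"A+T = {x:.3f}, A+G = {y:.3f}")
```

In practice the input would be the concatenated coding regions of a genome, read in the message-strand direction.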

The graph above shows the relationship between genome A+T content and purine content of message strands in genomes for 260 bacterial genera. The straight line is a least-squares regression fit (it minimizes the sum of squared errors). The line conforms to:

y = a + bx
a =  0.45544384965539358
b =  0.14454244707261443

The line predicts that if a genome were to consist entirely of G+C (guanine and cytosine), it would be 45.54% guanine, whereas if (in some mythical creature) the genome were to consist entirely of A+T (adenine and thymine), adenine would comprise 59.99% of the DNA. Interestingly, the 95% confidence interval permits a value of 0.61803 at X = 1.0, which would mean that as guanine and cytosine diminish to zero, A/T approaches the Golden Ratio.
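The arithmetic behind those endpoint predictions can be checked in a few lines. This sketch uses the fitted constants quoted above. The variable and function names are hypothetical, and the crossover calculation (where the line meets Y = 0.50) is my own addition for comparison with the roughly 68% G+C figure mentioned earlier.

```python
A_INTERCEPT = 0.45544384965539358  # fitted offset quoted in the post
B_SLOPE = 0.14454244707261443      # fitted slope quoted in the post


def purine_fraction(at_fraction: float) -> float:
    """Predicted message-strand A+G mole fraction for a given genome A+T."""
    return A_INTERCEPT + B_SLOPE * at_fraction


ag_at_zero = purine_fraction(0.0)  # all-G+C genome: purines are all guanine
ag_at_one = purine_fraction(1.0)   # all-A+T genome: purines are all adenine
a_over_t = ag_at_one / (1.0 - ag_at_one)

# Genome A+T at which purines and pyrimidines balance (Y = 0.50).
crossover_at = (0.5 - A_INTERCEPT) / B_SLOPE

golden_ratio = (1.0 + 5.0 ** 0.5) / 2.0

print(f"A+G at A+T = 0: {ag_at_zero:.4f}")
print(f"A+G at A+T = 1: {ag_at_one:.4f}  (A/T = {a_over_t:.3f})")
print(f"Crossover at G+C = {1.0 - crossover_at:.3f}")
print(f"A/T if A = 0.61803: {0.61803 / (1.0 - 0.61803):.4f} vs golden ratio {golden_ratio:.4f}")
```

Note that the point estimate at X = 1.0 gives an A/T ratio near 1.5. The Golden Ratio corresponds to the 0.61803 value that the confidence interval permits, not to the fitted line itself.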

Do the most primitive bacteria (Archaea) also obey this relationship? Yes, they do. In preparing the graph below, I analyzed codon usage in 122 Archaeal genera to obtain A, G, T, and C relative proportions in coding regions of genes. As you can see, the same basic relationship exists between purine content and A+T in Archaea as in Eubacteria. Regression analysis yielded a line with a slope of 0.16911 and a vertical offset of 0.45865. So again, it's possible (or maybe it's just a very strange coincidence) that A/T approaches the Golden Ratio as A+T approaches unity.

Analysis of coding regions in 122 Archaea reveals that the same relationship exists between A+T content and purine mole-fraction (A+G) as exists in eubacteria.
For the graph below, I analyzed 114 eukaryotic genomes (everything from fungi and protists to insects, fish, worms, flowering and non-flowering plants, mosses, algae, and sundry warm- and cold-blooded animals). The slope of the generated regression line is 0.11567 and the vertical offset is 0.46116.

Eukaryotic organisms (N=114).

Mitochondria and chloroplasts (see the two graphs below) show a good bit more scatter in the data, but regression analysis still comes back with positive slopes (0.06702 and 0.13188, respectively) for the least-squares line.
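The regression lines quoted throughout this post are ordinary straight-line fits. The fitting software actually used is not identified here, so what follows is only a generic least-squares sketch in plain Python. The function name is hypothetical, and the sample points are invented for illustration rather than taken from any of the genome datasets.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    offset = mean_y - slope * mean_x
    return offset, slope


# Invented (A+T, A+G) pairs, for illustration only.
at_values = [0.30, 0.40, 0.50, 0.60, 0.70]
ag_values = [0.50, 0.52, 0.53, 0.54, 0.56]
a, b = fit_line(at_values, ag_values)
print(f"offset a = {a:.4f}, slope b = {b:.4f}")
```

A positive slope from a fit like this is what the graphs report. How tight the scatter is around the line is a separate question that the slope alone does not answer.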

Mitochondrial DNA (N=203).
Chloroplast DNA (N=227).
To see if this same fundamental relationship might hold even for viral genetic material, I looked at codon usage in 229 varieties of bacteriophage and 536 plant and animal viruses ranging in size from 3 kb to over 200 kb. Interestingly enough, the relationship between A+T and message-strand purine loading does indeed apply to viruses, despite the absence of dedicated protein-making machinery in a virion.

Plant and animal viruses (N=536).
Bacteriophage (N=229).
For the 536 plant and animal viruses (above left), the regression line has a slope of 0.23707 and reaches Y = 0.62337 at X = 1.0. For bacteriophage (above right), the line's slope is 0.13733 and the vertical offset is 0.46395, so the line reaches Y = 0.60128 at X = 1.0. (When inspecting the graphs, take note that the vertical-axis scaling is not the same for each graph. Hence the slopes are deceptive.) So again, it's possible A/T approaches the golden ratio as A+T approaches 100%.
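To compare the groups side by side, this sketch collects the slopes and offsets quoted in this post and evaluates each line at X = 1.0. The dictionary layout is hypothetical. The virus offset is not stated in the text, so it is back-calculated here from the quoted slope and the quoted value at X = 1.0. Mitochondria and chloroplasts are left out because only their slopes are given.

```python
# (offset, slope) pairs as quoted in the post.
fits = {
    "bacteria": (0.45544, 0.14454),
    "archaea": (0.45865, 0.16911),
    "eukaryotes": (0.46116, 0.11567),
    "bacteriophage": (0.46395, 0.13733),
    # Offset back-calculated: quoted Y at X = 1.0 minus quoted slope.
    "plant/animal viruses": (0.62337 - 0.23707, 0.23707),
}

at_one = {name: offset + slope for name, (offset, slope) in fits.items()}

for name, value in at_one.items():
    print(f"{name:22s} A+G at A+T = 1.0: {value:.5f}")
```

The endpoint values differ from group to group, so the Golden Ratio reading depends on which fit, and how much of its confidence interval, one is willing to accept.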

The fact that viral nucleic acids follow the same purine trajectories as their hosts perhaps shouldn't come as a surprise, because viral genetic material is (in general) highly adapted to host machinery. Purine loading appropriate to the A+T milieu is just another adaptation.

It's striking that so many genomes, from so many diverse organisms (eubacteria, archaea, eukaryotes, viruses, bacteriophages, plus organelles), follow the same basic law of approximately

A+G = 0.46 + 0.14 * (A+T)

The above law is as universal a law of biology as I've ever seen. The only question is what to call the slope term. It's clearly a biological constant of considerable significance. Its physical interpretation is clear: It's the rate at which purines are accumulated in mRNA as genome A+T content increases. It says that a one-percentage-point increase in A+T content (or a one-percentage-point decrease in genome G+C content) is worth a 0.14-percentage-point increase in purine content in message strands. Maybe it should be called the purine rise rate? The purine amelioration rate?

Biologists, please feel free to get in touch to discuss. I'm interested in hearing your ideas. Reach out to me on LinkedIn, or simply leave a comment below.