blogorrhea: E. coli

One of biology's big open questions is why organisms differ so much with regard to the relative amounts of GC and AT in their DNA. You'd think that if there are only two kinds of DNA base pairs (see diagram) they'd be more-or-less equally abundant. Not so. There are organisms with DNA that's mostly GC (and/or CG) pairs; there are organisms with very-AT-rich DNA; and within the chromosomes of higher organisms you find large GC-rich regions (isochores) in the midst of great swaths of AT-rich DNA.

DNA contains adenine and thymine in equal amounts, and
guanine and cytosine in equal amounts, but it does not
usually contain GC pairs and AT pairs in equal amounts. And
it doesn't seem as if there is an "optimum" GC:AT ratio. The
GC:AT ratio varies by species. Within a species, it's constant.

There are two really odd facts at work here:

1. The GC content of DNA varies by species, and it varies a lot.

2. Evolution doesn't seem to trend toward an "optimum GC:AT ratio" of any kind.

If there were such a thing as an optimum GC:AT ratio for DNA, surely microorganisms would've figured it out by now. Instead, we find huge diversity: There are bacteria at every point on the GC% spectrum, running from 16% GC for the DNA of Candidatus Carsonella ruddii (a symbiont of the jumping plant louse) to 75% for Anaeromyxobacter dehalogenans 2CP-C (a soil bacterium). At each end of the spectrum you find aerobes and anaerobes; extremophiles and blandophiles; pathogens and non-pathogens. About the only generalization you can make is that the smaller an organism's genome is, the more likely it is to be rich in A+T (low GC%).

Genome size correlates loosely with GC content. The very smallest
bacteria tend to have AT-rich (low GC%) DNA.

The huge diversity in GC:AT ratios among bacteria is impressive. But does it simply represent a random walk all over the possibility-space of DNA? Or do the various points on the spectrum constitute special niches with important advantages? What advantage could there be for having high-GC% DNA? Or high-AT% DNA?

Some subtle clues tell us that this is not just random deviation from the mean. First, suppose we agree for the sake of argument that lateral gene transfer (LGT) is common in the microbial world (a point of view I happen to agree with). Over the course of millions of years, with pieces of DNA of all kinds (high GC%, low GC%) flying back and forth, LGT should force a regression to the mean: It should make genomes tend toward a 50-50 GC:AT ratio. That clearly hasn't happened.

And then there are ordinary mutational pressures. It's beginning to be fairly well accepted (see Hershberg and Petrov, "Evidence That Mutation is Universally Biased Toward AT in Bacteria," PLoS Genetics, 2010, 6:9, e1001115, full version here) that natural mutation is strongly biased in the direction of AT by virtue of the fact that deamination of cytosine and methylcytosine (which occurs spontaneously at high frequency) leads to replacement of 'C' with 'T', hence GC pairs becoming AT pairs. The strong natural mutational bias toward AT says that all DNA should creep in the direction of low GC% and end up well below 50% GC. But again, this is not what we see. We see that high-GC organisms like Anaeromyxobacter (and many others) maintain their DNA's unusually high (75%) GC content across millions of generations. Even middle-of-the-road organisms like E. coli (with 50% GC content) don't slowly slip in the direction of high-AT/low-GC.

Clearly, something funny is going on. For a super-high-GC organism like Anaeromyxobacter to maintain its DNA's super-high GC content against the constant tug of mutations in the AT direction, it must be putting significant energy into maintaining that high GC percentage. But why? Why pay extra to maintain a high GC%? And how does the cost get paid?

I think I've come up with a possible answer. It has to do with DNA replication cost, where "cost" is figured in terms of time needed to synthesize a new copy of the DNA (for cell division). Anything that favors low replication cost (high replication speed) should favor survival; that's my main assumption.

My other assumption is that DNA polymerases (the enzymes involved in replication) are not clairvoyant. They can't know, until the need arises, which of the four deoxyribonucleoside triphosphates (dATP, dTTP, dGTP, dCTP) will be needed at a given moment, to elongate the new strand of DNA. When the need arises for (let's say) an 'A', the 'A' (in the form of dATP) has to come from an existing endogenous pool of dNTPs containing all four bases (dATP, dTTP, dGTP, dCTP) in whatever concentrations they're in. The enzyme has to wait until a dATP (if that's what's needed) randomly happens to lock into the active site. Odds are only one in four (assuming equal concentrations of dNTPs) of a dATP coming along at exactly the right moment. Odds are 3 out of 4 that some incorrect dNTP (either dGTP, dTTP, or dCTP) will try, and fail, to fit the active site first, before dATP comes along.

But imagine that your DNA is 75% G+C. And suppose you've regulated your intracellular metabolism to maintain dGTP and dCTP in a 3:1 ratio over dATP and dTTP. The odds of a good random "first hit" go up.
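To put a rough number on that, here's a small Python sketch (an illustration only, not the simulation software described below). It assumes G = C and A = T in both the template and the pool, and that a fetch is just a uniform random draw from the pool; `first_hit_probability` is a made-up name.

```python
def first_hit_probability(template_gc, pool_gc):
    """Chance that the first randomly fetched dNTP is the one called for.

    Assumes G = C and A = T in both template and pool (an idealization).
    """
    # Each GC-type base is template_gc / 2 of the template and pool_gc / 2
    # of the pool; likewise for the two AT-type bases.
    gc_term = 2 * (template_gc / 2) * (pool_gc / 2)
    at_term = 2 * ((1 - template_gc) / 2) * ((1 - pool_gc) / 2)
    return gc_term + at_term


print(first_hit_probability(0.50, 0.50))  # 0.25
print(first_hit_probability(0.75, 0.75))  # 0.3125
```

So with everything balanced, the first fetch succeeds one time in four; with a 75%-GC genome drawing on a 75%-GC pool, it succeeds a little over 31% of the time.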

To simulate the various possibilities, I wrote software (in JavaScript) that simulates DNA replication, where the template DNA molecule is 1000 base-pairs in length and the dNTP pool size is 10000 bases. The software allows you to set the organism's genome GC% to whatever you want, and also set the dNTP pool's relative GC percentage to whatever you want. The template DNA is just a random string of A, T, G, and C bases (1000 total), reflecting their relative abundances as set in the GC% parameter. The pool of dNTPs is set up to be a randomized array (again reflecting abundances set in a GC% parameter).

The way the software works is this. Read a base off the template. Fetch a base randomly from the base pool. If the base happens to be the one (out of four) that's called for, score '1' for the timing parameter, and continue to read another base off the template. If the base was not the one that's called for, put it back in the pool array in a random location, then randomly fetch another base from the pool; and increment the timing parameter. (For each fetch, the timing parameter goes up by 1.) Keep fetching (and throwing back bases) until the proper base comes up, incrementing the time parameter as appropriate. (The time parameter keeps track of the number of fetch attempts.) When the correct base turns up, the pool shrinks by one base. In other words, replication consumes the pool, but as I said earlier, the pool contains ten times as many bases (to start) as the DNA template. So the pool ends up 10% smaller at the end of replication.
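Here is a minimal Python sketch of that loop. The original was JavaScript and isn't reproduced in this post, so the function names, the seeding, and the choice of the complementary base as the one "called for" are all assumptions. Throwing a wrong base back into a random slot is modeled by simply leaving it where it is, which should amount to the same thing when every fetch is a uniform random draw.

```python
import random
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def make_template(length, gc_fraction, rng):
    """Random template whose bases follow the requested G+C fraction on average."""
    at = 1 - gc_fraction
    weights = [gc_fraction / 2, gc_fraction / 2, at / 2, at / 2]
    return rng.choices("GCAT", weights=weights, k=length)


def make_pool(size, gc_fraction):
    """Pool of free bases, with G = C and A = T (assumed)."""
    n_gc_each = round(size * gc_fraction / 2)
    n_at_each = size // 2 - n_gc_each
    return list("G" * n_gc_each + "C" * n_gc_each
                + "A" * n_at_each + "T" * n_at_each)


def replicate(template_gc, pool_gc, template_len=1000, pool_size=10_000, rng=None):
    """Copy one template; return the average number of fetches per base."""
    rng = rng or random.Random()
    template = make_template(template_len, template_gc, rng)
    pool = make_pool(pool_size, pool_gc)
    remaining = Counter(pool)
    fetches = 0
    for base in template:
        wanted = COMPLEMENT[base]
        if remaining[wanted] == 0:
            raise ValueError(f"pool ran out of {wanted}")
        while True:
            i = rng.randrange(len(pool))  # fetch a random base from the pool
            fetches += 1                  # every fetch costs one time unit
            if pool[i] == wanted:
                pool[i] = pool[-1]        # consume it: the pool shrinks by one
                pool.pop()
                remaining[wanted] -= 1
                break
            # wrong base: it stays in the pool and we fetch again
    return fetches / template_len


def average_fetches(template_gc, pool_gc, runs=100, seed=0):
    """Mean fetches per base over several Monte Carlo runs."""
    rng = random.Random(seed)
    return sum(replicate(template_gc, pool_gc, rng=rng) for _ in range(runs)) / runs


if __name__ == "__main__":
    print(average_fetches(0.50, 0.50))
```

Calling `average_fetches(0.75, 0.65)` corresponds to a single point on a graph of fetches per base against pool GC%.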

Each point on this graph represents the average of 100 Monte Carlo runs, each run representing complete replication of a 1000-bp DNA template, drawing from a pool of 10,000 bases. The blue points are runs that used a DNA template containing 25% G+C content. The red points are runs that used DNA with 75% G+C. The X-axis represents different base-pool compositions. See text for details.

I ran Monte Carlo simulations for DNA templates having GC contents of 75%, 50%, and 25%, using base pools set up to have anywhere from 15% GC to 85% (in 2.5% increments). The results for the 75% GC and 25% GC templates (representing high- and low-GC organisms) are shown in the above graph. Each point on the graph represents the average of 100 complete replication runs. The Y-axis shows the average number of fetches per DNA base (so, a low value means fast replication; a high value means slower DNA replication). The X-axis shows the percentage of GC in the base-pool, in recognition of the fact that relative dNTP abundances in an organism may vary, in accordance with environmental constraints as well as with organism-specific homeostatic setpoints.

Maximal replication speed (the low point of each curve) happens at a base-pool GC percentage that is displaced in the direction of the DNA's own GC%. So, for the 25%-GC organism (blue data points), max replication efficiency comes when the base-pool is about 33% GC. For the 75% GC organism (red points) the sweet spot is at a base-pool GC concentration of 65%. (Why this is not exactly symmetrical with the other curve, I don't know; but bear in mind, these are Monte Carlo runs. Some variation is to be expected.)
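As a rough cross-check on where those low points ought to fall, here's a Python sketch of an idealized expectation rather than the Monte Carlo itself. It assumes each fetch is an independent draw from a pool whose composition doesn't change (so it ignores the 10% drawdown), with G = C and A = T; under that assumption, waiting for a base that makes up a share q of the pool takes 1/q fetches on average. The function names are hypothetical.

```python
def expected_fetches(template_gc, pool_gc):
    """Idealized mean fetches per template base (fixed pool composition)."""
    # A G or C on the template waits on a base that is pool_gc / 2 of the pool;
    # an A or T waits on one that is (1 - pool_gc) / 2 of the pool.
    return template_gc * 2 / pool_gc + (1 - template_gc) * 2 / (1 - pool_gc)


def best_pool_gc(template_gc):
    """Scan pool GC from 15% to 85% in 2.5% steps; return (pool_gc, fetches)."""
    grid = [(150 + 25 * k) / 1000 for k in range(29)]
    best = min(grid, key=lambda p: expected_fetches(template_gc, p))
    return best, expected_fetches(template_gc, best)


for gc in (0.25, 0.50, 0.75):
    pool, cost = best_pool_gc(gc)
    print(f"template {gc:.0%} GC: best pool {pool:.1%} GC, {cost:.2f} fetches per base")
```

On this grid the idealized minimum lands at 37.5% pool GC for the 25%-GC template and 62.5% for the 75%-GC template, so in the idealization the two curves mirror each other.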

The interesting thing to note is that max replication efficiency, for each organism, comes at 3.73 fetches per base-pair (Y-axis). Cache that thought. It'll be important in a minute.

The real jaw-dropper is what happens when you plot a curve for template DNA with 50% GC content. In the graph below, I've shown the 50%-GC runs as black points. (The red and blue points are exactly as before.)

This is the same graph as before, but with replication data for a 50%-GC genome (black points). Again, each data point represents the average of 100 Monte Carlo runs. Notice that the black curve bottoms out at a higher level (4.0) than the red or blue curves (3.73). This means replication is less efficient for the 50%-GC genome.

Notice that the best replication efficiency comes in the middle of the graph (no big surprise), but check the Y-value: 4.00. The very fastest DNA replication, when the DNA template is 50% GC, requires 4 fetches per base, compared to best-case base-fetching efficiency of 3.73 for the 25%-GC and 75%-GC DNAs. What does this mean? It means DNA replication, in a best-case scenario, is 6.75% more efficient for the skewed-GC organisms. (The difference between 3.73 and 4.00 is 6.75% of 4.00.)
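Those two floor values can also be recovered on paper, at least for an idealized version of the simulation. Assume (as a simplification) that each fetch is an independent draw from a pool whose composition stays fixed, with G = C and A = T. Write g for the template's GC fraction, p for the pool's GC fraction, and E(p) for the expected number of fetches per base. A base that makes up a share q of the pool takes 1/q fetches on average, so:

```latex
\begin{align*}
E(p) &= \frac{2g}{p} + \frac{2(1-g)}{1-p}, \\
\frac{dE}{dp} = 0 \;&\Rightarrow\; \frac{p^{*}}{1-p^{*}} = \sqrt{\frac{g}{1-g}}, \\
E(p^{*}) &= 2\left(\sqrt{g} + \sqrt{1-g}\right)^{2}.
\end{align*}
```

For g = 0.25 or g = 0.75 this gives 2 + √3, or about 3.73 fetches per base; for g = 0.5 it gives exactly 4. The relative gap, (4.00 − 3.73)/4.00, is the 6.75% figure.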

This goes a long way toward explaining why GC extremism is stable in organisms that pursue it. There is replication efficiency to be had in keeping your DNA biased toward high or low GC. (It doesn't seem to matter which.)

Consider the dynamics of an ATP drawdown. The energy economy of a cell revolves around ATP, which is both an energy molecule and a source for the adenine that goes into DNA and RNA. One would expect normal endogenous concentrations of ATP to be high relative to other NTPs. For a low-GC% organism, that's also a near-ideal situation for DNA replication, because high AT in the base pool puts you near the max-replication-speed part of the curve (see blue points). A sudden drawdown in ATP (when the cell is in crisis) shifts replication speed to the right-hand part of the blue curve, slowing replication significantly. This is what you want if you're an intracellular symbiont (or a mitochondrion, incidentally). You want to stop dividing when the host cell is unable to divide because of an energy crisis.

Consider the high-GC organism (red dots), on the other hand. If ATP levels are high during normal metabolism, replication is not as efficient as it could be, but so what? It just means you're willing to tolerate less-efficient replication in good times. But as ATP draws down (perhaps because nutrients are becoming scarce), DNA replication actually becomes more efficient. This is what you want if you're a free-living organism in the wild. You want to be able to continue replicating your DNA even as ATP becomes scarce. And indeed that's what happens (according to the red data points): As the base pool becomes more GC-rich, replication efficiency increases. The best efficiency comes when base-pool A+T is down around 35%.

I think these simulations are meaningful and I think they help explain the DNA-composition extremism seen among microorganisms. If you're a professional scientist and you find these results tantalizing, and you'd like to co-author a paper for PLoS Genetics (or another journal), please get in touch. (My Google mail is kas-dot-e-dot-thomas.) I'd like to coauthor with someone who is good with statistics, who can contribute more ideas to this line of investigation. I think these results are worth sharing with the scientific community at large.

I'm about to offer a conjecture that is a bit preposterous-sounding but could well hold true. I actually think it does.

I propose that evolution, at the level of bacteria (though probably not at higher levels), is driven by hydrogen peroxide.

This theory rests on three assumptions. First, the creation of new bacterial species happens almost entirely via lateral gene transfer, not heritable point mutations. Second, bacteria (marine and terrestrial) are regularly exposed to challenges by hydrogen peroxide in the environment. Third, those challenges drive lateral gene transfer.

Evidence for the first assumption is embarrassingly abundant. If you're not up to speed on the subject, I suggest you read the excellent paper, "Lateral Gene Transfer," by Olga Zhaxybayeva and W. Ford Doolittle in Current Biology, April 2011, 21:7, pp. R242-246 (unlocked copy here). It's now common to find that any given bacterial species can trace a good percentage of its protein base to "ancestors" that are too far removed horizontally to be ancestors in the conventional sense.

Consider E. coli. There are hundreds of strains of E. coli, with genes ranging in number from 4,100 to about 5,300 per strain. The problem is, the various strains of E. coli have only about 900 genes in common (and that's far too few genes to render a fully functional E. coli). The E. coli pan-genome actually takes in more than 15,000 gene families, total. Certainly, you can draw a family tree of E. coli based on 16S ribosomal polymorphisms, but that doesn't explain where the 15,000 pan-genome genes came from. The "family tree" metaphor quickly breaks down if you start drawing trees based on proteins. You get many conflicting trees—all of them correct.

Trees like this are fiction where bacteria are concerned.
The tree of life is more like a net of life or web
of life than a directed acyclic graph.

Where are all of the genes coming from? Other species, of course. They arrive by way of mechanisms like transformation, transduction, and conjugation, all of which allow direct entry of foreign DNA into a bacterial cell. At one time it was thought that conjugation could only occur between bacteria of the same species, but it is now known that cross-species conjugation also occurs (as, for example, between E. coli and Streptomyces or Mycobacterium).

Transduction, which is where viruses package up an infected host's genes in virus capsules that are then taken up by another cell, occurs naturally in bacterial populations in response to environmental factors like ultraviolet light and hydrogen peroxide. Exposure of a virus-carrying (lysogenic) cell to UV light or peroxide can induce runaway production of virus, and in fact this mechanism is used by Streptococcus to kill competitive Staphylococcus cells, in a clever bit of chemical warfare. It's been known for years that hydrogen peroxide can cause many types of bacteria to shed DNA. Now we know why: Hydrogen peroxide is a signalling molecule. It signals (among other things) lysogenic bacteria to go into a lytic cycle. It also signals cells to mount what's known as the SOS response, which is a global response to DNA damage, including the damage done by oxidative challenge. Years ago, Bruce Ames and his colleagues showed that exposing Salmonella to very dilute (60 micromolar) hydrogen peroxide caused the cells to differentially express 30 "SOS" proteins, including heat-shock proteins and low-fidelity DNA-repair systems. We know that hydrogen peroxide as dilute as 0.1 micromolar can induce phage (virus) production in up to 11% of marine bacteria. This is significant, because rainwater contains hydrogen peroxide in concentrations of 2 to 40 micromolar, and ocean water has been known to reach millimolar levels of H2O2 after a rain storm.

If you're wondering why rain contains hydrogen peroxide, the peroxide gets there in two ways. One is UV-frequency photochemistry (where water is cleaved to H and OH, then reforms as H2 and H2O2); the other is via ionization reactions caused by lightning. (Lightning is energetic enough to bring airborne oxygen and water to a plasma state. The resulting ionization and rearrangement of free atoms yields a certain amount of hydrogen peroxide.) The presence of H2O2 in rainwater has been confirmed many times, and in fact there's a well-preserved "fossil record" of it in polar icepacks, going back centuries. (Polar snowpacks contain from 10 to 900 ppb of H2O2; it varies seasonally, the max coming in summer.)
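The snowpack figures are in ppb while the rainwater figures are micromolar, so here's a quick conversion sketch in Python. It assumes ppb means parts per billion by mass and that meltwater weighs about 1 kg per liter (both are assumptions for illustration, not something the sources state); the molar mass of H2O2 is about 34.01 g/mol, and `ppb_to_micromolar` is a made-up name.

```python
H2O2_MOLAR_MASS = 34.01  # g/mol


def ppb_to_micromolar(ppb, molar_mass=H2O2_MOLAR_MASS):
    """Convert ppb (by mass, in water at ~1 kg/L) to micromoles per liter."""
    # 1 ppb by mass is about 1 microgram per liter of water;
    # micrograms per liter divided by g/mol gives micromoles per liter.
    return ppb / molar_mass


for ppb in (10, 900):
    print(f"{ppb} ppb is roughly {ppb_to_micromolar(ppb):.1f} micromolar")
```

Under those assumptions, 10 to 900 ppb works out to roughly 0.3 to 26 micromolar, which overlaps the 2 to 40 micromolar range quoted for rainwater.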

Bottom line, every rain event (over land, over sea) constitutes a hydrogen peroxide challenge for microbes, which induces viral transduction (and a release of whole-cell DNA through lysis, some of which will inevitably be used in transformation). It also induces low-fidelity DNA repair (which is guaranteed to help evolution along). Every rain event, in other words, is a chance for evolution to do its thing. For bacteria, that means gene-sharing within and across species lines.

Darwin's theory of a tree-like ancestor basis
for all living things is dead wrong, at
least for bacteria.

W. Ford Doolittle (who wrote a classic book chapter about lateral gene transfer called "If the Tree of Life Fell, Would We Recognize the Sound?") estimates that if a horizontal gene transfer occurs once every ten billion vertical replications, "it would be enough to ensure that no gene in any modern genome has an unbroken history of vertical descent back to some hypothetical last universal common ancestor." (See this article.)

It's obvious (to me, at least) that every rain event carries with it the potential to cause far more gene transfers than are necessary (according to Doolittle) to make vertical inheritance fade into insignificance as an evolutionary bringer of change. The hydrogen peroxide in rain has been driving lateral gene transfer in bacteria for eons. In fact, it is arguably the dominant driver of evolution in bacteria.

Sorry, Mr. Darwin. Point mutations handed down to sons and daughters just aren't cutting it.


Saturday, May 11, 2013

DNA G+C Content and Survival Value

Monday, May 06, 2013

Hydrogen Peroxide Powers Evolution
