blogorrhea: GC content

Showing posts with label GC content. Show all posts

Friday, May 30, 2014

Comparative Genomics Using the Canvas API

An interesting and somewhat mysterious aspect of biodiversity is that the relative proportions of the bases in DNA can take on wildly different values in different organisms even though they're making many of the same proteins. I'm referring to the fact that the G+C (guanine plus cytosine) content of DNA can vary from more than 70% (e.g., Streptomyces species) to less than 30% for certain bacterial endosymbionts (and even for some free-living bacteria, such as Clostridium botulinum). Remarkably, the DNA of the tiny bacterium Buchnera aphidicola (which is distantly related to E. coli but entered into a symbiotic partnership with the aphid around 200 million years ago) has a GC content of only 26%, making its DNA look almost like a two-letter code (A and T, with the occasional G or C).

Many genome-reduced endosymbionts have lost most of their DNA repair enzymes (this is true of mitochondria, incidentally, which are thought to have arisen from a symbiosis with an ancient ancestor of today's Alphabproteobacteria), and this loss of repair capability could well explain much of the GC-to-AT shift seen in endosymbiont genomes. (Left on its own, DNA tends to accumulate 8-oxo-guanine residues, which incorrectly pair with thymine and result in GC-to-TA transversions at DNA replication time.) Whatever the cause(s), GC reduction has left many organisms with severely A- and T-enriched DNA. But the organisms in question are still able to encode perfectly functional proteins, even with a limited nucleic-acid vocabulary.

I decided it might be interesting to try to visualize "GC diversity" by creating a heat map of G and C (in hot colors) and A and T (in cool colors) for certain genes that occur across all bacteria. For example, the following graphic is a heat map of GC and AT usage in the gene for thymidine kinase as it occurs in 61 different organisms with genomic G+C ranging from just over 70% to just under 30%.

Thymidine kinase gene for 61 bacteria of widely varying genomic GC percentages. Guanine and cytosine are represented by hot colors, adenine and thymine by cool colors. Alignments were done with MEGA6 software via ClustalW using a relatively permissive gap-opening penalty of 9 and a gap-extension penalty of 5. The heat map was created with JavaScript using the Canvas API.

To create this map, I obtained DNA sequences for thymidine kinase from 61 organisms (using the excellent online tools at UniProt.org and genomevolution.org), then aligned them in MEGA6 and drew colors (red or red-orange for G or C, blue or blue-green for A or T) corresponding to the bases, using the Canvas API. What you're looking at are 61 rows of data (one row per gene; which is also to say, per organism). Gaps created during alignment are shown in grey.

Several things are apparent from this graphic, aside from the obvious fact that GC usage (indicated by red and orange) tends to be high in the genes for organisms like Brachybacterium faecium (top line, GC 72%) and low for bottom-of-the-chart organisms like Clostridium perfringens B strain ATCC 3626 (28.7% GC) and Ureaplasma urealyticum (25.9%). First, high-GC genes tend to be somewhat longer. (Indeed, in the upper right you can see that the longest genes opened up a sizable alignment gap that extends down the whole graphic.) Also, the genes differ substantially in their leader and trailer sequences, although I think what we're really seeing here is (at least in part) inaccurate annotation of start and stop codons. What's interesting to me is the way certain GC regions trail all the way down to the bottom of the graph while others fade to blue. I think it could be argued that the nucleotides represented by the red blips in the final few lines of the graph, at the bottom, are positions in the gene that are under strong functional constraints. It would be interesting to test those positions for evidence of selection pressure. It could be argued that all areas of low selection pressure have turned blue by the time you get to the bottom of the graph.

I think it's also interesting that several red-orange clusters just to the right of the midpoint, about three-quarters of the way down the graph, all but disappear just as a new red-orange zone begins to appear on the left around 80% of the way down. It's as if certain GC-to-AT mutations in one protein domain led to AT-to-GC mutations in another domain upstream. (The 5' side of the graph is on the left, 3' on the right.)

Getting this many genes (from such divergent sources) to align is not easy. You can do it, though, by setting the gap-open penalty in ClustalW to a low value and aligning genes in small batches, row by row, if need be.

As a technical aside: I first tried creating this graph using SVG (Scalable Vector Graphics), but the burden of creating a separate DOM node for every pixel (which is what it amounts to) was way too much for the browser to handle (Firefox choked, as did Chrome), so I quickly switched to the Canvas API, which puts no heavy DOM burdens on the browser and can convert a FASTA-formatted alignment file to a nice picture in about two seconds.

For what it's worth, here are the names of the organisms whose genes appeared in the above graphic, arranged in order of GC content of the kinase gene only (not whole-genome GC). High-GC organisms are listed first:

Blastococcus saxobsidens strain DD2
Geodermatophilus obscurus strain DSM 43160
Streptomyces cf. griseus strain XylebKG-1
Streptosporangium roseum strain DSM 43021
Kribbella flavida strain DSM 17836
Brachybacterium faecium strain DSM 4810
Deinococcus radiodurans strain R1
Gordonia bronchialis strain DSM 43247
Mesorhizobium australicum strain WSM2073
Turneriella parva strain DSM 21527
Mesorhizobium ciceri biovar biserrulae strain WSM1271
Propionibacterium acnes TypeIA2 strain P.acn33
Novosphingobium aromaticivorans strain DSM 12444
Geobacillus kaustophilus strain HTA426
Geobacillus thermoleovorans strain CCB_US3_UF5
Halogeometricum borinquense DSM 11551
Alistipes finegoldii strain DSM 17242
Agrobacterium radiobacter strain K84
Rhizobium tropici strain CIAT 899
Aeromonas hydrophila strain ML09-119
Rhodopirellula baltica SH strain 1
Porphyromonas gingivalis strain ATCC 33277
Dyadobacter fermentans strain DSM 18053
Enterobacter cloacae strain SCF1
Paenibacillus polymyxa strain M1
Parabacteroides distasonis strain ATCC 8503
Bacillus amyloliquefaciens strain Y2
Vibrio cholerae strain BX 330286
Erwinia amylovora strain ATCC 49946
Bacillus subtilis BEST7613 strain PCC 6803
Gramella forsetii strain KT0803
Prevotella copri strain DSM 18205
Anaerolinea thermophila strain UNI-1
Klebsiella oxytoca strain 10-5243
Bacteroides ovatus strain 3_8_47FAA
Aggregatibacter phage S1249
Escherichia coli B strain REL606
Shigella boydii strain Sb227
Psychroflexus torquis strain ATCC 700755
Bacteroides dorei strain 5_1_36/D4
Leuconostoc gasicomitatum LMG 18811 strain type LMG 18811
Mycoplasma gallisepticum strain F
Myroides odoratimimus strain CIP 101113
Oceanobacillus kimchii strain X50
Chryseobacterium gleum strain ATCC 35910
Carboxydothermus hydrogenoformans strain Z-2901
Yersinia pestis D106004
Bacillus thuringiensis serovar andalousiensis strain BGSC 4AW1
Bacillus anthracis strain CDC 684
Coprobacillus sp. strain 8_2_54BFAA
Proteus mirabilis strain HI4320
Staphylococcus aureus strain 04-02981
Aerococcus urinae strain ACS-120-V-Col10a
Lactobacillus acidophilus strain 30SC
Lactobacillus reuteri strain MM4-1A
Lactococcus lactis subsp. cremoris strain A76
Haemophilus ducreyi strain 35000HP
Ureaplasma urealyticum serovar 5 strain ATCC 27817
Clostridium botulinum A strain Hall
Streptococcus agalactiae strain 2603V/R
Clostridium perfringens B strain ATCC 3626

Saturday, May 11, 2013

DNA G+C Content and Survival Value

One of biology's big open questions is why organisms differ so much with regard to the relative amounts of GC and AT in their DNA. You'd think that if there are only two kinds of DNA base pairs (see diagram) they'd be more-or-less equally abundant. Not so. There are organisms with DNA that's mostly GC (and/or CG) pairs; there are organisms with very-AT-rich DNA; and within the chromosomes of higher organisms you find large GC-rich regions (isochores) in the midst of great swaths of AT-rich DNA.

DNA contains adenine and thymine in equal amounts, and
guanine and cytosine in equal amounts, but it does not
usually contain GC pairs and AT pairs in equal amounts. And
it doesn't seem as if there is an "optimum" GC:AT ratio. The
GC:AT ratio varies by species. Within a species, it's constant.

There are two really odd facts at work here:

1. The GC content of DNA varies by species, and it varies a lot.

2. Evolution doesn't seem to trend toward an "optimum CG:AT ratio" of any kind.

If there were such thing as an optimum GC:AT ratio for DNA, surely microorganisms would've figured it out by now. Instead, we find huge diversity: There are bacteria on every point in the GC% spectrum, running from 16% GC for the DNA of Candidatus Carsonella ruddii (a symbiont of the jumping plant louse) to 75% for Anaeromyxobacter dehalogenans 2CP-C (a soil bacterium). At each end of the spectrum you find aerobes and anaerobes; extremophiles and blandophiles; pathogens and non-pathogens. About the only generalization you can make is that the smaller an organism's genome is, the more likely it is to be rich in A+T (low GC%).

Genome size correlates loosely with GC content. The very smallest
bacteria tend to have AT-rich (low GC%) DNA.

The huge diversity in GC:AT ratios among bacteria is impressive. But does it simply represent a random walk all over the possibility-space of DNA? Or do the various points on the spectrum constitute special niches with important advantages? What advantage could there be for having high-GC% DNA? Or high-AT% DNA?

Some subtle clues tell us that this is not just random deviation from the mean. First, suppose we agree for sake of argument that lateral gene transfer (LGT) is common in the microbial world (a point of view I happen to agree with). Over the course of millions of years, with pieces of DNA of all kinds (high GC%, low GC%) flying back and forth, LGT should force a regression to the mean: It should make genomes tend toward a 50-50 GC:AT ratio. That clearly hasn't happened.

And then there's ordinary mutational pressures. It's beginning to be fairly well accepted (see Hershberg and Petrov, "Evidence That Mutation is Universally Biased Toward AT in Bacteria," PLoS Genetics, 2010, 6:9, e1001115, full version here) that natural mutation is strongly biased in the direction of AT by virtue of the fact that deamination of cytosine and methylcytosine (which occurs spontaneously at high frequency) leads to replacement of 'C' with 'T', hence GC pairs becoming AT pairs. The strong natural mutational bias toward AT says that all DNA should creep in the direction of low GC% and end up well below 50% GC. But again, this is not what we see. We see that high-GC organisms like Anaeromyxobacter (and many others) maintain their DNA's unusually high (75%) GC content across millions of generations. Even middle-of-the-road organisms like E. coli (with 50% GC content) don't slowly slip in the direction of high-AT/low-GC.

Clearly, something funny is going on. For a super-high-GC organism like Anaeromyxobacter to maintain its DNA's super-high GC content against the constant tug of mutations in the AT direction, it must be putting significant energy into maintaining that high GC percentage. But why? Why pay extra to maintain a high GC%? And how does the cost get paid?

I think I've come up with a possible answer. It has to do with DNA replication cost, where "cost" is figured in terms of time needed to synthesize a new copy of the DNA (for cell division). Anything that favors low replication cost (high replication speed) should favor survival; that's my main assumption.

My other assumption is that DNA polymerases (the enzymes involved in replication) are not clairvoyant. They can't know, until the need arises, which of the four deoxyribonucleotide triphosphates (dATP, dTTP, dGTP, dCTP) will be needed at a given moment, to elongate the new strand of DNA. When the need arises for (let's say) an 'A', the 'A' (in the form of dATP) has to come from an existing endogenous pool of dNTPs containing all four bases (dATP, dTTP, dGTP, dCTP) in whatever concentrations they're in. The enzyme has to wait until a dATP (if that's what's needed) randomly happens to lock into the active site. Odds are only one in four (assuming equal concentrations of dNTPs) of a dATP coming along at exactly the right moment. Odds are 3 out of 4 that some incorrect dNTP (either dGTP, dTTP, or dCTP) will try, and fail, to fit the active site first, before dATP comes along.

But imagine that your DNA is 75% G+C. And suppose you've regulated your intracellular metabolism to maintain dGTP and dCTP in a 3:1 ratio over dATP and dTTP. The odds of a good random "first hit" go up.

To simulate the various possibilities, I wrote software (in JavaScript) that simulates DNA replication, where the template DNA molecule is 1000 base-pairs in length and the dNTP pool size is 10000 bases. The software allows you to set the organism's genome GC% to whatever you want, and also set the dNTP pool's relative GC percentage to whatever you want. The template DNA is just a random string of A, T, G, and C bases (1000 total), reflecting their relative abundances as set in the GC% parameter. The pool of dNTPs is set up to be a randomized array (again reflecting abundances set in a GC% parameter).

The way the software works is this. Read a base off the template. Fetch a base randomly from the base pool. If the base happens to be the one (out of four) that's called for, score '1' for the timing parameter, and continue to read another base off the template. If the base was not the one that's called for, put it back in the pool array in a random location, then randomly fetch another base from the pool; and increment the timing parameter. (For each fetch, the timing parameter goes up by 1.) Keep fetching (and throwing back bases) until the proper base comes up, incrementing the time parameter as appropriate. (The time parameter keeps track of the number of fetch attempts.) When the correct base turns up, the pool shrinks by one base. In other words, replication consumes the pool, but as I said earlier, the pool contains ten times as many bases (to start) as the DNA template. So the pool ends up 10% smaller at the end of replication.

Each point on this graph represents the average of 100 Monte Carlo runs, each run representing complete replication of a 1000-bp DNA template, drawing from a pool of 10,000 bases. The blue points are runs that used a DNA template containing 25% G+C content. The red points are runs that used DNA with 75% G+C. The X-axis represents different base-pool compositions. See text for details. Click for larger image.

I ran Monte Carlo simulations for DNA templates having GC contents of 75%, 50%, and 25%, using base pools set up to have anywhere from 15% GC to 85% (in 2.5% increments). The results for the 75% GC and 25% GC templates (representing high- and low-GC organisms) are shown in the above graph. Each point on the graph represents the average of 100 complete replication runs. The Y-axis shows the average number of fetches per DNA base (so, a low value means fast replication; a high value means slower DNA replication). The X-axis shows the percentage of GC in the base-pool, in recognition of the fact that relative dNTP abundances in an organism may vary, in accordance with environmental constraints as well as with organism-specific homeostatic setpoints.

Maximal replication speed (the low point of each curve) happens at a base-pool GC percentage that is displaced in the direction of the DNA's own GC%. So, for the 25%-GC organism (blue data points), max replication efficiency comes when the base-pool is about 33% GC. For the 75% GC organism (red points) the sweet spot is at a base-pool GC concentration of 65%. (Why this is not exactly symmetrical with the other curve, I don't know; but bear in mind, these are Monte Carlo runs. Some variation is to be expected.)

The interesting thing to note is that max replication efficiency, for each organism, comes at 3.73 fetches per base-pair (Y-axis). Cache that thought. It'll be important in a minute.

The real jaw-dropper is what happens when you plot a curve for template DNA with 50% GC content. In the graph below, I've shown the 50%-GC runs as black points. (The red and blue points are exactly as before.)

This is the same graph as before, but with replication data for a 50%-GC genome (black points). Again, each data point represents the average of 100 Monte Carlo runs. Notice that the black curve bottoms out at a higher level (4.0) than the red or blue curves (3.73). This means replication is less efficient for the 50%-GC genome.

Notice that the best replication efficiency comes in the middle of the graph (no big surprise), but check the Y-value: 4.00. The very fastest DNA replication, when the DNA template is 50% GC, requires 4 fetches per base, compared to best-case base-fetching efficiency of 3.73 for the 25%-GC and 75%-GC DNAs.What does this mean? It means DNA replication, in a best-case scenario, is 6.75% more efficient for the skewed-GC organisms. (The difference between 3.73 and 4.00 is 6.75%.)

This goes a long way toward explaining why GC extremism is stable in organisms that pursue it. There is replication efficiency to be had in keeping your DNA biased toward high or low GC. (It doesn't seem to matter which.)

Consider the dynamics of an ATP drawdown. The energy economy of a cell revolves around ATP, which is both an energy molecule and a source for the adenine that goes into DNA and RNA. One would expect normal endogenous concentrations of ATP to be high relative to other NTPs. For a low-GC% organism, that's also a near-ideal situation for DNA replication, because high AT in the base pool puts you near the max-replication-speed part of the curve (see blue points). A sudden drawdown in ATP (when the cell is in crisis) shifts replication speed to the right-hand part of the blue curve, slowing replication significantly. This is what you want if you're an intracellular symbiont (or a mitochondrion, incidentally). You want to stop dividing when the host cell is unable to divide because of an energy crisis.

Consider the high-GC organism (red dots), on the other hand. If ATP levels are high during normal metabolism, replication is not as efficient as it could be, but so what? It just means you're willing to tolerate less-efficient replication in good times. But as ATP draws down (perhaps because nutrients are becoming scarce), DNA replication actually becomes more efficient. This is what you want if you're a free-living organism in the wild. You want to be able to continue replicating your DNA even as ATP becomes scarce. And indeed that's what happens (according to the red data points): As the base pool becomes more GC-rich, replication efficiency increases. The best efficiency comes when base-pool A+T is down around 35%.

I think these simulations are meaningful and I think they help explain the DNA-composition extremism seen among microorganisms. If you're a professional scientist and you find these results tantalizing, and you'd like to co-author a paper for PLoS Genetics (or another journal), please get in touch. (My Google mail is kas-dot-e-dot-thomas.) I'd like to coauthor with someone who is good with statistics, who can contribute more ideas to this line of investigation. I think these results are worth sharing with the scientific community at large.