blogorrhea: comparative genomics

Showing posts with label comparative genomics. Show all posts

Monday, July 07, 2014

Complementing Codons: A Riddle Solved?

For some time now, I've been puzzling over a fairly big riddle, and I think an answer is becoming clear.

The riddle is: Why, in so many organisms, do codons turn up at a rate approximately equal to the rate of usage of reverse-complement codons? Take a good look at the symmetry of the following graph (of codon usage rates in Frankia alni, a bacterium that causes nitrogen-fixing nodules to appear on the roots of alder plants).

Codon usage in Frankia alni. Notice that a given codon's usage corresponds, roughly, to the rate of usage of the corresponding reverse-complement codon.

This graph of codon freequencies in F. alni shows the strange correspondence (which I've commented on before) between codons and their reverse complements. If GGC occurs at a high frequency (which it does, in this organism's protein-coding genes), the reverse-complement codon GCC is also high in frequency. If a codon (say TAA) is low, its reverse complement (TTA) is also low.

I've seen this relationship in many organisms (hundreds, by now); too often to be by chance. The question is why codons so often occur in direct proportion to the rate of occurrence of corresponding reverse complements. It doesn't make sense. The notion of base pairing should not come into play when an organism (or natural selection) chooses codons, because all of a protein gene's codons are collinear, on one and the same strand of DNA; base-pairing rules do not play a role in choosing codons.

Or do they?

I think, in fact, base-pairing does a play a role. The answer is obvious, when you think about it. We know that (single-stranded) RNA, if properly constructed, will fold back on itself to form loops and stems: complementary regions will base-pair with each other. Certainly, if secondary structure in mRNA is widespread, it will have consequences for codon selection. Codons in "stem" regions will complement each other.

And so it's fairly obvious, it seems to me, that a reasonable explanation for the riddle of "reverse complement codon selection" is that secondary structure of mRNA (or possibly single-stranded DNA) is far more pervasive than any of us might have suspected. It's pervasive enough to affect codon usage in the way shown in the graph above.

Is there any evidence that secondary structure is widespread? I think there is. If you go looking for complementary sequences inside protein-coding genes in F. alni, for example, you find many. As a probe, I had a script check for intragenic complementing length-12 sequences ("12-mers") in all 6,711 protein-coding genes of F. alni. (I presented pseudocode for the script in an earlier post.) Based on the known base-composition stats of the organism, I expected to find 5,440 such 12-mer pairs by chance. What I found was 6,319 such pairs located in 2,689 genes. (When I looked for complementing 13-mers, I expected to find 1,467 occurring by chance, but instead found 3,592 such pairs in 2,086 genes.) In a previous post, I showed similar results for Sorangium cellulosum (a bacterium with an enormous genome). Previous to that, I showed similar results for Mycoplasma genitalium (which has one of the tiniest genomes of any free-living microbe).

But do these regions of internal complementarity affect codon choice? Indeed they do. When I looked at the top 40% of F. alni genes in terms of the number of internal complementing 12-mers, I found a Pearson correlation between codons and reverse-complement codons of 0.889. Looking at the bottom 60% of genes, I found the correlation to be lower: 0.766. These numbers, moreover, were virtually unchanged (0.888 and 0.763) when I re-calculated the Pearson coefficients using expectation-adjusted codon frequencies. That is to say, I used base composition stats to "predict" the frequencies of each codon, then I subtracted the predicted number from the actual number, for each codon. (Example: The frequency of occurrence of guanine, in F. alni protein genes, is 0.35794, and the frequency of cytosine is 0.37230, hence the expected frequency of GCC is 0.35794 * 0.37230 * 0.37230, or 0.04961. The actual frequency is 0.07802.) The correlation still existed, practically unchanged, after adjusting for expected rates of occurrence of codons.

The bottom line is that the correlation between the frequency of occurrence of a given codon and the frequency of its reverse-complement codon, which is otherwise very hard to explain, is quite readily explained by the presence, in protein-coding genes, of a significant amount of single-strand complementarity (of the type that could be expected to give rise to secondary structure in mRNA). On this basis, it's reasonable to suppose that conserved secondary structure is actually a major driver of codon usage bias.

Please show this post to your biogeek friends; thanks!

Thursday, June 05, 2014

A Manganese Catalase Fusion Protein

In science, it often happens that finding the answer to a particular mystery only leads to further questions. That's certainly the case with the non-heme/manganese-based catalases in bacteria (which I talked about in a previous post). Regardless of origin, catalases faclitate the breakdown of hydrogen peroxide to water and oxygen. The nearly universal heme-based catalase (found in almost every living thing) comes in large-subunit and small-subunit varieties but is always fairly big (726 amino acids, in E. coli). Manganese catalases, by contrast, are always fairly small: around 276 amino acids. But there's one group of manganese catalases that comes in at 416 to 431 amino acids, more than 50% larger than the "typical" manganese catalase. I wanted to know why. Why are these catalases bigger?

Finding the answer wasn't hard (I got lucky). But the answer only leads to more questions.

It turns out very few organisms manufacture the "big" manganese catalase. The organisms in question belong to just 3 genera: Rhizobium, Bradyrhizobium, and Rhodopseudomonas. (The closely related Mesorhizobium has a Mn catalase, but it's the "normal" size Mn catalase. Meanwhile, its cousin, Sinorhizobium, has no Mn catalase.)

The first 280 or so amino acids of the Mn catalases made by Rhizobium, Bradyrhizobium, and Rhodopseudomonas align quite well with the normal-size Mn catalases made by other organisms, the only difference being the 140-amino-acid trailer on the end of the "long versions." I used the alignment editor in Mega6 (tantamount to Notepad) to Cut the trailer portion out of one of the sequences and Paste it into the BLAST search field at UniProt.org. A search against the protein database revealed something quite interesting: The trailer portion of the Bradyrhizobium Mn catalase is a 45%-identity match for the enteric-bacteria yciF gene (which is quite a good match considering the phylogenetic distance between E. coli and Bradyrhizobium).

The E. coli yciF gene (top) is a 58% match for the Bradyrhizobium yciF gene (middle), which in turn is a 64% match for the trailer portion of katN (manganese catalase gene) of Bradyrhizobium, bottom.

Further investigation left little doubt that the "big" Mn catalases are indeed fusion proteins: yciF fused to the C-terminal end of the Mn catalase.

But: What in the world is yciF? It turns out to be a very widely distributed, highly conserved protein of uncertain function. The protein has been purified and its crystal structure determined at 2.0 Å resolution, but its function is still uncertain. What we do know is that it is stress-inducible (along with other yci-series genes) and seems to bind a metal ligand, probably iron; and it shares structural features with rubrerythrin, a non-heme iron protein implicated in oxidative stress protection in anaerobic bacteria and archaea. In E. coli strains that have Mn catalase, the yciF gene occurs two genes upstream (on the 5' side) of the catalase (katN) gene.

Because yciF is more widely distributed (and more highly conserved) than manganese catalase, and because most Mn catalase producers (including those with the fusion enzyme) have an additional, separate copy of yciF, it seems likely (to me, anyway) that the fusion protein was created by chance in the common ancestor of the Rhizobiales when the original katN gene was laid down by a phage or other mobile genetic element. (In E. coli, katN often occurs near phage genes.) Sinorhizobium lost the combo gene entirely, while Mesorhizobium (which makes a small Mn catalase) either obtained katN on its own or lost the 3' trailer from its fusion protein over time.

Since Bradyrhizobium (and the others) already have a separate yciF gene, it's a mystery why the trailer portion of the fusion gene continues to exist. It might very well provide a favorable enhancement of katN function somehow (maybe exploiting iron in an auxiliary catalytic center). If the trailer's doing nothing useful, it should have disappeared over time. (Maybe it did disappear from other Rhizobiales members, and just hasn't disappeared yet in the three genera that still make the "big" enzyme.) I have a feeling the trailer piece does do something useful. Unfortunately, no one has characterized the Braydrhizobium Mn catalase yet, experimentally. We'll probably have to wait until that happens to find out what the "big" enzyme is capable of.

Tuesday, June 03, 2014

Odd Structures in a Prophage Genome

Contrary to what it looks like, this is not an arrival-gate diagram for an eastern European airport. It's a diagram of secondary structure of a portion of the mRNA for the yoqJ gene in B. subtilis phage SPBc2. Click to enlarge.

The common soil bacterium Bacillus subtilis has long been known to harbor an inducible prophage (virus) called SPBc2. Normally, SPBc2 is dormant in the Bacillus genome, but heat or treatment with mitomycin C will cause the phage to express itself and start a lytic infection cycle. Like many bacterial viruses, phage SPBc2 has a sizable double-stranded DNA genome, consisting (in this case) of 134,416 base pairs encoding 185 genes. It's one of the strangest collection of genes you'll ever want to meet.

The SPBc2 genome is strange, first of all, in its base composition. The following stats come from analysis of the coding regions of the genome:

Base	Abundance
A	0.3670
T	0.2825
G	0.1983
C	0.1506

Notice how the purines, A and G (adenine and guanine) are about 30% more abundant than the pyrimidines, C and T (cyotsine and thymine). This is a stark violation of Chargaff's Second Parity Rule (which says A=T and G=C not only for double-stranded DNA but for a single strand, as well).

When purine abundance is tallied on a per-gene basis, you get the following histogram:

Purine abundance for N=185 protein-coding genes in B. subtilis phage SPBc2. Only 8 genes contain less than 50% purine bases.

All but 8 (out of 185) genes have a message-strand purine content above 50%.

And if you look at the gene sequence data, the genes even look funny to the naked eye, containing, as they do, long runs of bases, with funny repeats, like TTGAAAGGAAAAAAAGACGGCCTAAATAAA. One gets the feeling, when examining the sequence data, that one is looking at tRNA or rRNA data rather than protein-coding sequences. And yet, they really are protein-coding sequences.

But it turns out there's a lot of secondary-structure info in the sequences. (See the picture at the top of this post.) Many of the genes contain self-complementing intragenic regions. I decided to verify this with some custom scripts. First, I had scripts look inside each gene for length-10 nucleotide sequences with a matching reverse-complement sequence further downstream (in the same gene). I expected to find one or two (or ten) hits, based on the fact that any given length-10 sequence of the bases A, G, C, and T can turn up at random in about one in every million base pairs. (Each base is worth about two bits, so a length-10 DNA sequence can encode as much info as a length-20 binary sequence; 2-to-the-20th is a little over a million.) What I found was an astounding 419 complementary length-10 sequences within 89 genes.

When I upped the sequence length to 12, I still found 64 intragenic complement sequences in 38 genes. A length-12 match should happen randomly about once every 16 million base pairs. The SPBc2 genome (as I said) is 134,416 base pairs long.

The high degree of internal complementarity means that when the phage's DNA strands are separated during transcription or replication, they probably assume very particular 3-dimensional structures, and also, the messenger RNA made from the phage's genes are probably rich in secondary structure. Exactly why those structures are needed is anyone's guess. They may protect the virus from restriction nucleases (although frankly this chore is already taken care of by the prophage's own methylases). Or they may attract ribosomes in some special way (many of the genes form tRNA-lookalike structures). They may help with virion packaging. Or they may do nothing special at all. I kind of doubt the latter possibility, though.

Sunday, June 01, 2014

A Gene Heatmap

Lately I've been using the great tools at genomevolution.org plus custom Canvas API scripts to render colorful heatmaps of aligned genes from phylogeneticaly diverse microorganisms. The following graphic is one such.

Glyceraldehyde-3-phosphate dehydrogenase genes of N=134 bacterial species, arranged in order of gene G+C content (high GC at the top). Hot colors are G and C. Cool colors are A and T.

What are we looking at? This is actually a composite rendering of the glyceraldehyde-3-phosphate dehydrogenase genes (DNA sequence info) from 134 bacterial species. Each gene is painted left-to-right (5' to 3') in a strip 4 pixels tall, with hot colors assigned to DNA bases G and C (guanine and cytosine), and cool colors assigned to bases A and T (adenine and thymine). Wherever there's a G or C, it gets painted red or red-orange. Wherever there's A or T, blue or blue-green. Same gene, 134 versions, varying significantly in G+C content. (The gene GC content ranges from a maximum of 69.2% at the top to 29.4% at the bottom.)

Why glyceraldehyde-3-phosphate dehydrogenase (GAPDH)? No real reason, except that it's a fairly universal (indeed, quite ancient) metabolic enzyme, reasonably compact (making possible a rendering that's not super-wide, as it would be for a larger gene), well-delineated genetically (not a fusion protein or an enzyme with multiple isoforms), and probably representative of a good many core metabolic enzymes. This is the enzyme that catalyzes the sixth step of glycolysis (sugar-breakdown). You may recall from Biochem 101 that the breakdown of glucose proceeds by splitting the twice phosphorylated molecule into two 3-carbon pieces. The triose phosphates in turn get phosphorylated by GAPDH before they transfer a phosphate to ADP to yield ATP, the 5-hour energy drink of all cells everywhere.

Alignment of genes was done via ClustalW in MEGA6 freeware. Rendering of the alignment FASTA file took about two seconds, in the browser, using 133 lines of custom JavaScript.

Friday, May 30, 2014

Comparative Genomics Using the Canvas API

An interesting and somewhat mysterious aspect of biodiversity is that the relative proportions of the bases in DNA can take on wildly different values in different organisms even though they're making many of the same proteins. I'm referring to the fact that the G+C (guanine plus cytosine) content of DNA can vary from more than 70% (e.g., Streptomyces species) to less than 30% for certain bacterial endosymbionts (and even for some free-living bacteria, such as Clostridium botulinum). Remarkably, the DNA of the tiny bacterium Buchnera aphidicola (which is distantly related to E. coli but entered into a symbiotic partnership with the aphid around 200 million years ago) has a GC content of only 26%, making its DNA look almost like a two-letter code (A and T, with the occasional G or C).

Many genome-reduced endosymbionts have lost most of their DNA repair enzymes (this is true of mitochondria, incidentally, which are thought to have arisen from a symbiosis with an ancient ancestor of today's Alphabproteobacteria), and this loss of repair capability could well explain much of the GC-to-AT shift seen in endosymbiont genomes. (Left on its own, DNA tends to accumulate 8-oxo-guanine residues, which incorrectly pair with thymine and result in GC-to-TA transversions at DNA replication time.) Whatever the cause(s), GC reduction has left many organisms with severely A- and T-enriched DNA. But the organisms in question are still able to encode perfectly functional proteins, even with a limited nucleic-acid vocabulary.

I decided it might be interesting to try to visualize "GC diversity" by creating a heat map of G and C (in hot colors) and A and T (in cool colors) for certain genes that occur across all bacteria. For example, the following graphic is a heat map of GC and AT usage in the gene for thymidine kinase as it occurs in 61 different organisms with genomic G+C ranging from just over 70% to just under 30%.

Thymidine kinase gene for 61 bacteria of widely varying genomic GC percentages. Guanine and cytosine are represented by hot colors, adenine and thymine by cool colors. Alignments were done with MEGA6 software via ClustalW using a relatively permissive gap-opening penalty of 9 and a gap-extension penalty of 5. The heat map was created with JavaScript using the Canvas API.

To create this map, I obtained DNA sequences for thymidine kinase from 61 organisms (using the excellent online tools at UniProt.org and genomevolution.org), then aligned them in MEGA6 and drew colors (red or red-orange for G or C, blue or blue-green for A or T) corresponding to the bases, using the Canvas API. What you're looking at are 61 rows of data (one row per gene; which is also to say, per organism). Gaps created during alignment are shown in grey.

Several things are apparent from this graphic, aside from the obvious fact that GC usage (indicated by red and orange) tends to be high in the genes for organisms like Brachybacterium faecium (top line, GC 72%) and low for bottom-of-the-chart organisms like Clostridium perfringens B strain ATCC 3626 (28.7% GC) and Ureaplasma urealyticum (25.9%). First, high-GC genes tend to be somewhat longer. (Indeed, in the upper right you can see that the longest genes opened up a sizable alignment gap that extends down the whole graphic.) Also, the genes differ substantially in their leader and trailer sequences, although I think what we're really seeing here is (at least in part) inaccurate annotation of start and stop codons. What's interesting to me is the way certain GC regions trail all the way down to the bottom of the graph while others fade to blue. I think it could be argued that the nucleotides represented by the red blips in the final few lines of the graph, at the bottom, are positions in the gene that are under strong functional constraints. It would be interesting to test those positions for evidence of selection pressure. It could be argued that all areas of low selection pressure have turned blue by the time you get to the bottom of the graph.

I think it's also interesting that several red-orange clusters just to the right of the midpoint, about three-quarters of the way down the graph, all but disappear just as a new red-orange zone begins to appear on the left around 80% of the way down. It's as if certain GC-to-AT mutations in one protein domain led to AT-to-GC mutations in another domain upstream. (The 5' side of the graph is on the left, 3' on the right.)

Getting this many genes (from such divergent sources) to align is not easy. You can do it, though, by setting the gap-open penalty in ClustalW to a low value and aligning genes in small batches, row by row, if need be.

As a technical aside: I first tried creating this graph using SVG (Scalable Vector Graphics), but the burden of creating a separate DOM node for every pixel (which is what it amounts to) was way too much for the browser to handle (Firefox choked, as did Chrome), so I quickly switched to the Canvas API, which puts no heavy DOM burdens on the browser and can convert a FASTA-formatted alignment file to a nice picture in about two seconds.

For what it's worth, here are the names of the organisms whose genes appeared in the above graphic, arranged in order of GC content of the kinase gene only (not whole-genome GC). High-GC organisms are listed first:

Blastococcus saxobsidens strain DD2
Geodermatophilus obscurus strain DSM 43160
Streptomyces cf. griseus strain XylebKG-1
Streptosporangium roseum strain DSM 43021
Kribbella flavida strain DSM 17836
Brachybacterium faecium strain DSM 4810
Deinococcus radiodurans strain R1
Gordonia bronchialis strain DSM 43247
Mesorhizobium australicum strain WSM2073
Turneriella parva strain DSM 21527
Mesorhizobium ciceri biovar biserrulae strain WSM1271
Propionibacterium acnes TypeIA2 strain P.acn33
Novosphingobium aromaticivorans strain DSM 12444
Geobacillus kaustophilus strain HTA426
Geobacillus thermoleovorans strain CCB_US3_UF5
Halogeometricum borinquense DSM 11551
Alistipes finegoldii strain DSM 17242
Agrobacterium radiobacter strain K84
Rhizobium tropici strain CIAT 899
Aeromonas hydrophila strain ML09-119
Rhodopirellula baltica SH strain 1
Porphyromonas gingivalis strain ATCC 33277
Dyadobacter fermentans strain DSM 18053
Enterobacter cloacae strain SCF1
Paenibacillus polymyxa strain M1
Parabacteroides distasonis strain ATCC 8503
Bacillus amyloliquefaciens strain Y2
Vibrio cholerae strain BX 330286
Erwinia amylovora strain ATCC 49946
Bacillus subtilis BEST7613 strain PCC 6803
Gramella forsetii strain KT0803
Prevotella copri strain DSM 18205
Anaerolinea thermophila strain UNI-1
Klebsiella oxytoca strain 10-5243
Bacteroides ovatus strain 3_8_47FAA
Aggregatibacter phage S1249
Escherichia coli B strain REL606
Shigella boydii strain Sb227
Psychroflexus torquis strain ATCC 700755
Bacteroides dorei strain 5_1_36/D4
Leuconostoc gasicomitatum LMG 18811 strain type LMG 18811
Mycoplasma gallisepticum strain F
Myroides odoratimimus strain CIP 101113
Oceanobacillus kimchii strain X50
Chryseobacterium gleum strain ATCC 35910
Carboxydothermus hydrogenoformans strain Z-2901
Yersinia pestis D106004
Bacillus thuringiensis serovar andalousiensis strain BGSC 4AW1
Bacillus anthracis strain CDC 684
Coprobacillus sp. strain 8_2_54BFAA
Proteus mirabilis strain HI4320
Staphylococcus aureus strain 04-02981
Aerococcus urinae strain ACS-120-V-Col10a
Lactobacillus acidophilus strain 30SC
Lactobacillus reuteri strain MM4-1A
Lactococcus lactis subsp. cremoris strain A76
Haemophilus ducreyi strain 35000HP
Ureaplasma urealyticum serovar 5 strain ATCC 27817
Clostridium botulinum A strain Hall
Streptococcus agalactiae strain 2603V/R
Clostridium perfringens B strain ATCC 3626

Wednesday, May 14, 2014

Where Do Virus Genes Come From?

There's a memorable moment in War of the Worlds (Spielberg version) when the Tom Cruise character tries to explain to his son, while driving a minivan, that they have to leave town immediately because evil, marauding machines "from somewhere else" are on the rampage. Robbie (the son) says: "What, you mean, like, from Europe?"

That's the image that comes to mind when I try to explain where virus genes come from. They don't come from the mother ship (the host). Oh sure, in some cases they clearly do derive from host genes. But in most cases, they clearly don't. The overwhelming majority of viral genes have no counterparts in host cells, and even for those that do, the genes in question are rarely true host orthologues.

"No, Robbie, not like Europe."

Where do they come from, then?

Short answer: Somewhere else.

Using a program like the excellent (and free) Mega6 ("Molecular Evolutionary Genetic Analysis") you can easily create phylogenetic trees from genetic data (FASTA sequences, protein or DNA), and when you do this for genes that occur in both viruses and host cells, you can see how they separate in phylo-space.

Example: I decided to look at the gene for thymidine kinase, which occurs in most living things plus a certain number of undead, maybe not-quite-living (in the usual sense) things known as viruses. Thymine (T) is, of course, an essential ingredient of DNA. Thymidine kinase converts thymidine (thymine bonded to deoxyribose, on the left, below) to the phosphorylated form (TMP, right) so it can participate in DNA synthesis.

It's important to note that thymidine (above, left) is not a synthesis product but a breakdown product. The biosynthetic pathway for TMP actually starts with dUMP, which is converted to TMP via thymidylate synthetase, an entirely different enzyme. Where does thymidine come from, then, and why does a cell need thymidine kinase? The answer is that thymidine is a breakdown product of DNA. When you eat plant or animals cells, the DNA in those cells gets broken down, and thymidine is one of the breakdown products. For cells, thymidine is a valuable product to have, so thymidine kinase recovers free thymidine via the reaction shown above. This is a scavenging pathway, in other words. Most organisms have it, but some don't (e.g., Pseudomonas lacks it).

When a virus attacks a cell, there's a lot of DNA turnover as the virus prepares to manufacture its own DNA, so thymidine kinase is a handy enzyme to have around if you're a virus. Many viruses, as a result, bring their own copy of the TK enzyme gene. But is the viral TK gene derived (in some kind of ancestral way) from the host cell's own TK gene? Not necessarily.

When you gather up the amino acid sequence data for a bunch of TK genes from bacteria and the phages (viruses) associated with them, you find that, phylogenetically speaking, the phage/viral TK genes are not very similar to the host genes.

Thymidine kinase phylo tree for enteric bacteria and their phages. (Click to enlarge.) Note that the phage genes (top cluster) segregate from the host genes in the lowermost branch. Interestingly, TK genes of various alphaproteobacteria of the Rhizobiales class cluster near the phage genes (much nearer than the enteric bacteria). This suggests a primordial origin of the phage genes rather than recent acquisition of TK from host cells.

In the above tree, phage (viral) genes cluster at the top. Host-cell kinases cluster at the bottom. The two clusters may be related to a distant ancestor (not shown), but one thing is certain: the phage versions of this gene are not simply a slight modification of the host gene. We know that's true because, remarkably, the thyK genes of certain alphaproteobacteria (Agrobacterium and its relatives) cluster with the phage genes, even though the bacteriophages are adapted to E. coli and Salmonella (and closely related enterics). In theory, the phage genes should cluster with the enteric bacteria, not with Agrobacterium and Rhizobium.

Where do the phage genes come from? Some have speculated that these phages originated with escaped bacterial secretion-system cassettes. That may well be, but the escape event had to have occurred many hundreds of millions of years ago. The T4 thymidine kinase gene has only 62% sequence homology with the E. coli gene. T4's version of the gene has a G+C content of 34%, with GC3 (third codon base) of just 19%. The E. coli version of the gene has overall G+C of 42% and GC3 of 36%. While it's generally conceded that evolution of viral genes occurs faster than host genes (particularly for RNA viruses and single-stranded DNA viruses), DNA viruses like T4 can't evolve outside of the host cell, as far as we know, and although DNA viruses may evolve faster than host DNA, they don't evolve thousands of times faster. (RNA viruses do evolve thousands of times faster, but that's an entirely different matter.)

To me, the above tree says that enteric phages diverged from the common ancestor of alphaproteobacteria and gammaproteobacteria (the latter include the enteric bacteria). We're talking hundreds of millions of years ago. (Note that E. coli and its closest relatives are thought to have diverged 140 million years ago.) A recent transfer of thymidine kinase genes from enteric bacteria to their phages is not credible. It's far more likely that an ancient precursor of today's T4 phage (and similar enteric phages) had the gene, and passed it down through the ages as the phages adapted to new hosts (first the alphaproteobacteria, then the phylogenetically newer enterics).

In tomorrow's post, I want to explain in detail how the above phylo tree was made and show you how you can make your own phylogenetic trees using the popular Mega6 program. If you've never used Mega6, you're in for a treat.

Thursday, August 15, 2013

Converting an SVG Graph to Histograms

The graphs you get from ZunZun.com (the free graphing service) are pretty neat, but one shortcoming of ZunZun is that it won't generate histograms. (Google Charts will do histograms, but unlike ZunZun, Google won't give you SVG output.) The answer? Convert a ZunZun graph to histograms yourself. It's only SVG, after all. It's XML; it's text. You just need to edit it.

Of course, nobody wants to hand-edit a zillion <use> elements (to convert data points to histogram rects). It makes more sense to do the job programmatically, with a little JavaScript.

In my case, I had a graph of dinucleotide frequencies for Clostridium botulinum coding regions. What that means is, I tallied the frequency of occurrence (in every protein-coding gene) of 5'-CpG-3', CpC, CpA, CpT, ApG, ApA, ApC, and all other dinucleotide combinations (16 in all). Since I already knew the frequency of G (by itself), A, C, and T, it was an easy matter to calculate the expected frequency of occurrence of each dinucleotide pair. (For example, A occurs with frequency 0.403, whereas G occurs with frequency 0.183. Therefore the expected frequency of occurrence of the sequence AG is 0.403 times 0.183, or 0.0738.) Bottom line, I had 16 expected frequencies and 16 actual frequencies, for 16 dinucleotide combos. I wanted side-by-side histograms of the frequencies.

First, I went to ZunZun and entered my raw data in the ZunZun form. Just so you know, this is what the raw data looked like:

0 0.16222793723642806
1 0.11352236777965981
2 0.07364933857345456
3 0.08166221769088752
4 0.123186555838253
5 0.12107590293804558
6 0.043711462078314355
7 0.03558766171971166
8 0.07364933857345456
9 0.07262685957145093
10 0.033435825941632816
11 0.03459042802303202
12 0.055925067612781175
13 0.042792101322514244
14 0.019844425842971265
15 0.02730405457750352
16 0.123186555838253
17 0.12232085101526233
18 0.055925067612781175
19 0.05502001002972254
20 0.09354077847378013
21 0.07321410524577443
22 0.03319196776961071
23 0.028600012050969865
24 0.043711462078314355
25 0.043328337600588136
26 0.019844425842971265
27 0.0062116692282947845
28 0.03319196776961071
29 0.04195172151930211
30 0.011777822917388797
31 0.015269662767317132

I made ZunZun graph the data, and it gave me back a graph that looked like this:

Which is fine except it's not a histogram plot. And it has goofy numbers on the x-axis.

I clicked the SVG link under the graph and saved an SVG copy to my local drive, then opened the file in Wordpad.

The first thing I did was locate my data points. That's easy: ZunZun plots points as a series of <use> elements. The elements are nested under a <g> element that looks like this:

<g clip-path="url(#p0c8061f7fd)">

I hand-edited this element to have an id attribute with value "DATA":

<g id="DATA" clip-path="url(#p0c8061f7fd)">

Next, I scrolled up to the very top of the file and found the first <defs> tag. Under it, I placed the following empty code block:

<script type="text/ecmascript"><![CDATA[
// code goes here

]]></script>

Then I went to work writing code (to go inside the above block) that would find the <use> elements, get their x,y values, and create <rect> elements of a height that would extend to the x-axis line.

The code I came up with looks like this:

// What is the SVG y-value of the x-axis?

// Attempt to discover by introspecting clipPath

function findGraphVerticalExtent( ) { 

   var cp = document.getElementsByTagName('clipPath')[0];

   var rect = cp.childNodes[1];

   var top = rect.getAttribute('y') * 1;

   var bottom = rect.getAttribute('height') * 1;

   return top + bottom; 

}

// This is for use with SVG graphs produced by ZunZun,

// in which data points are described in a series of

// <use> elements. We need to get the list of <use>

// nodes, convert it to a JS array, sort data points by

// x-value, and replace <use> with <rect> elements.

function changeToHistograms( ) {

   var GRAPH_VERTICAL_EXTENT = findGraphVerticalExtent( );

   // The 'g' element that encloses the 'use' elements

   // needs to have an id of "DATA" for this to work!

   // Manually edit the <g> node's id first! 

   var data = document.getElementById( "DATA" );

   // NOTE: The following line gets a NodeList object,

   // which is NOT the same as a JavaScript array!

   var nodes = data.getElementsByTagName( "use" );

   // utility routine (an inner method)

   function nodeListToJavaScriptArray( nodes ) {

       var results = [];

       for (var i = 0; i < nodes.length; i++)

          results.push( nodes[i] );

       return results;

   }

   // utility routine (another inner method)

   function compareX( a,b ) { 

       return a.getAttribute("x") * 1 - b.getAttribute("x") * 1; 

   }

   var use = nodeListToJavaScriptArray( nodes );

   // We want the nodes in x-sorted order

   use.sort( compareX );  // presto, done

   // Main loop

   for (var i = 0; i < use.length; i++) {

       var rect = 

           document.createElementNS("http://www.w3.org/2000/svg", "rect");

       var item = use[i];

       var x = item.getAttribute( "x" ) * 1;

       var y = item.getAttribute( "y" ) * 1;

       var rectWidth = 8;

       var rectHeight = GRAPH_VERTICAL_EXTENT - y;

       rect.setAttribute( "width", rectWidth.toString() );

       rect.setAttribute( "height", rectHeight.toString() );   

       rect.setAttribute( "x" , x.toString() );

       rect.setAttribute( "y" , y.toString() );

       // We will alternate colors, pink/purple

       rect.setAttribute( "style" , 

           (i%2==0)? "fill:ce8877;stroke:none" : "fill:8877dd;stroke:none" );

       data.appendChild( rect ); // add a new rect

       item.remove();  // delete the old <use> element

   }

   return use;

}

As so often happens, I ended up writing more code than I thought it would take. The above code works fine for converting data points to histogram bars (as long as you remember to give that <g> element the id attribute of "DATA" as mentioned earlier). But you need to trigger the code somehow. Answer: insert onload="changeToHistograms( )" in the <svg> element at the very top of the file.

But I wasn't done, because I also wanted to apply data labels to the histogram bars (labels like "CG," "AG," "CC," etc.) and get rid of the goofy numbers on the x-axis.

This is the function I came up with to apply the labels:

   function applyLabels( sortedNodes ) {
 
    var labels = ["aa", "ag", "at", "ac", 
      "ga", "gg", "gt", "gc", "ta", "tg", 
      "tt", "tc", "ca", "cg", "ct", "cc"];

      var data = document.getElementById( "DATA" );
 var labelIndex = 0;

 for (var i = 0; i < sortedNodes.length; i+=2) {
     var text = 
              document.createElementNS("http://www.w3.org/2000/svg", "text");
     var node = sortedNodes[i];
          text.setAttribute( "x", String( node.getAttribute("x")*1 +2) );
          text.setAttribute( "y", String( node.getAttribute("y")*1 - 13 ) );
          text.setAttribute( "style", "font-size:9pt" );
          text.textContent = labels[ labelIndex++ ].toUpperCase();
          text.setAttribute( "id", "label_" + labelIndex );
          data.appendChild( text );
      }
   }

And here's a utility function that can strip numbers off the x-axis:

   // Optional. Call this to remove ZunZun graph labels.
   // pass [1,2,3,4,5,6,7,8,9] to remove x-axis labels
   function removeZunZunLabels( indexes ) {
 
 for (var i = 0;i < indexes.length;i++) 
    try {
   document.getElementById("text_"+indexes[i]).remove();
   }
  catch(e) { console.log("Index " + i + " not found; skipped.");
   }
   }

BTW, if you're wondering why I multiply so many things by one, it's because the attribute values that comprise x and y values in SVG are String objects. If you add them, you're concatenating strings, which is not what you want. To convert a number in string form to an actual JavaScript number (so you can add numbers and not concatenate strings), you can either multiply by one or explicitly coerce a string to a number by doing Number( x ).

The final result of all this looks like:

Final graph after surgery. Expected (pink) and actual (purple) frequencies of occurrence of various dinucleotide sequences in C. botulinum coding-region DNA.

Which is approximately what I wanted to see. The labels could be positioned better, but you get the idea.

What does the graph show? Well first of all, you have to realize that the DNA of C. botulinum is extremely rich in adenine and thymine (A and T): Those two bases constitute 72% of the DNA. Therefore it's absolutely no surprise that the highest bars are those that contain A and/or T. What's perhaps interesting is that the most abundant base (A), which should form 'AA' sequences at a high rate, doesn't. (Compare the first bar on the left to the shorter purple bar beside it.) This is especially surprising when you consider that AAA, GAA, and AAT are by far the most-used codons in C. botulinum. In other words, 'AA' occurs a lot, in codons. But even so, it doesn't occur as much as one would expect.

It's also interesting to compare GC with CG. (Non-biologists, note that these two pairs are not equivalent, because DNA has a built-in reading direction. The notation GC, or equivalently, GpC, means there's a guanine sitting on the 5' side of cytosine. The notation CG means there's a guanine on the 3' side of cytosine. The 5' and 3' numbers refer to deoxyribose carbon positions.) The GC combo occurs more often than predicted by chance whereas the combination CG (or CpG, as it's also written) occurs much less frequently than predicted by chance. The reasons for this are fairly technical. Suffice it to say, it's a good prima facie indicator that C. botulinum DNA is heavily methylated. Which in fact it is.

Saturday, August 03, 2013

Do-It-Yourself Bio-Hacking: A Tutorial

Today I want to show you how you can do a slick bio-hacking experiment, and graph the results nicely, all in your browser, in well under 10 minutes. The following experiment will run just fine in Chrome or Firefox. In Firefox, it helps to have the Firebug extension. (If you're using Firefox, click F12. If it pops a console window, you already have Firebug.) I tested against Chrome v28.0.1500.72 and Firefox 15.0.1 with Firebug 1.9.2. Other combinations may work; those are just the ones I tested.

We're going to do a comparative genomics/proteomics experiment designed to explore amino-acid usage in a particular protein (DnaJ) across a couple dozen bacterial species. Even if you're not a bio-geek, I hope you'll follow along. At the very least, you'll learn how to make pretty graphs from any kind of data using the server at ZunZun.

What is the DnaJ protein, you ask? It's one of a class of proteins known as heat shock proteins, which are produced in response to elevated temperatures. (Your body produces heat shock proteins in response to fever, for example.) As you probably know (or can guess), proteins, in general, are rather sensitive to heat. Even a small amount of heat can cause a protein to start to unravel (or denature). DnaJ and its partners have the job of helping proteins re-fold into their correct original 3D shape(s) after exposure to heat. They're like little repair jigs. A partially damaged protein goes in; it re-folds and comes back out good as new.

Heat shock proteins occur widely, across all domains of life, and their amino-acid sequences are highly conserved; but they do differ. As we'll see right now.

Step 1
Go to http://www.uniprot.org/ and enter "DnaJ" (case doesn't matter) in the search field at the top of the page, then hit Enter. A list of organisms with DnaJ will appear, each with a checkbox on the left. Check all the checkboxes on the page (gang-check them with Shift-click).

Step 2
You'll notice at the bottom of the window there's a green bar with buttons "Retrieve," "Align," "Blast," and "Clear." Click the Retrieve button.

Steps 1 and 2.

Step 3
In the page that comes up, look for FASTA on the left. Under it are two links, Download and Open. Click Open. (See screen shot below.) You'll see a bunch of protein sequences (with one-letter abbreviations for amino acids), each preceded by a line that begins with > (greater-than sign). These are our DnaJ proteins.

Step 3.

Step 4
Click F12 to toggle open the console window. Be sure the Console tab is showing. In Firebug, you may also have to click the Console menu and choose Command Editor from the dropdown list.

Enter and execute (with Enter, in Chrome, or with Control-Enter in Firebug) the following lines of code:

all = document.getElementsByTagName("pre")[0].innerHTML.split(/&gt;/);

all.shift();

It's important that the part between slashes be ampersand-g-t-semicolon, not a greater-than symbol. The browser is showing you greater-than signs but in the HTML markup it's really ampersand-g-t-semicolon, not angle brackets. We actually do want to split on >, not on >.

Note that to execute a line of code in Firebug you have to type Control-Enter. In Chrome, you just type Enter. But in Chrome's JavaScript console, you have to use Shift-Enter to type on more than one line.

The variable all now contains an array of protein sequences. If you want to verify it, type all.length (then Enter, or in Firebug Control-Enter), and you should see the length of the array, 25.

Step 5
Enter the following code in the console (and execute it with Enter; it'll do nothing, which is fine).

function analyze( item ) {

   var sequence = item.split(/SV=\d\n(?=\w)/)[1];
   var lysineCount = sequence.match(/K/g).length;
   var arginineCount = sequence.match(/R/g).length;
   lysineCount /= sequence.length;
   arginineCount /= sequence.length;
   console.log( lysineCount + " " + arginineCount );

}

This is the callback code we'll use to process every member of the all array. Each item in the array consists of a FASTA header followed by a protein sequence. We just want the sequence, not the header, which is why we have a first line that splits off the part we need. The remaining lines obtain the number of lysines (K) and the number of arginines (R) in the protein sequence, then we divide those numbers by the sequence length to get a frequency-of-occurrence. The final line prints the results to the console window.

This function, by itself, doesn't do anything until we run it against each amino-acid sequence in the all array. That's the next step.

Step 6
Enter the following line of code into the console and run it with Enter (or Control-Enter, in Firebug):

all.forEach( analyze );

The console should immediately fill with numbers (25 rows of two numbers each). That's our data. We need to graph it to see what it looks like. Ready?

Step 7
Go now to http://zunzun.com and notice four pulldown menus at the top of the page. Use the far-left dropdown to select Polynomial.

Select Polynomial from the ZunZun function list.

A new window appears with ugly (or beautiful, depending on your mindset) formulas. Click the link to First-Order (Linear) 2D. Why? Because in the absence of any foreknowledge, we're going to blindly assume that our data is best fit by a straight line. If it's not straight-line data, we can come back and change our selection later.

When you click the First-Order (Linear) 2D link, you'll quickly be in a stark-looking window with a single pulldown menu at the top. Click it and select Data Labels for Graphs. Replace "X data" with "Lysine" and "Y data" with "Arginine."

Select Data Labels for Graph.

Step 8
Now use the single pulldown menu to select Text Data Editor.

Quickly go back to that console window and Copy all of your data (all 25 rows of numbers), then Paste the data into the Text Data Editor box.

Click the Submit button near the top of the page. Be patient, as it may take up to 20 seconds or so for your graph to be ready.

You'll know your graph is ready when the window changes to one that shows four pulldown menus at the top. The far-right menu is Data Graphs. Click into it and select Arginine vs. Lysine with model. NOTE: The exact names of the menu items will depend on how you labeled your axes at the end of Step 7 above.

You should see the window change to a view of a graph that looks like this:

Graph created on demand by the ZunZun server.

Pretty easy, right? It gets better. The line that ZunZun drew through the data points is a regression curve that minimizes the sum of squared error. To see the formula for the line, including coefficients, use the far-left menu, called Coefficients and Text Reports, to select Coefficients. Don't worry, your graph will still be there when you're done. To get back to the graph at any time, just use the far-right menu and any of the commands under it (which re-display the graph in various ways).

The graph seems to be saying that Arginine levels go down as Lysine goes up. But how good of a correlation is this, really? Use the far-left pulldown menu again. This time select Coefficient and Fit Statistics. You'll notice a ton of stats (chi-squared and so on). Among them, r-squared is given as 0.637834788057. That means the correlation coefficient, r, is 0.799, which is pretty solid.

I'll save the interpretation of our experiment's results for another time. For now, notice that underneath your ZunZun graph are links for saving the graph either as PNG or SVG. I strongly recommend you save it as both. You can open SVG in both Photoshop and Illustrator (and most browsers too). You will definitely want to keep an SVG version around to edit by hand in your favorite text editor (SVG is just a variety of XML). I'll be showing you how to do lots of sexy things with SVG graphs in upcoming posts.

Friday, July 26, 2013

Getting Started in Desktop Bioinformatics

I've spent about four months now exploring do-it-yourself desktop bioinformatics. Overall, I'm excited by what I've been able to do and I'm optimistic about the prospects for other do-it-yourself desktop-science geeks, because there are tons of great online tools for doing bio-sci and lots of important scientific questions yet to be fleshed out. So I thought I'd share some of what I've learned, and provide some pointers for anyone who might want to try his or her hand at this sort of "citizen science."

It helps to have the benefit of a science education (in particular, a bio education) before beginning, but one of the great things about do-it-yourself desktop science is that you can (and will!) learn as you go. For example, you might have only a bare-bones basic understanding of enzymology before you begin, but as you move deeper into a particular research quest, you'll find yourself wanting to learn more about this or that aspect of an enzyme. So you'll hit Google Scholar and bring yourself up-to-date on this or that detail of a particular subject. That's a Good Thing.

When I first plunged into DNA analysis, I have to admit my knowledge of mitochondria was weak. I knew they had their own DNA, for example, but it wasn't obvious to me (until I started digging) that mitochondrial DNA is pathetically small, whereas the mitochondrial proteome (the superset of all products that go into making up a functioning mitochondrion) is large. In other words, most "mitochondrial genes" are not in mtDNA. They're in nuclear DNA. There are a couple of online databases of nuclear mitochondrial genes (NUMTs, as they're known), but by and large this is an area in dire need of more research. Someone needs to put together a database or reference set of yeast NUMTs, for example. We also need a database for algal NUMTs, another for protozoan NUMTs, another for rice or corn or Arabidopsis NUMTs, etc. Maybe you'll be the one to move such a project along?

So. How can you get started in desktop bioinformatics? I recommend familiarizing yourself with the great tools at genomevolution.org, which is powered by iPlant, which (in turn) is funded by the National Science Foundation Plant Cyberinfrastructure Program here in the U.S. In particular, I recommend you set aside an evening to run through some of the tutorials at genomevolution.org. That'll give you an idea of what's possible with their tools.

Many organisms have genes for flagellum proteins,
but not all such organisms actually make a flagellum.
(The flagellum is the whiplike appendage that gives the
cell motility. Above: Bdellovibrio, a bacterium with a
powerful flagellum.)

If you go to this page and scroll down, you'll find some really interesting short videos showing how to use some of the genomevolution.org tools. They're fun to watch and should stimulate your imagination.

What kinds of problems need investigating by desktop biologists? The sky's the limit. One quest that lends itself to citizen science is looking for examples of horizontal gene transfer (HGT). This requires that you first teach yourself a little bit about BLAST searches. (BLAST searches are sequence-similarity searches that let you compare DNA against DNA or amino-acid sequence against amino-acid sequence.) The strategies involved here can range from simple and brute force to sophisticated; and the great thing is, you can invent your own heuristics. It's a wide-open area. I recently found good evidence (90%+ similarity of DNA sequences) for bacterial gene transfer into rice, which I'll write about in a later post. I'm confident there are thousands of examples of horizontal gene transfer (whether from bacteria to bacteria, bacteria to plant, bacteria to insect, or whatever) waiting to be discovered. You could easily be the next discoverer of one of these gene transfers.

Here are some other ideas for desktop-science explorations:

Find and characterize flagellar genes in organisms that lack motility. If you dig into the literature, you'll find that there are many examples of supposedly immotile organisms (like the intracellular parasite Buchnera, which lives inside aphids) that not only harbor flagellum genes but express some of them—yet have no external flagellum. Obviously, organisms that retain flagellum genes but actually don't make a flagellum (that little whip-like tail that makes single-celled organisms swim around) must be retaining those genes for a reason. The gene products must be doing something. But what? Also: Paramecium and diatoms and other eukaryotes make flagella and/or cilia. Most animals also make cilia. (Ever get a tickle deep in your throat or bronchia? It was probably something tangling with the cilia lining your bronchial system.) What's the relationship between cilia gene products in Paramecium, say, and cilia in animals? Do any plants conceal cilia genes? If so, how are they related phylogenetically to lower-organism cilia?
Migration of genes from parasites to host DNA. A general pattern that seems to happen in nature is: a bacterium or other invader takes up residency inside an animal or plant cell, becoming an endosymbiont; then some of its genes (the symbiont's genes) move to the nucleus of the host cell. Which genes? What do the genes do? That's up to you to try to find out.
Bidirectionally ("bidi") transcribed genes: While rare, there are examples of genes in which each strand of DNA is transcribed into mRNA. (The genome for Rothia mucilaginosa contains many putative examples of this.) Find organisms that contain bidi genes. Try to determine if both strands are actually transcribed. Examine sister-species organisms to see if one strand is transcribed in one organism and the other gene (on the other strand) is transcribed in the other organism.
Phylogenetics of plasmid and viral genes. Try to determine the ancestry of a virus gene. There are good tree-making services online that do all the hard work for you, including protein-sequence alignments. All you have to do is cut and paste Fasta files.
Codon analysis. There are many plants (rice is one) and higher organisms in which DNA is more or less equally divided into high-GC-content genes and low-GC-content genes. Surely the codon usage patterns for each class of gene(s) varies. But how? What are the codon adaptation indexes (CAI values) for the various genes? Create a few histograms of CAI values. Use CAIs and other techniques to try to determine which genes are highly expressed. Are HEGs (highly expressed genes) mostly high-GC? Low-GC? Both? Run some histograms on the genes' purine (A+G) content, G+C content, G+C content by codon position.
Many organisms (and organelles) have extremely GC-poor genomes. Some have bizarre codon usage patterns (where, say, the codon AAA is used 12 times more than the average codon). Some use 56 or fewer codons (out of 64 possible). Find the organelle or organism that uses the fewest codons. See if there's an organism or organelle that uses fewer than 20 amino acids. Which amino acid(s) get(s) left out?
Characterize the DNA repairosome of an aerobic and an anerobic archeon. Compare and contrast the two.
Find all the genes in a particular organism that have mitochondrial-targeting presequences in their DNA.
Pick two closely related organisms. Try to figure out how many million years ago they diverged. Use mitochondrial DNA analysis as well as cytoplasmic protein analysis.
Find the bacterium that has more secretion-protein, permease, and protein-translocation genes than any other. Compare it to its closest relative.
Find an organism that is pathogenic (to humans, animals, or plants). Find its closest non-pathogenic relative. Compare genomes. Determine which genes are most likely to be involved in virulence.
Some seemingly simple organisms (amoebae) have more DNA than a human. Why? What's all that DNA doing there? Does it contain horizontally transferred genes from plants, bacteria, archeons, animals? Are amoebas and other super-large-genome organisms "DNA hoarders"? Are they DNA curationists? Characterize the genes (enumerate by category, first) of these organisms. How many are expressed? How many are junk? What's the energy cost of maintaining that much junk DNA? Can it all be junk? Is an amoeba actually a devolved higher life form that forgot how to do morphogenesis and can no longer develop into a tadpole or whatever?
[ insert your own project here! ]

Sunday, July 21, 2013

Why Mitochondrial DNA is Different

Most genomes that are high in A+T content (or low in G+C content) show a surprising DNA strand asymmetry: The message strand of genes tends to be rich in purines. This rule applies across all domains I've looked at except mitochondria, where message strands tend to be pyrimidine-rich rather than purine-rich. The following two graphs makes this clearer.

This is a graph of message-strand (or RNA-synonymous-strand) purine content plotted vertically, against A+T plotted horizontally, for 1,373 bacterial species. Each dot represents a genome. High-GC/low-AT organisms like Streptomyces and Bordetella are on left and low-GC/high-AT organisms like Clostridium botulinum are toward the right. The few dots on the far right are intracellular endosymbionts that have lost a good bit of DNA over the millennia. They tend to be extremely high in A+T.

Compare the above graph with the graph below, which is the same thing (message-strand A+G vs. A+T) for mitochondrial DNA (N=2543 genomes). There is still an upward slope to the data (and in fact it is steeper than it looks, because the range of y-values is different in the graph below than in the graph above). The slope of the regression line is very nearly the same (0.148 vs. 0.149) for both graphs. But you can see that in the graph below, nearly all the points are below y = 0.50. That means message-strands are high in pyrimidines rather than purines.

I speculated in a previous post that the reason mitochondrial DNA is pyrimidine-heavy on the message strand is that mtDNA encodes a very small number of proteins (13, in all), and they tend to be membrane-associated proteins, which use mostly non-polar amino acids. It turns out that codons for the non-polar amino acids are pyrimidine-rich.

To see if that's really what's going on, I obtained the DNA sequences for cytochrome-c oxidase and NADH dehydrogenase (the two must fundamental enzyme systems of mitochondria) from several hundred bacterial species. Actually, I was able to obtain DNA sequences for a total of 942 bacterial NADH dehydrogenase (subunit L) proteins. I also succeeded in obtaining DNA sequences for 647 bacterial cytochrome-c oxidase subunit 1 proteins. In mitochondria, these genes are known as ND5 and Cox1. In bacteria they're better known as nuoL and cyoB.

The graph below shows A+G for the two enzymes versus whole-chromosome A+T, for the relevant organisms.

Message strand purine content was derived from the DNA sequences of cyoB (pink) genes from 942 bacteria, and from nuoL (blue) genes from 647 bacterial species. The A+G values were plotted against host-organism whole-genome A+T content. All cyoB and nuoL sequences tended to be pyrimidine rich. But pyrimidine content was less for organisms with high A+T content. (Note the slightly positive slope of the regression line.)

The pink points are for cytochrome-c oxidase subunit 1 (cyoB) while the blue points are for NADH dehydrogenase subunit 5 (nuoL). Two things are worth noting. One is that the regression line is upward-sloping, meaning that as an organism's DNA gets richer in A+T content, the purine content on the message strand rises. This effect seems to be universal. The second thing to note is that almost all of the points in the graph lie below y = 0.5, as is the case for mitochondria. These two signature "mitochondrial" enzyme systems, critical to oxidative phosphorylation (in bacteria as well as higher organisms), do tend to use pyrimidine-rich codons—rendering the relevant genes pyrminidine-rich on the RNA-synonymous (message) strand of DNA. The hypothesis is upheld.

For you bio students, a bit of homework: You might want to think about why it is that membrane-associated proteins are rich in non-polar amino acids. (In human mitochondria, leucine and isoleucine are the most-used amino acids. Together they account for an amazing 30% of all amino acids used in mtDNA-encoded gene products.) Hint: Most membranes have a lipid bilayer, and lipids don't like water.

Saturday, July 20, 2013

More about Mitochondrial DNA

To recap my desktop-science experiments of the last month or so, I've found strandwise DNA asymmetry across domains, which is to say in bacteria, Archaea, eukaryotes, viruses, and mitochondrial DNA. In every case except mitochondria, the message (or RNA-synonymous) strand of DNA in coding regions tends to be purine-rich. The opposite strand tends to be pyrimidine-rich. Moreover, in all domains, including mitochondria, message-strand purine content increases in proportion to genome A+T content. (A+T content is a phylogenetic signature. Some genomes are inherently high in A+T content—or low in G+C content—while others are not. Related organisms tend to have similar A+T or G+C contents.)

Mitochondrial genes tend to be pyrmidine-rich on the message strand, seemingly in violation of the finding that in all other domains, message strands are purine-rich. The mitochondrial anomaly is actually very easy to understand (although it took me weeks to realize the explanation). In a nutshell: Mitochondrial DNA is pyrimidine-rich on message strands because mtDNA encodes only a few proteins (13, usually), all of them membrane-associated. Membrane-associated proteins are unusual because they tend to incorporate mostly non-polar amino acids such as leucine, isoleucine, valine, proline, alanine, or phenylalanine—all of which are specified by pyrimidine-rich codons.

The mitochondrion.

It seems to me mitochondrial DNA shouldn't be thought of as a genome, because well over 90% of mitochondrial-associated gene products are encoded by genes in the host nucleus. (In humans, there may be as many as 1500 nuclear-encoded mitochondrial genes.) This point is worth repeating, so let me quote Patrick Chinnery, TRENDS in Genetics (2003) 19:2, 60:

The vast majority of mitochondrial proteins (estimated at >1000) are synthesized in the cytosol from nuclear gene transcripts.

The circular mitochondrial "chromosome" (if it can be called that) is the vestigial remnant of a much larger genome that long ago migrated to the host nucleus, no doubt to avoid oxidative attack. The mitochondrion simply is not a safe place to store DNA. (Would you set up a sperm bank in a rocket-fuel factory?) It's teeming with molecular oxygen, superoxides, peroxides, free protons, and other hazardous materials.

The human mitochondrial chromosome.

Human mitochondrial DNA (which is typical of a lot of mtDNA) encodes just a handful of multi-subnit transmembrane proteins, namely: cytochrome-c oxidase, NADH dehydrogenase, cytochrome-b, and an ATPase. That's it. There are no other protein genes in human mtDNA. All other "mitochondrial proteins" are encoded somewhere else. (That includes 37 out of 44 subunits of the NADH dehydrogenase complex; the DNA polymerase that replicates mitochondrial DNA; the mitochondrial RNA polymerase; about 50 ribosomal proteins; so-called "mitochondrial" catalase; and hundreds of other "mitochondrial" proteins. All are encoded in the nucleus.)

Bottom line: Mitochondrial DNA encodes a very small ensemble of highly specialized membrane-associated proteins. We shouldn't expect this small ensemble to be representative of other genes found in other genomes. (And it's not.) That, in a nutshell, is why mtDNA is not particularly purine-rich in message strands.

But we should test this hypothesis, if possible. (And it is, in fact, possible.) Most bacteria are aerobic, which means most bacterial species have genes for cytochrome-c oxidase, NADH dehydrogenase, etc. The DNA for those genes should be similar to mtDNA with respect to strand-asymmetric purine content. If we analyze bacterial DNA, we should find that genes for cytochrome-c oxidase, NADH dehydrogenase, etc. are pyrimidine-rich on the message strand, just as in mtDNA.

In tomorrow's post: the data.