Showing posts with label bioinformatics. Show all posts
Showing posts with label bioinformatics. Show all posts

Tuesday, June 03, 2014

Odd Structures in a Prophage Genome


Contrary to what it looks like, this is not an arrival-gate diagram for an eastern European airport. It's a diagram of secondary structure of a portion of the mRNA for the yoqJ gene in B. subtilis phage SPBc2. Click to enlarge.
The common soil bacterium Bacillus subtilis has long been known to harbor an inducible prophage (virus) called SPBc2. Normally, SPBc2 is dormant in the Bacillus genome, but heat or treatment with mitomycin C will cause the phage to express itself and start a lytic infection cycle. Like many bacterial viruses, phage SPBc2 has a sizable double-stranded DNA genome, consisting (in this case) of 134,416 base pairs encoding 185 genes. It's one of the strangest collection of genes you'll ever want to meet.

The SPBc2 genome is strange, first of all, in its base composition. The following stats come from analysis of the coding regions of the genome:

Base Abundance
A 0.3670
T 0.2825
G 0.1983
C 0.1506

Notice how the purines, A and G (adenine and guanine) are about 30% more abundant than the pyrimidines, C and T (cyotsine and thymine). This is a stark violation of Chargaff's Second Parity Rule (which says A=T and G=C not only for double-stranded DNA but for a single strand, as well).

When purine abundance is tallied on a per-gene basis, you get the following histogram:

Purine abundance for N=185 protein-coding genes in B. subtilis phage SPBc2. Only 8 genes contain less than 50% purine bases.
All but 8 (out of 185) genes have a message-strand purine content above 50%.

And if you look at the gene sequence data, the genes even look funny to the naked eye, containing, as they do, long runs of bases, with funny repeats, like TTGAAAGGAAAAAAAGACGGCCTAAATAAA. One gets the feeling, when  examining the sequence data, that one is looking at tRNA or rRNA data rather than protein-coding sequences. And yet, they really are protein-coding sequences.

But it turns out there's a lot of secondary-structure info in the sequences. (See the picture at the top of this post.) Many of the genes contain self-complementing intragenic regions. I decided to verify this with some custom scripts. First, I had scripts look inside each gene for length-10 nucleotide sequences with a matching reverse-complement sequence further downstream (in the same gene). I expected to find one or two (or ten) hits, based on the fact that any given length-10 sequence of the bases A, G, C, and T can turn up at random in about one in every million base pairs. (Each base is worth about two bits, so a length-10 DNA sequence can encode as much info as a length-20 binary sequence; 2-to-the-20th is a little over a million.) What I found was an astounding 419 complementary length-10 sequences within 89 genes.

When I upped the sequence length to 12, I still found 64 intragenic complement sequences in 38 genes. A length-12 match should happen randomly about once every 16 million base pairs. The SPBc2 genome (as I said) is 134,416 base pairs long.

The high degree of internal complementarity means that when the phage's DNA strands are separated during transcription or replication, they probably assume very particular 3-dimensional structures, and also, the messenger RNA made from the phage's genes are probably rich in secondary structure. Exactly why those structures are needed is anyone's guess. They may protect the virus from restriction nucleases (although frankly this chore is already taken care of by the prophage's own methylases). Or they may attract ribosomes in some special way (many of the genes form tRNA-lookalike structures). They may help with virion packaging. Or they may do nothing special at all. I kind of doubt the latter possibility, though.

Sunday, June 01, 2014

A Gene Heatmap

Lately I've been using the great tools at genomevolution.org plus custom Canvas API scripts to render colorful heatmaps of aligned genes from phylogeneticaly diverse microorganisms. The following graphic is one such.
Glyceraldehyde-3-phosphate dehydrogenase genes of N=134 bacterial species, arranged in order of gene G+C content (high GC at the top). Hot colors are G and C. Cool colors are A and T.

What are we looking at? This is actually a composite rendering of the glyceraldehyde-3-phosphate dehydrogenase genes (DNA sequence info) from 134 bacterial species. Each gene is painted left-to-right (5' to 3') in a strip 4 pixels tall, with hot colors assigned to DNA bases G and C (guanine and cytosine), and cool colors assigned to bases A and T (adenine and thymine). Wherever there's a G or C, it gets painted red or red-orange. Wherever there's A or T, blue or blue-green. Same gene, 134 versions, varying significantly in G+C content. (The gene GC content ranges from a maximum of 69.2% at the top to 29.4% at the bottom.)

Why glyceraldehyde-3-phosphate dehydrogenase (GAPDH)? No real reason, except that it's a fairly universal (indeed, quite ancient) metabolic enzyme, reasonably compact (making possible a rendering that's not super-wide, as it would be for a larger gene), well-delineated genetically (not a fusion protein or an enzyme with multiple isoforms), and probably representative of a good many core metabolic enzymes. This is the enzyme that catalyzes the sixth step of glycolysis (sugar-breakdown). You may recall from Biochem 101 that the breakdown of glucose proceeds by splitting the twice phosphorylated molecule into two 3-carbon pieces. The triose phosphates in turn get phosphorylated by GAPDH before they transfer a phosphate to ADP to yield ATP, the 5-hour energy drink of all cells everywhere.

Alignment of genes was done via ClustalW in MEGA6 freeware. Rendering of the alignment FASTA file took about two seconds, in the browser, using 133 lines of custom JavaScript.

Saturday, May 31, 2014

Not All Organisms Use Amino Acids the Same Ways

Organisms vary greatly in the GC (guanine plus cytosine) content of their DNA, and yet all organisms can still make ribosomal proteins, DNA and RNA polymerases, and the various other essential proteins of life, no matter what their DNA vocabulary limitations might be. A high-GC organism like Streptomyces can make a given enzyme (DNA polymerase, say) using mostly G and C bases in its DNA, but a low-GC organism like Clostridium botulinum can also make the same kind of enzyme, even though it uses mostly A and T in its DNA. How is this possible?

It's possible in part because of the many synonyms for amino acids available in the genetic code. But it's a mistake to think the same amino acids are used in equal numbers by high-GC organisms and low-GC organisms. Organisms at opposite ends of the GC spectrum use different amino acids.

I was curious to see which amino acids correlate most strongly with genomic GC, so I gathered codon usage statistics for 109 organisms of widely varying genomic GC content and used JavaScript to calculate Pearson correlation coefficients for all 20 amino acids with respect to  GC content. The results are shown in the following table.

TABLE 1. Correlation (r) between amino acid usage and genome GC content (N=109 organisms). 

Code
Amino Acid
r
A
Alanine (Ala)
0.9634
R
Arginine (Arg)
0.9495
G
Glycine (Gly)
0.9472
P
Proline (Pro)
0.9436
V
Valine (Val)
0.7725
W
Tryptophan (Trp)
0.7497
H
Histidine (His)
0.4660
L
Leucine (Leu)
0.3364
D
Aspartic Acid (Asp)
0.3347
T
Threonine (Thr)
0.3099
C
Cysteine (Cys)
-0.1280
Q
Glutamine (Gln)
-0.1668
M
Methionine (Met)
-0.2863
E
Glutamic Acid (Glu)
-0.4621
S
Serine (Ser)
-0.6831
F
Phenylalanine (Phe)
-0.8550
Y
Tyrosine (Tyr)
-0.8983
K
Lysine (Lys)
-0.9389
N
Asparagine (Asn)
-0.9391
I
Isoleucine (Ile)
-0.9558

Ten amino acids correlate positively with GC and ten correlate negatively. Alanine and arginine have the strongest positive correlation with GC, while isoleucine and asparagine have the strongest negative correlation with genomic GC content. (But note that these data apply only to the 109 organisms studied. For the complete list of 109 organisms, see this post.)

If you were to extract all the amino acids out of Clostridium botulinum (28% GC), you would get far more lysine than alanine. Conversely, if you were to hydrolyze all the proteins in Streptomyces griseus (GC 72%), you would find far more alanine than lysine.

Interestingly, serine has six synonymous codons (AGT, AGC, CTA, CTG, CTC, CTT) and can just as easily be specified with G and C as with A and T; so overall, you'd expect little correlation with genomic GC. And yet serine use correlates strongly with low GC. In a sense, this is not surprising. Certain low-GC organisms (like Streptococcus) are known to produce serine-rich cell-coat proteins, some of which are important determinants of pathogenicity. But it may simply be that the high utilization of serine in low-GC organisms is related to one-carbon chemistry. Serine, after all, is the source of the methyl group that, by way of methylenetetrahydrofolate, converts dUMP to TMP (thymidine monophosphate, a DNA precursor). Any organism whose DNA is unusually rich in thymine (low in GC) will almost certainly be processing large quantities of serine. Serine is also a carbon source in the biosynthetic pathways for cysteine and methionine, both of which, like serine itself, are negatively correlated with genomic GC content.

Tuesday, May 27, 2014

An Evolutionary Debate that Misses the Point

One of many hotly debated topics in evolutionary biology is how codon usage bias (preference of an organism for certain codons, when other, synonymous variants are available) relates to transfer-RNA abundance. It's clear the two are related; no one disagrees on that. The question is whether codon usage bias is an outcome of tRNA abundance ratios, or the reverse.

What I think most people are missing in this argument is that the whole discussion might very well be mooted by a huge factor in tRNA evolution that no one seems to be taking into account. I'm talking about the fact that tRNAs are insertion targets for various kinds of mobile genetic elements, from phages to plasmid-borne genomic islands to transposons. As Jörg Hacker and Elisabeth Carniel point out in "Ecological fitness, genomic islands and bacterial pathogenicity" (EMBO Reports, 2001):
Genomic islands are part of the flexible bacterial gene pool and are somewhere between 10 and 100 kilobases (kb) in length. They frequently harbor phage- and/or plasmid-derived sequences, including transfer genes or integrases and IS elements. These particular blocks of DNA are most often inserted into tRNA genes and may be unstable.
(Emphasis added.) Transfer RNAs are constantly being "inserted into" (and next to, not always into) by mobile elements, a phenomenon that's been well studied not only in bacteria but in yeast and elsewhere. Over evolutionary timespans, tRNA genes are duplicated, then disrupted, over and over again, by mobile DNA elements. These elements (whether from phages, viruses, transposons, or what have you) are known to have played (and continue to play) a significant role in shaping genome diversity, across all taxa. This is not a trivial factor, in other words. Transfer RNA genes are insertion hotspots. Surely the patterns of tRNA disruption caused by gene-hopping, over the eons, cannot be unimportant in the determination of codon usage patterns.

Many examples can be found of ancient tRNA signatures inside the tail ends of protein genes, no doubt leftovers from millions of years of insertion events.

Monday, May 26, 2014

Shannon Entropy and DNA

Claude Shannon made an important finding when he realized that the information contribution of a symbol could be estimated very simply as -f(x) log(f(x)), where f(x) is the frequency of occurrence of the symbol x. For example, a series of a coin tosses can be considered a binary information stream with symbol values H and T (heads and tails). If the frequency of H is 0.5 and f(T) is 0.5, the entropy E, in bits per toss, is -0.5 times log (base 2) 0.5 for heads, and a similar value for tails. The values add up (in this case) to 1.0. The intuitive meaning of 1.0 (the Shannon entropy) is that a single coin toss conveys 1.0 bit of information. Contrast this with the situation that prevails when using a "weighted" or unfair penny that lands heads-up 70% of the time. We know intuitively that tossing such a coin will produce less information because we can predict the outcome (heads), to a degree. Something that's predictable is uninformative. Shannon's equation gives -0.7 times log(0.7) = 0.3602 for heads and -0.3 * log (0.3) = 0.5211 for tails, for an entropy of 0.8813 bits per toss. In this case we can say that a toss is 11.87% redundant.

Claude Shannon
DNA is an information stream resembling a series of four-sided-coin tosses, where the "coin" can land with values of A, T, G, or C. In some organisms, the four bases occur with equal rates (25% each), in which case the DNA has a Shannon entropy of 2.0 bits per base (which makes sense, in that a base can encode one of 22 possible values). But what about organisms in which the bases occur with unequal frequencies? For example, we know that many organisms have DNA with G+C content quite a bit less than (or in some cases more than) 50%. The information content of the DNA will be less than 2 bits per base in such cases.

As an example, let's take Clostridium botulinum (the source of "Botox" serum), a soil bacterium with unusually low G+C content, at 28%. If we go through the organism's 3,404 protein-coding genes, we find actual base contents of:

A    0.40189
T    0.30603
G    0.18255
C    0.10840

These numbers are for a single strand (the coding strand or "message" strand) of DNA, which is why A and T aren't equal. For whole DNA, of course, A =T and G = C, but that's not the case here. We're just interested in the message strand.

If we put the above base frequencies into the Shannon equation, we come up with a value of 1.8467 for the information content (in bits) of one base of C. botulinum DNA. The DNA is about 7.67% redundant on a zero-order entropy basis. The DNA may be over 70% A and T, but it's a long way from being a two-base (one bit) information stream. Each base encodes an average of 1.8467 bits of information, which is a surprising amount (surprisingly close to 2.0) for such a skewed alphabet.

Sunday, May 25, 2014

Chiggers, Scrub Tyhpus, and Pseudogenes

If you've ever been bitten by tiny red bugs in the garden, you're familiar with members of the Trombiculidae, a family of mites known variously as berry bugs, harvest mites, red bugs, scrub-itch mites, aoutas, or (in the southern U.S.) "chiggers."

In the United States., the garden-variety chigger is basically harmless, but in much of the world this tiny arthropod comes with a very nasty endosymbiont known as Orientia tsutsugamushi, which is a bacterium related to the Rickettsia organisms that cause various tick-borne diseases. Throughout much of the Orient, O. tsutsugamushi infections (from chigger bites) cause scrub typhus, which begins with a rash and fever but can progress to a cough, intestinal distress, swelling of the spleen, abnormal liver chemistry, and ultimately pneumonitis, encephalitis, and/or myocarditis and even death. Treatment with doxycycline, azithromycin, or chloramphenicol is usually successful.

The "harvest mite" (chigger) can carry scrub
typhus, although U.S varieties are typically harmless.
The sequenced genome for O. tsutsugamushi is available, and if you go to this link and click on "Click for features" at the bottom of the Dataset Information box you should be able to open up a table that shows the organism as having 1,182 protein-coding genes (quite a small number), plus an additional 1,994 pseudogenes (quite a huge number, by comparison). The "DNA Seqs" links in the table will let you download the DNA sequences of all the organism's genes and pseudogenes.

This is an extremely unusual situation, in that we're dealing with a bacterium that has more pseudogenes (switched-off, defunct, damaged genes) than regular genes, something that can be said of no other bacterium of which I'm aware. The leprosy bacterium (Mycobacterium leprae) is famed for having approximately 1100 pseudogenes and 1604 "normal" genes. Astonishingly, Orientia tsutsugamushi reverses that ratio, and then some.

We don't know for sure how old Orientia tsutsugamushi's pseudogenes are. A standard rule of thumb in biology is that microbial genomes experience one spontaneous mutation per chromosome per 300 generations. But this doesn't really help us decide how old Orientia's pseudogenes are, since the pseudogenes probably didn't arise one by one, indepedently, through accumulation of random mutations. More than likely, a massive pseudogenization event caused the simultaneous deactivation of a large, unknown number of the organism's genes (of which 1100 survive today as pseudogenes), much the same as has been hypothesized for M. leprae. We have good reason to believe M. leprae's pseudogenes are at least 9 million years old. It seems likely that the pseudogenes in Orientia are also quite old, or at least not terribly new.

To get more perspective on this, I analyzed Orientia's pseudogenes from a couple of perspectives. What I found, first of all, is that the pseudogenes are shorter than their non-pseudo counterparts, averaging 700 bases in length (versus 879 for normal genes). This is similar to the case with M. leprae (where pseudogenes are 795 bases long and normal genes average 1,098). The average shorter gene length for Orientia vis-a-vis M. leprae is consistent with the fact that this is a greatly gene-reduced low-GC (30.5%) endosymbiont, whereas the Mycobacterium family is (in theory) free-living, with higher GC content (57.8% for M. leprae; 65% or more for tuberculosis species).

I've written before about the fact that in most genes, in most organisms, codons tend to begin with a purine base. Therefore I decided to look at purine usage in base one of normal-gene codons versus pseudogene codons (pseudocodons?), finding the following distribution in normal genes:

Purine usage in base one of codons in Orientia tsutsugamushi (N=346,326 codons). No pseudogenes were included in this graph. See the next graph (below) for pseudogenes.

This graph leaves little doubt that most codons begin with a purine (A or G). The median AG1 value is 63.8%. Very few proteins lie to the left of x=0.50, and frankly some of those are probably misannotated as to reading frame.

The situation with pseudogenes is quite a bit different:

Purines in codon base one (AG1) of pseudogenes (N=462,933 codons) in Orientia.

Here we see that purine usage in codon base one is not as strong (median 58.4%), although clearly, plenty of codons still show AG1 above 60%, implying that many pseudogenes are still "in frame" (not frameshifted).

Interestingly, AG1 is not only higher in normal-gene codons than in pseudogene codons, it's also higher in codons associated with proteins of known function than for "hypothetical protein" genes. Only 41.3% of pseudogene codons have AG1 greater than 60%, whereas 66.7% of "hypothetical protein" genes have AG1 > 60% and 84.3% of genes with functional assignments have codon AG1 greater than 60%. This implies that some genes annotated as hypothetical proteins may, in reality, be pseudogenes that are incorrectly annotated. I'll return to that topic some other time.



Friday, May 23, 2014

Looking for LUCA

In 1964, Emile Zuckerkandl and Linus Pauling wrote a paper (published the following year) for the Journal of Theoretical Biology suggesting the use of amino-acid and nucleic-acid sequences for deducing phylogenetic relationships. Ever since then, biologists have been trying to use sequence data to get to the root of the tree of life. Darwinian logic says that at some point, all cells had to have diverged from a Last Universal Common Ancestor (LUCA). Unfortunately, as pointed out by Doolittle and others, the quest for LUCA is greatly complicated by mutational saturation effects, reductive genome loss in important members of the most ancient taxa, convergent evolution, and non-negligible (yet difficult to estimate) amounts of horizontal gene transfer, among other serious problems.

An evolutionary tree of life based on analysis of N=420 genomes of free-living organisms. Proteomes are taxa and protein fold superfamilies are character data. Adapted from Kim and Caetano-Anollés, BMC Evolutionary Biology (2011), 11:140. Click to enlarge. See text for discussion.

The difficulty (I won't say folly) of trying to construct a well-rooted tree of life is made evident in various failed attempts to trace common descent via protein sequences. In January 2010, a few months after the sequencing of the 1000th bacterial genome, Karin Lagesen, Dave W. Ussery, and Trudy M. Wassenaar published a paper in which they expressed surprise over the fact that when they looked all 1,000 then-existing genomes (the number is now more than 10 times that), they could not find a single protein that was conserved across all bacteria. (Here, "conserved" means >50% amino-acid sequence identity.) Harris et al. took a slightly different approach, using the Clusters of Orthologous Groups (COG) database to search for universally conserved genes that follow the same phylogenetic patterns as ribosomal RNA (and therefore might constitute the ancestral genetic core of today's cells). The upshot:
Of the roughly 3100 COGs analyzed, only 80 were found to occur in all organisms. Fifty of these universally present genes showed the same phylogenetic relationships as rRNA.
Harris et al. found that the majority of universally conserved three-domain COG genes (37 of 50) are physically associated with the ribosome. Surprisingly, they found that "relatively few genes encoding proteins involved in DNA replication or transcription from DNA to RNA proved to be three-domain." In particular, RNA polymerases (except for certain subunits) did not follow rRNA distribution patterns and are not conserved across the three domains of life (archaea, bacteria, and eukaryotes). Moreover, the only component of the replicative DNA polymerase in modern cells that was found to be conserved across domains was DnaN (COG0592), the gene for the “sliding clamp.”

These disappointing results are understandable and perhaps expected, given the huge amount of deck-reshuffling that's happened in three billion years. It might well be that genome sequence data, with its constant churn, represents the wrong level of granularity for deep-phylogenetic studies. What matters for organisms, after all, is function, and function is an outcome of protein tertiary structure, not just primary structure.

With that in mind, Kyung Mo Kim and Gustavo Caetano-Anollés in 2011 published a brilliant study in BMC Evolutionary Biology called "The proteomic complexity and rise of the primordial ancestor of diversified life," relying on major structural motifs as the unit of phylogenetic discrimination. Defining protein domains at the highly conserved fold superfamily (FSF) level of structure, Kim and Caetano-Anollés used an iterative, parsimony-based phylogenomic approach to reconstructing FSF repertoires as upper and lower bounds of a presumed urancestral proteome ("ur" here meaning universal). Their conclusion:
The minimum urancestral FSF set reveals the urancestor had advanced metabolic capabilities, was especially rich in nucleotide metabolism enzymes, had pathways for the biosynthesis of membrane sn1,2 glycerol ester and ether lipids, and had crucial elements of translation, including a primordial ribosome with protein synthesis capabilities. It lacked however fundamental functions, including transcription, processes for extracellular communication, and enzymes for deoxyribonucleotide synthesis. Proteomic history reveals the urancestor is closer to a simple progenote organism but harbors a rather complex set of modern molecular functions.
The paper is quite long (14,700 words) and often relentlessly technical, but convincingly restores the quest for LUCA to the firm empirical grounding that such a quest seemed (for a while) to have been robbed of after Doolittle's "Uprooting the Tree of Life" and Dagan and Martin's "The Tree of One Percent." 

While parasitic microorganisms were found to occupy some of the most ancient branches of the superkingdom tree, Kim and Caetano-Anollés nevertheless decided to omit such organisms from their study since reductive evolution (wholesale loss of entire families of enzymes and their control systems) might otherwise queer the results. The final set of free-living organisms included 48 archaeal, 239 bacterial, and 133 eukaryotic members. To avoid potential problems with long-branch attraction, the researchers wisely sampled (at random) equal numbers of proteomes per superkingdom and replicated trees of proteomes, so that bacterial data (which of course predominated) wouldn't swamp archaea or eukaryota.

Among the many fascinating findings in the study:
  • The earliest start of organismal diversification occurred sometime between 2.91 and 2.03 billion years ago.
  • Translation had metabolic origins. It appeared only after the emergence "of a large number of metabolic functions, but before enzymes necessary for the synthesis of DNA."
  • Proteomic analysis of extant fold superfamilies (FSFs) showed that "over 200 additional FSFs are necessary in urancestral FSF sets to account for the complexity of the simplest organism in existence today."
  • None of the domains present in ribonucleotide reductase (RDR) enzymes was present in the min_set (representing the LUCA lower bound of complexity). Further, "We note that the reduction of ribonucleotides to deoxyribonucleotides involves the production of an active site thiyl radical that requires contacts with cysteines in all protein domains of the catalytic subunit of the oligomeric enzymatic complex, suggesting modern ribonucleotide reductase functions is [sic] indeed derived."
  • Commenting on the known active-site domain homology between class III ribonucleotide reductase and pyruvate formate lyase (a link proposed to have mediated the RNA-to-DNA biological transition), Kim and Caetano-Anollés point out that phylogenomic analysis at the fold-family level suggests the pyruvate formate-lyase domain emerged later than its ribonucleotide reductase counterpart. Therefore it's likely that the urancestor stored genetic information as RNA and not DNA.
Kim and Caetano-Anollés note: "The urancestor had an advanced metabolic network, especially rich in nucleotide metabolism enzymes, had primordial pathways for the biosynthesis of membrane glycerol ether and ester lipids, crucial elements of translation, including amino-acyl tRNA synthases, regulatory factors, and a primordial ribosome with protein synthesis capabilities. It lacked however transcription and in advanced evolutionary stages stored genetic information in RNA (not DNA) molecules."

The authors have many interesting things to say about the evolution of archaeal and bacterial membrane-lipid chemistry (and much else). If you're a biologist and you haven't yet read the Kim and Caetano-Anollés paper, do yourself  favor and take a look at it now. It's a fascinating read, no matter what side of the LUCA fence you're on.

Thursday, May 22, 2014

Which Direction Does the Gene Point?

A maddening problem in genome annotation is determining the "sense" strand for a gene, especially when the gene is short and/or the genome has a high GC content (and thus contains few or no stop codons in reverse translation). To convince yourself this is a very real and serious problem, all you have to do is browse a few genomes to see the ridiculously high number of "hypothetical proteins" (over 40% in some genomes), bogus overlaps, genes that score BLAST hits in reverse-complement mode but not frame zero, and other artifacts that are a direct result of the aforesaid problem.

I've presented examples of this problem before, but just so there's no confusion, I want to show you a particularly maddening example so you can see what I'm talking about. (Then I'll suggest a solution.) The following graphic shows a region of E. coli UTI89 in which several genes are shown as overlapping (that is to say, existing on opposite strands of DNA in the same coverage area). Small overlaps sometimes happen between genes, but whole genes rarely, if ever, overlap, and never in clusters. The situation shown below is bogus, but you see it all the time in public genomes. In fact, some of the genes shown below also show up as overlaps in Mycobacterium abscessus M93 (see gene OUW_18941), Citrobacter koseri strain ATCC BAA-895 (gene CKO_00072), and quite a few others. Glimmer has choked on this exact situation many times, in many genomes.  

A region with overlapping genes in the genome of E. coli UTI89.

The big gene on the top strand, middle, is UTI89_C4288 (DNA sequence here). It's annotated as (what else?) a "hypothetical protein." The M. abscessus version of the gene (here) is marked as a "cellobiose phosphorylase," and you can find many BLAST hits (the Rothia version gives an E-value of 6.0×10-101) for similar "cellobiose phosphorylase" genes in other organisms at UniProt.org and elsewhere. Of course, they're all bogus and represent Glimmer choke points, but the question is how one can determine that, and be sure about it.

E. coli's hypothetical protein (UTI89_C4288) has a wobble-base GC percentage of 64.3%, whereas the gene on the opposite strand (just below it, pointing left), namely UTI89_C4287 (marked as "membrane-bound ATP synthase F1 sector alpha-subunit"), has GC3 = 54.5%. In a much higher-GC organism like Mycobacterium or Pseudomonas, you would find out which gene has the higher GC3 percentage and crown it the winner (and most of the time, you'd be right). In this case, it's not so simple. The gene with the higher GC3 value isn't necessarily the winner.

Of course, in this particular example, you can cheat and look at the identities of the genes in the immediate vicinity of the hypothetical protein, on the bottom strand, and if you do, you'll find that all of the bottom-strand genes are ATPase subunits. Mystery solved, right? Sure, in this particular case. But what about situations where overlapping genes are all shown as "hypothetical protein"? (You can find many such cases in the genome for Burkholderia pseudomallei strain 1710b, for example.) When a hypothetical overlaps a hypothetical in a low-GC genome, then what?

One of my favorite cheats (but this isn't the final solution!) is to check the gene's AG1 percentage (adenine plus guanine, codon base one). This percentage averages ~60% in something like 90% of protein-coding genes. The problem is, AG1 is often 60% whether you read the gene forward, or backward (off the antisense strand). The reverse complement of a gene usually has high AG1, because the forward AG3 is usually under 50%.

Almost any trick you can dream up will fail under edge cases. GC3 is helpful, but only in high-GC genomes. AG1 is helpful, but only sometimes. Shine Dalgarno signals are not universally used by all organisms, and even in those that do use them, they're usually reserved for highly conserved genes encoding things like ribosomal proteins. Gene context is helpful in some cases but not others.

It turns out, the best clue for positively identifying the correct strand and correct reading frame is codon usage frequency patterns. If you know what the codon frequencies are, genome-wide, for a given organism, you can use this information to good advantage, even if the genome (and therefore the codon table) contains inaccuracies. As long as the codon frequencies are approximately correct, you can use them to verify the reading frame of a protein-coding gene.

The algorithm I came up with is very simple, yet effective. For a given gene, read each triplet of bases sequentially, and score each triplet twice: keep two scores going. First, score it according to its frequency in the codon table for the organism. Then score it according to a second table developed for reverse-complement codons.

The following table shows codon frequencies in Caulobacter crescentus NA1000. If you were to encounter a "hypothetical protein" gene in Caulobacter, and you couldn't decide whether the strand assignment was correct or not, first develop a score for the gene by reading its triplets and adding the frequency value of the corresponding codon to the running total. For example, if you encounter the triplet "CTG," add 6.84 to the score (see table). For every occurrence of CTT, add 0.60, for CTC add 1.70, and so on, using the values in the table.

Codon frequencies for Caulobacter crescentus NA1000.
But you also have to create an anticodon frequency table as follows: For every codon in the original table, apply the same score to the corresponding reverse-complement codon in the "antcodon table." E.g., for CTG, the first table would contain 6.84 (as above), but the value 6.84 would apply to CAG (the reverse complement of CTG) in the second table. I call the first table the "forward" table and the second table the "back" table. One represents the frequencies of codons encountered in protein genes in the forward reading direction. The other represents those same frequencies applied to the reverse-complement of the codons (the same codons read in the reverse direction, off the opposite DNA strand).

When scoring an unknown gene, you tally a "forward table" score, and keep a separate score using the "back" table. When you're done, the gene's "forward table" score should be greater than the "back table" score. If it's not, you're reading the gene off the wrong strand.

When I scored all 3,737 C. crescentus CB15 genes using this technique, I found 136 genes that gave a "back" score higher than the "forward" score. Interestingly, when I checked the identities of those 136 putative "backwards-annotated" genes, 132 of them were listed as "hypothetical proteins." Only four genes with assigned functions gave suspect scores, and one of those (CC_0662) turns out to be a 100%-identity match for the reverse of gene CCNA_00700 in Caulobacter crescenstus NA1000.  The other three are less than 200 bases long and could well be non-coding regions.

For a more challenging test, I turned to the genome of Rothia mucilaginosa DY-18, one of the most disastrous annotation nightmares of all time. In the genome for DY-18 you will find 524 protein-coding genes (out of 1,905 total) that are involved in significant overlaps. (Some overlaps are 2-on-1, some are 1-on-1; but the genome is almost certainly overannotated by at least 260 genes.) I trained my program on the codon usage table of R. mucilaginosa M508 (which contains fewer overlaps than DY-18), then tallied codon and anticodon scores on all of DY-18's CDS genes. In the end, 276 genes gave scores indicative of a reversed reading frame. Of those, 265 were, in fact, involved in overlaps.

Codon scoring is such an effective method, I don't know why programs like Glimmer don't use it. It's quite obvious they're not using it, though, because every genome has reverse-annotated genes (by the hundreds, in some cases) that are easily detected using this simple method.

Here, for the record, are the Caulobacter crescentus CB15 genes that appear to be annotated on the wrong strand:

CC_0023
CC_0048
CC_0073
CC_0099
CC_0149
CC_0354
CC_0480
CC_0546
CC_0564
CC_0605
CC_0662
CC_0666
CC_0676
CC_0677
CC_0680
CC_0681
CC_0687
CC_0728
CC_0739
CC_0775
CC_0782
CC_0786
CC_0825
CC_0850
CC_0853
CC_0913
CC_0987
CC_0996
CC_0997
CC_1020
CC_1022
CC_1031
CC_1032
CC_1050
CC_1069
CC_1073
CC_1084
CC_1094
CC_1123
CC_1127
CC_1161
CC_1174
CC_1212
CC_1222
CC_1238
CC_1245
CC_1274
CC_1312
CC_1322
CC_1340
CC_1349
CC_1392
CC_1393
CC_1394
CC_1395
CC_1414
CC_1416
CC_1513
CC_1561
CC_1648
CC_1789
CC_1793
CC_2000
CC_2086
CC_2116
CC_2163
CC_2184
CC_2193
CC_2240
CC_2256
CC_2308
CC_2334
CC_2338
CC_2351
CC_2376
CC_2413
CC_2424
CC_2442
CC_2445
CC_2450
CC_2452
CC_2471
CC_2475
CC_2499
CC_2519
CC_2525
CC_2571
CC_2574
CC_2597
CC_2602
CC_2621
CC_2624
CC_2665
CC_2698
CC_2705
CC_2718
CC_2719
CC_2720
CC_2731
CC_2732
CC_2738
CC_2739
CC_2756
CC_2769
CC_2800
CC_2850
CC_2865
CC_2875
CC_2878
CC_2907
CC_2916
CC_2949
CC_3050
CC_3055
CC_3251
CC_3302
CC_3318
CC_3342
CC_3360
CC_3429
CC_3437
CC_3438
CC_3451
CC_3453
CC_3463
CC_3479
CC_3517
CC_3519
CC_3547
CC_3548
CC_3553
CC_3554
CC_3608
CC_3665
CC_3671
CC_3700

Wednesday, May 21, 2014

The Pseudogene Hall of Fame

For "budget reasons" (supposedly), the Joint Genome Institute now requires all users to be registered and approved before they can use the (taxpayer-funded) https://img.jgi.doe.gov site. Fortunately, my registration was approved and I can use the excellent online genomics tools there, one of which produced the following table of Top Ten Organisms by Pseudogene Count.

Organism
Genes
Pseudogenes
60745
11398
38115
9178
38612
7186
6325
4011
31392
3818
5379
1670
4760
1589
5775
1523
5016
1507
2750
1086

Again: Don't expect the above links to work if you're not a registered JGI user. (I don't know if they will work for you or not.) This list was automagically generated by the Department of Energy's Joint Genomes Institute and I thought you might get the same kick out of it that I got. It's eye-opening to see the ratio of pseudogenes to "normal" genes in these organisms. Isn't it?

Taking the top three spots are mouse, rat, and humans. (Note: These counts should be taken with a bit of caution, as some have estimated the number of pseudogenes in the human genome to be much higher than the 7,186 shown here.) All the other spots in the chart except Arabidopsis (which is a leafy plant) are bacteria. The leprosy bacterium, which I've written about before, comes in tenth place.

If you're not familiar with the concept of pseudogenes, you might want to look at this post. Basically we're talking about genes that are thought to be disabled and no longer functional in the normal sense, although they may well be functional in some as-yet-unappreciated sense. (Otherwise, evolutionary theory says they should have been eliminated from most genomes eons ago.)

Personally, I believe pseudogenes are as much a feature of DNA as regular genes; certainly in higher life forms, they occur in great numbers. The vast majority of bacterial genomes in public databases are shown as having no pseudogenes. I find that (how shall I say?) not at all credible. Some day it will be obvious that almost every genome harbors pseudogenes; we simply lack smart enough software to detect them all right now.

Tuesday, May 20, 2014

More Secrets of the Virus World

It's generally conceded that viruses evolve more rapidly than host cells, but the rates vary tremendously depending on the type of virus. Generally, large DNA viruses that infect algae (the phycodnavirus family) are considered to have some of the slowest rates of change, whereas the fastest-to-change viruses tend to be small RNA viruses that infect animal cells (e.g., HIV). In terms of substitutions per nucleotide per cell infection (s/n/c), one recent study found rates of 10−8 to 10−6 s/n/c for DNA viruses and 10−6 to 10−4 s/n/c for RNA viruses, which means the fastest-mutating viruses change 10,000 times faster than the slowest-mutating viruses.

Given the ultra-rapid rate of change of RNA viruses and their generally impressive level of adaptation to host-cell environments, one might expect a virus like HIV-2 to show a codon usage bias similar to that of the host. And that's approximately true.

HIV-2 codon usage (left), in DNA format (T for U), versus overall human-cell codon usage (right).

The above graph shows codon usage for HIV-2 on the left and codon usage for human cells on the right. (HIV is an RNA virus, but codons are shown here in DNA format, with T in place of U.) R-squared/adjusted comes to 0.2204, so we can't very well say confidently that the codon values are highly correlated. But if you look at the smaller bars (not the "peaky" ones), they tend to taper down on the left, just as on the right.

It might be instructive to go from one of the fastest-changing viruses in the biosphere (HIV) to one of the slowest, and see how its codon usage compares to that of its host. This time, we're looking at the large DNA virus known as PBCV-1 (left) versus its Chlorella host (an alga, right):

Codon usage in Paramecium bursaria Chlorella virus 1 (PBCV-1), left, and Chlorella variabilis strain NC64A, right.
These two data sets are not only not correlated, they appear to be anticorrelated, which is quite unexpected. Bear in mind, PBCV-1 is relatively large, with a genome of 330,601 base pairs encoding hundreds of proteins (and ten tRNAs). Thus the pattern shown here isn't likely to be random noise. Note that PBCV-1 has a genomic G+C content of 40%, versus 61% for the host, which is a pretty sizable separation. It's almost as if PBCV-1 has spent part of its life coexisting with an entirely different host.

Which brings me to the final and most intriguing (I might even say shocking) graphic, which compares codon usage in PBCV-1 virus with codon usage in Chlorella's own host, Paramecium.

Codon usage in PBCV-1 virus (left) and Paramecium (right).
Recall that when it is not free-living on its own, the tiny unicellular Chlorella alga has an endosymbiotic relationship with the comparatively much larger unicellular ciliate protist, Paramecium. That is to say, Chlorella can live inside Paramecium. Chlorella allows Paramecium to thrive in high-sunlight/low-nutrient conditions, whereas Paramecium, in return, gives the non-motile Chlorella free transportation and protection against viruses. (PBCV-1 can infect free-living Chlorella, but does not infect Chlorella living inside Paramecium.) As far as I know, no one has ever reported that PBCV-1 virus can infect Paramecium. Supposedly, it infects only free-living Chlorella And yet, we find that the pattern of codon usage in PBCV-1 is very strongly correlated with the pattern of codon usage in Paramecium. (R-squared/adjusted: 0.527.)

Paramecium filled with Chlorella cells.
This chart is a real shocker from a couple of standpoints. First, as I say, PBCV-1 virus is not known to infect Paramecium. And yet codon usage patterns in the virus are much more closely aligned to Paramecium's patterns than to Chlorella's. Notice that AAA is the No. 1 most-used codon in PBCV-1 as well as Paramecium. Seven of Paramecium's top ten codons are in PBCV-1's top ten.

Secondly, Paramecium doesn't use the standard genetic code! It uses the Ciliate Code (Translation Table 6), in which TAA and TAG encode glutamine instead of serving as stop codons. (TGA is the one and only stop codon in Table 6.) If Paramecium used the standard genetic code, the alignment of the two organisms would be even stronger.

Also interesting is that PBCV-1 and Paramecium are quite far apart in G+C content (the former is 40%, the latter is 28%).

Perhaps at some point in its past, PBCV-1 had a wider host range, one that included Paramecium. It's possible that even today, it has hosts other than Chlorella that have yet to be observed experimentally. Certainly, the pattern of codon usage is consistent with such an idea.

Sunday, May 18, 2014

Virus Genes Don't Come from Host Genes

There's a school of thought that says that viruses originated as escaped constellations of host genes. Virus genes have to originate from somewhere. One theory is that they started with host genes.

Trouble is, there's precious little evidence that viral genes originated from host genes, and plenty of evidence to the contrary. It may actually be that host genes came from viruses.

To say viral genes derive from host genes is like saying hemorrhoids derive from earlobes. Any resemblance is, at best, superficial.

In a previous post, I showed data for the relatively large phylogenetic distance between thymidine kinase genes in phages (viruses that attack bacteria) and their hosts. In one case, I showed that prophage genes (genes from viruses that have succeeded in integrating into the host DNA) are more similar to host genes than lytic-lifestyle phages, but even in the case of temperate phages, I think we have to be honest and say that a prophage is still an example of foreign DNA integrating into a host. (Prophage genes can usually be easily identified by their base composition, which differs noticeably from the base composition of host genes.) Once a prophage becomes fully integrated into the host, its DNA (under the influence of the host repairosome) will tend to ameliorate, taking on the base composition and other characteristics of the host DNA, making it superficially similar to host DNA.

What "other characteristics" does ameliorated DNA take on? Consider codon usage patterns. Recall that the genetic code is set up in such a way that most amino acids correspond to more than one codon (three-letter pattern) in the DNA. Leucine, for example, can be encoded six different ways (namely by base patterns CTA, CTG, CTT, CTC, TTA, and TTG). Likewise, alanine can be encoded four ways (GCA, GCG, GCT, or GCC). But specific organisms develop specific patterns of codon use, preferring certain synonyms over others. For example, Clostridium botulinum (the food-poisoning bug) overwhelmingly prefers to use TTA for leucine (rarely using the other 5 synonyms), whereas E. coli strongly prefers to use CTG (choosing it four-to-one over the next-most-used leucine codon). These codon preference patterns are highly specific to a given species and are thought to be related to the numbers and types of available transfer RNAs (tRNA) in the cell, although frankly it's still an open question whether codon usage adapted to tRNA availability or the reverse.

The idea that viruses mutate rapidly and evolve in close harmony with the hosts on which their reproduction depends suggests that virus codon preferences should match those of the host. (This would be particularly true if virus genes come from host genes.) Remember that a virus has no ribosomal machinery and must rely on the host's protein-making equipment in order to survive. Therefore it would make sense for a virus to adapt its codon usage patterns to the patterns most favored by the host equipment.

That's not what we find. When we look at the codon usage patterns of phage T4 (a classic enterobacterial phage) versus E. coli's codon usage, we find that they differ substantially:

Codon usage frequencies for T4 phage (left) and E. coli B (right).
In this graphic, host-cell codon usage frequencies are on the right while corresponding T4 virus frequencies are on the left. Note that the T4 chromosome encodes 274 protein genes, encompassing over 50,000 codons, so the graphic is based on fairly solid numbers; variations from E. coli can't be accounted for simply by statistical noise.

T4's codon preferences are so different from the host cell's, the T4 phage brings with it genes for 8 types of tRNA. But the differences in codon usage go well beyond 8 codons, so the presence of tRNA genes in T4 DNA doesn't, by itself, explain the divergence of the data.

But what about temperate phages, like Fels-2 (a prophage in the Salmonella genome)? Since prophage genes are, in effect, a permanent part of the host genome, we would expect to see some amelioration of codon usage. And in fact, that is what we do see:

Codon usage in Fels-2 phage (left) and Salmonella typhimurium LT2 (right).
Here, we see that the codon usage patterns of Fels-2 and its host are quite similar. The differences are easily accounted for by the fact that Fels-2 has only 47 protein-coding genes, and the amino acid composition of those genes is probably different enough from "average" host genes to sway the usage stats to the degree shown here. Nevertheless, codon usage patterns aren't sufficient to tell us where Fels-2 genes came from originally. That's still an open question. Like the Martians in War of the Worlds, Fels-2 genes probably came from "somewhere else."

Robbie: "What, you mean, like Europe?"

Tom Cruise character: "No, Robbie. Not like Europe."

Saturday, May 17, 2014

Evolution of Prophage Genes

Viruses have two modes of reproductive existence. In the familiar lytic cycle, the virus infects a cell, replicates itself until the cell bursts, and hundreds (or thousands) of virions are produced. But there is also a lysogenic mode of viral existence, in which the virus inserts a copy of its DNA in the host's own DNA. The viral DNA thus inserted becomes known as a prophage, which can remain dormant for long periods of time. The prophage can often be induced to enter a lytic cycle by exposure of cells to hydrogen peroxide or Mitomycin C. (Induction of phages in this fashion is thought to occur when a phage repressor protein is cleaved by recA after the latter is upregulated in the SOS response.)

Prophage genes are seen in a wide variety of bacteria (a 2008 paper estimated that over 60% of bacterial genomes contain prophage genes), and in fact human DNA is thought to contain at least 8% retroviral gene remnants. There's reason to suspect that certain large DNA animal viruses (such as herpes and vaccinia) have a lysogenic cycle. Certainly, viruses like varicella zoster (which can produce shingles many years after a person's initial infection) can remain dormant for decades before suddenly undergoing induction to a lytic phase.

Viruses that live an exclusively lytic lifecycle have relatively few opportunities to co-evolve with the host, because they spend little time in the host. Such a virus might spend years "hanging around" in the environment before encountering a host cell; then the lytic reproductive cycle may last only minutes or hours, and it's back to "hanging around" in the environment.

The situation is much different for a temperate virus (i.e., one that has a lysogenic cycle). A lysogenic virus essentially becomes an integral, first-class component of the host DNA and undergoes the same replication and repair processes that apply to host DNA. Accordingly, we should expect to see a much different pattern of evolution in the genes of lysogenic viruses (or prophages). And indeed we do.

The phylogenetic tree below was prepared using viral (phage) and bacterial genes for DNA adenine methylase (dam), an enzyme involved in DNA repair and replication. What's interesting about this gene is that many bacteria have their own (native) copies of this gene plus a prophage copy. And they differ, but not as much as, say, lytic-phage thymidine kinase versus native bacterial TK. (I showed phylo-trees for viral and bacterial TK enzymes in a prior post. If you'll recall, these enzymes differ so drastically that it's not at all clear that one derives from the other, ancestrally.)

DNA adenine methylase genes from three enteric bacteria and two phages (marked with asterisks). The top branch shows very close homology between prophage genes and their bacterial paralogues. The bottom branch shows that the native bacterial isoform of the enzyme is not as closely related to the prophage version(s).
With the dam genes, we see an interesting segregation pattern. There are two main branches to the phylo-tree. In the upper branch are the phage dam genes along with bacterial paralogues of these genes. The bottom branch shows how the non-paralogous (non-prophage) dam genes segregate.

To make these relationships clearer, here's a chart showing the overall G+C content as well as the GC3 (G+C content at codon base 3) for the various genes. The entries shaded in grey represent prophage genes. Notice that the G+C percentages are significantly lower for the prophage genes, but are higher than in free-living lytic-cycle phages (where GC3, in particular, is often less than 20%).


DNA adenine methylase genes for enteric bacteria and their temperate phages. Base-composition stats for prophage isoforms are shown in grey.
Organism Gene G+C GC3
Shigella sp. strain D9 ZP_05434596.1 49.60% 55.90%
E. coli EHW52521.1 49.60% 55.90%
S. enterica AAL22346.1 49.10% 54.50%
E. coli EHW55384.1 47.20% 47.90%
Salmonella phage RE-2010 YP_007003503.1 46.50% 46.20%
Shigella sp. strain D9 EGJ07993.1 46.20% 46.30%
S. enterica ETB92379.1 46.20% 43.00%
Fels-2 phage YP_001718754.1 46.20% 43.00%

If you compare the phylo tree shown further above with the phylo tree in my earlier post about thymidine kinase genes, you'll note that the prophage dam genes cluster very tightly with bacterial versions of these genes. That's because, as a fully integrated part of the genome, the prophage genes benefit from the host's DNA repairosome. They evolve gradually over long periods of time by the usual mechanisms. The genes are notably host-like because they're continuously repaired and groomed in the same manner as host DNA.

The takeaway here is: If you create a phylo-tree for a set of genes from hosts and viruses, and the genes cluster tightly with host versions, you're probably looking at the result of longterm lysogeny. On the other hand, if the virus genes do not cluster with host genes (as they usually don't!), that means you're looking at viruses that have a predominantly lytic mode of existence; viruses that probably got their genes from a far-distant ancestor of the modern-day host, if not from a primordial precellular precursor of some kind.

Thursday, May 15, 2014

How to Build Your Own Phylo Trees with Mega6

Mega6 is a popular freeware program for creating phylogenetic trees and sequence alignments, analyzing gene sequence data, performing various calculations (such as Tajima's D), computing pairwise distances between sequences, and doing lots of other types of genetic analysis. This program is widely used and is based partly on the work of Masatoshi Nei, the famed Penn State molecular geneticist and author of the seminal book Mutation-Driven Evolution (which I highly recommend). You need this program if you're into molecular genetics.

In my last post, I showed a phylogenetic tree that I created in Mega6 for thymidine kinases of phages and hosts. You can make trees of this sort yourself in a matter of minutes using Mega6. Let me take you through a quick example.

Here's how to recreate the tree involving thymidine kinase. Further below, I've listed the FASTA sequences you'll need. Obviously, you can obtain your own sequences from Uniprot.org or other sources, if you want to make your own phylo trees based on other sequences.

All you have to do is Copy and Paste FASTA sequences into the Alignment Explorer window of Mega6, then generate a Maximum Likelihood tree (or a Parsimony Tree, or whatever kind of tree your prefer).

Here's the exact procedure:
  1. Fire up Mega6.
  2. Click the Align button in the main taskbar. (See screenshot below.)
  3. Choose Edit/Build Alignment from the menu that pops up. A new window will open. (Note: A dialog may ask you if you are creating a new data set; answer Yes. A dialog may also appear asking whether the alignment you're creating will involve DNA, or Proteins. If you're using the amino-acid sequences shown further below, click Proteins.)
  4. In Mega6, click the Align button on the left side of the taskbar. Select Edit/Build Alignment.
  5. Paste FASTA sequences into the Alignment Explorer window. 
  6. Select All (Control-A).
  7. Choose Align by ClustalW from the Alignment menu. An alignment options dialog will appear. Accept all defaults by clicking OK.
  8. After the alignment operation finishes (it takes 5 seconds), go to the menu and choose Data > Export Alignment > Mega Format. Save the *.meg file to disk.
  9. Now go back to the main Mega6 window and click the Phylogeny button in the taskbar. A menu will drop down.
  10. Select the first item: Construct/Test Maximum Likelihood Tree. A dialog will appear asking if you want to use the currently active data set. Click No. This will bring up a file-navigation dialog. Use that dialog to find the *.meg file you created in Step 7. Select that file and click Open.
  11. A tree options dialog will appear. If you just want to see a tree quickly, accept the defaults and click OK. The tree will take less than 10 seconds to generate. If you want to do a test of phylogeny (this part isn't obvious!), click the item to the right of "Test of Phylogeny" (see the graphic below) and choose "Bootstrap method," then set the number of bootstraps using the dropdown menu (see screenshot below).
  12. Click the Compute button to build the tree.
If you want to do bootstrap validation of tree nodes, you have to click the yellow area to the right of Test of Phylogeny. (See red arrow above.) Choose Bootstrap Method. Then set Number of Bootstraps to 500.

Note that if you choose to do a bootstrap test of phylogeny, the tree may take 5 minutes or more (possibly much more) to build, depending on how many nodes it contains. Bootstrapping is a compute-intensive operation. The idea behind bootstrap testing is that node assignments in phylo trees are often uncertain, and one way to check the robustness of a given assignment is to systematically introduce noise into the data to see how readily a node can be made to jump branches. A node that can easily be tricked into jumping to a different spot in the tree is untrustworthy. The bootstrap test attempts to quantify the degree of reliability of the node assignments. Usually, you want to do at least several hundred tests (500 is considered adequate). If a given node jumps branches in half the tests, the tree will carry "50" (meaning 50% confidence) at the top of the branch, meaning there's a only 50% certainty that that particular node assignment is correct as to branch location. A tree where all the branches have numbers greater than 50 can usually be considered reliable. 

Mega6 will output Maximum Likelihood trees, parsimony trees, and other types of trees, and the program will do an amazing variety of calculations and statistical tests. Most times, you can save analyses in Excel format, which of course is a godsend in case you need to do additional analysis that can't be done in Mega6. The documentation contains many helpful tutorials; be sure to give it a look.

If I had a wish-list for Mega6, it would be quite short. Mainly I'd like to be able to see quick graphic summaries of things like the ratio of synonymous to non-synonymous mutations in two DNA sequences. (The data for this is available via the HyPhy command under the Selection taskbar button, but only as raw spreadsheet data; you have to sum the columns yourself in Excel to get certain kinds of summaries, and if you want a graph, you have to create it yourself. This is hardly a major drawback. Still, it would be nice to see more summary data, quickly, in graphical form.) When you align two or more genes in the Alignment Editor, it shows asterisks above the identical nucleotides, but doesn't show percent identity anywhere, nor percent "positives" for amino acid data. The Alignment Editor also doesn't respond properly (worse: it responds inappropriately) to mouse-wheel actions, jumping to the end of a file horizontally when you wanted to wheel-scroll vertically.

Also mildly annoying: the tree renderings are bitmaps; I would much prefer to see SVG (Scalable Vector Graphics) format, which in addition to being infinite-resolution (vector format) also allows easy editing of line widths, colors, fonts, labels, etc. in a simple text editor. As it is now, to edit line widths or colors in phylo-trees, you have to drag out Photoshop.

But overall, I have few significant complaints (and much praise) for Mega6. It's an immensely powerful program, it's fast, it's quite intuitive, and the best part is, it's free. (For a more detailed commentary on the program's design philosophy and capabilities, see this excellent 2011 paper by Tamura, Nei, et al. It was written at the time of Mega5, but applies equally to Mega6.)

Below are the FASTA sequences used in making the phylo tree for yesterday's post. You can Cut and Paste these sequences directly into Mega6's Alignment Explorer:

>sp|P13300|KITH_BPT4 Enterobacteria phage T4 
MASLIFTYAAMNAGKSASLLIAAHNYKERGMSVLVLKPAIDTRDSVCEVVSRIGIKQEAN
IITDDMDIFEFYKWAEAQKDIHCVFVDEAQFLKTEQVHQLSRIVDTYNVPVMAYGLRTDF
AGKLFEGSKELLAIADKLIELKAVCHCGKKAIMTARLMEDGTPVKEGNQICIGDEIYVSL
CRKHWNELTKKLG
>tr|S5MKX8|S5MKX8_9CAUD Yersinia phage PST 
MASLIFTYAAMNAGKSASLLTAAHNYKERGMSVLVLKPAIDTRDSVCEVVSRIGIKQEAN
IITDDMDIFEFYKWAEAQKDIHCVFVDEAQFLKTEQVHQLSRIVDTYNVPVMAYGLRTDF
AGKLFEGSKELLAIADKLIELKAVCHCGKKAIMTARLMEDGTPVKEGNQICIGDEIYVSL
CRKHWNELTKKLG
>tr|I7KRQ7|I7KRQ7_9CAUD Yersinia phage phiD1
MASLIFTYAAMNAGKSASLLTAAHNYKERGMSVLVLKPAIDTRDSVCEVVSRIGIKQEAN
IITDDMDIFEFYKWAEAQKDIHCVFVDEAQFLKTEQVHQLSRIVDTYNVPVMAYGLRTDF
AGKLFEGSKELLAIADKLIELKAVCHCGKKAIMTARLMEDGTPVKEGNQICIGDEIYVSL
CRKHWNELTKKLG
>tr|F2VXC8|F2VXC8_9CAUD Shigella phage Shfl2 
MASLIFTYAAMNAGKSASLLTAAHNYKERGMSVLVLKPAIDTRDSVCEVVSRIGIKQEAN
IITDDMDIFEFYKWAEAQKDIHCVFVDEAQFLKTEQVHQLSRIVDTYNVPVMAYGLRTDF
AGKLFEGSKELLAIADKLIELKAVCHCGKKAIMTARLMEDGTPVKEGNQICIGDEIYVSL
CRKHWNELTKKLG
>tr|I7J3X5|I7J3X5_9CAUD Yersinia phage phiR1-RT 
MAQLYYNYAAMNSGKSTSLLSVAHNYKERGMGTLVMKPAVDTRDSSSEIVSRIGIKLEAN
VIHPGMNIVEFFKWAQTQRDIHCVLIDEAQFLEPAQVQDLCKIVDIYNVPVMAYGLRTDF
RGELFPGSKALLQCADKLVELKGVCHCGKKATMVARIDINGNAVKDGAQIELGGEDKYVS
LCRKHWCEMLELY
>sp|Q98HR4|KITH_RHILO Rhizobium loti (strain MAFF303099) 
MAKLYFNYATMNAGKTTMLLQASYNYRERGMTTMLFVAGHYRKGDSGLISSRIGLETEAE
MFRDGDDLFARVAEHHDHTTVHCVFVDEAQFLEEEQVWQLARIADRLNIPVMCYGLRTDF
QGKLFSGSRALLAIADDLREVRTICRCGRKATMVVRLGADGKVARQGEQVAIGKDVYVSL
CRRHWEEEMGRAAPDDFIGFMKS
>tr|F0LSI7|F0LSI7_VIBFN Vibrio furnissii (strain DSM 14383 / NCTC 11218)
MAQMYFYYSAMNAGKSTTLLQSSFNYQERGMTPVIFTAAIDDRFGVGKVSSRIGLEADAH
LFTSDTNLFDAIKQLHQNEKRHCVLVDECQFLTKEQVYQLTEVVDKLDIPVLCYGLRTDF
LGELFEGSKYLLSWADKLIELKTICHCGRKANMVIRTDEHGNAISEGDQVAIGGNDKYVS
VCRQHYKEALGR
>sp|Q5E4F2|KITH_VIBF1 Vibrio fischeri (strain ATCC 700601 / ES114) 
MAQMYFYYSAMNAGKSTTLLQSSFNYQERGMNPAIFTAAIDDRYGVGKVSSRIGLHAEAH
LFNKETNVFDAIKELHEAEKLHCVLIDECQFLTKEQVYQLTEVVDKLNIPALCYGLRTDF
LGELFEGSKYLLSWADKLVELKTICHCGRKANMVIRTDEHGVAIADGDQVAIGGNELYVS
VCRRHYKEALGK
>tr|V2ABB3|V2ABB3_SALET Salmonella enterica subsp. enterica serovar Gaminara str. ATCC BAA-711
MAQLYFYYSAMNAGKSTALLQSSYNYQERGMRTVVYTAEIDDRFGAGKVSSRIGLSSPAK
LFNQNTSLFEEIRAESARQTIHCVLVDESQFLTRQQVYQLSEVVDKLDIPVLCYGLRTDF
RGELFVGSQYLLAWSDKLVELKTICFCGRKASMVLRLDQDGRPYNEGEQVVIGGNERYVS
VCRKHYKDALEEGSLTAIQERLR
>tr|I6H5M2|I6H5M2_SHIFL Shigella flexneri 1235-66
MAQLYFYYSAMNAGKSTALLQSSYNYQERGMRAVVYTAEIDDRFGAGKVSSRIGLSSPAK
LFNQNSSLFEEIRAENAQQRIHCVLVDESQFLTRQQVYELSEVVDQLDIPVLCYGLRTDF
RGELFGGSEYLLAWSDKLVELKTICFCGRKASMVLRLDQAGRPYNEGEQVVIGGNERYVS
VCRKHYKEAQSEGSLTAIQERHSHD
>sp|P23331|KITH_ECOLI Escherichia coli (strain K12)
MAQLYFYYSAMNAGKSTALLQSSYNYQERGMRTVVYTAEIDDRFGAGKVSSRIGLSSPAK
LFNQNSSLFDEIRAEHEQQAIHCVLVDECQFLTRQQVYELSEVVDQLDIPVLCYGLRTDF
RGELFIGSQYLLAWSDKLVELKTICFCGRKASMVLRLDQAGRPYNEGEQVVIGGNERYVS
VCRKHYKEALQVDSLTAIQERHRHD
>sp|Q66AM8|KITH_YERPS Yersinia pseudotuberculosis serotype I (strain IP32953
MAQLYFYYSAMNAGKSTALLQSSYNYQERGMRTLVFTAEIDNRFGVGTVSSRIGLSSQAQ
LYNSGTSLLSIIAAEHQDTPIHCILLDECQFLTKEQVQELCQVVDELHLPVLCYGLRTDF
LGELFPGSKYLLAWADKLVELKTICHCGRKANMVLRLDEQGRAVHNGEQVVIGGNESYVS
VCRRHYKEAIKAACCS
>tr|B4EXS0|B4EXS0_PROMH Proteus mirabilis (strain HI4320)
MAQLYFYYSAMNAGKSTSLLQSSYNYNERGMRTLIFTAAIDTRFAKGKVTSRIGLSADAL
LFSDDMNIRDAILAENNKEPIHCVLIDECQFLTKAHVEQLCEITDSYDIPVLTYGLRTDF
RGELFTGSAYLLAWADKLVELKTVCYCGRKANKVLRLAANGKVLSDGAQVEIGGNEKYVS
VCRKHYTEATLKGRIEQL
>tr|G0GHM0|G0GHM0_KLEPN Klebsiella pneumoniae KCTC 2242
MAQLYFYYSAMNAGKSTALLQSSYNYQERGMRTVVYTAEIDDRFGAGKVSSRIGLSSPAR
LYNPQTSLFDDIAAEHQLKPIHCVLVDESQFLTREQVHELSEVVDTLDIPVLCYGLRTDF
RGELFTGSQYLLAWSDKLVELKTICFCGRKASMVLRLDQEGRPYNEGEQVVIGGNERYVS
VCRKHYKEALSVGSLTKVQNQHRPC
>tr|F7YDB7|F7YDB7_MESOW Mesorhizobium opportunistum (strain LMG 24607 / HAMBI 3007 / WSM2075)
MAKLYFHYATMNAGKTTMLLQASYNYRERGMTTMLFVAGHYRKGDSGLISSRIGLETEAE
MFRDGDDLFARVAEHHQRSAVHCVFVDEAQFLEEEQVWQLARIADRLNIPVMCYGLRTDF
QGKLFSGSRALLAIADDLREVRTICRCGRKATMVVRLGPDGKVARQGEQVAIGKDVYVSL
CRRHWEEEMGRAAPDDFIGFVRN
>tr|H0H7T7|H0H7T7_RHIRD Agrobacterium tumefaciens 5A
MAKLYFNYAAMNAGKSTMLLQASYNYHERGMRTLIFTAAFDDRAGFGRVASRIGLSSDAR
TFDANTDIFSEVEALHAEAPVACVFIDEANFLSEHHVWQLAGIADRLNIPVMAYGLRTDF
QGKLFPASRELLAIADELREIRTICHCGRKATMVARFDNEGNVVKEGAQIDVGGNEKYVS
FCRRHWVETVKGD