Friday, July 12, 2013

Do-It-Yourself Phylogenetic Trees

I've been doing a lot of desktop science lately, and I'm happy to report that superb, easy-to-use online tools exist for creating your own phylogenetic trees based on gene similarities, something that's non-trivial to implement yourself.

The other day, I speculated that the fruit-fly Ogg1 gene, which encodes an enzyme designed to repair oxidatively damaged guanine residues in DNA, might derive from Archaea. The Archaea (in case you're not a microbiologist) comprise one of three super-kingdoms in the tree of life. Basically, all life on earth can be classified as either Archaeal, Eukaryotic, or Eubacterial. The Eubacteria are "true bacteria": they're what you and I think of when we think "bacteria." (So, think Staphylococcus and tetanus bacteria and E. coli and all the rest.) The Eukaryota are higher life forms, starting with yeast and fungi and algae and plankton, progressing up through grass and corn and pine trees, worms and rabbits and donkeys, all the way to the highest life form of all, Stephen Colbert. (A little joke there.) Eukaryotes have big, complex cells with a distinct nucleus, complex organelles (like mitochondria and chloroplasts), and a huge amount of DNA packaged into pairs of chromosomes.

Archaea look a lot like bacteria (they're tiny and lack a distinct nucleus, organelles, etc.), and were in fact considered bacteria until a few decades ago. But in 1977, Carl Woese and George E. Fox provided persuasive evidence that members of this group of organisms were so different in genetic profile (not to mention lifestyle) that they deserved their own taxonomic domain. Thus, we now recognize certain bacteria-like creatures as Archaea.

The technical considerations behind the distinction between bacteria and archaeons are rather deep and have to do with codon usage patterns, ribosomal RNA structure, cell-wall details, lipid metabolism, and other esoterica, but one distinguishing feature of archaeons that's easy to understand is their willingness to live under harsh conditions. Archaeal species tend to be what we call extremophiles: They usually (not always) take up residence in places that are incredibly salty, or incredibly hot, or incredibly alkaline or acidic.

While it's generally agreed that eukaryotes arose after Archaea and bacteria appeared, it's by no means clear whether Archaea and bacteria branched off independently from a common ancestor, or perhaps one arose from the other. (A popular theory right now is that Archaea arose from gram-positive bacteria and sought refuge in inhospitable habitats to escape the chemical-warfare tactics of the gram-positives.) A complication that makes studying this sort of thing harder is the fact that horizontal gene transfer has been known to happen (with surprising frequency, actually) across domains.

Is it possible to study phylogenetic relationships, yourself, on the desktop? Of course. One way to do it: Obtain the DNA sequences of a given gene as produced by a variety of organisms, then feed those gene sequences to a tool like the tree-making tool at http://www.phylogeny.fr. Voila! Instant phylogeny.

The Ogg1 gene is an interesting case, because although the DNA-repair enzyme encoded by this gene occurs in a wide variety of higher life forms, plus Archaea, it is not widespread among bacteria. Aside from a couple of Spirochaetes and one Bacteroides species, the only bacteria that have this particular gene are the members of class Clostridia (which are all strict anaerobes). Question: Did the Clostridia get this gene from anaerobic Archaea?

Using the excellent online CoGeBlast tool, I was able to build a list of organisms that have Ogg1 and obtain the relevant gene sequences, all with literally just a few mouse clicks. Once you run a search using CoGeBlast, you can check the checkboxes next to organisms in the results list, then select "Phylogenetics" from the dropdown menu at the bottom of the results list. (See screenshot.)


When you click the Go button, a new FastaView window will open up, containing the gene sequences of all the items whose checkboxes you checked in CoGeBlast. At the bottom of this FastaView window, there's a small box that looks like this:


Click the Phylogeny.fr button (red arrow). Immediately, your sequences are sent to the French server, where they'll be converted to a phylogenetic tree in a matter of one to two minutes (usually). The result is a tree that looks something like this:


I've color-coded this tree to make the results easier to interpret. Creating a tree of this kind is not without potential pitfalls, because for one thing, if your DNA sequences are of vastly unequal lengths, the groupings made by Phylogeny.fr are likely to reflect gene lengths more than true phylogeny. For this tree, I did various data checks to make sure we're comparing apples to apples. Even so, a sanity check is in order. Do the groupings make sense? They do, actually. At the very top of the diagram (color-coded in green) we find all the eukaryotes grouped together: fruit-fly (Drosophila), yeast (Saccharomyces), fungus (Aspergillus). At the bottom of the diagram, Clostridium species (purplish red) fall into a subtree of their own, next to a tiny subtree of Methanobrevibacter. This actually makes a good deal of sense, because the two Methanobrevibacter species shown are inhabitants of feces, as are the nearby Clostridium bartlettii and C. diff. The fact that all the salt-loving Archaea members group together (organisms with names starting with 'H') is also indicative of a sound grouping. Overall, the tree looks sound.
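One of the data checks mentioned above, the one for vastly unequal sequence lengths, is easy to do yourself before submitting anything. Here's a minimal sketch in Python. The function names, the 25% tolerance, and the toy sequences are all my own assumptions for illustration; this is not something CoGeBlast or Phylogeny.fr provides.

```python
from statistics import median


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def flag_length_outliers(records, tolerance=0.25):
    """Return names of sequences whose length differs from the median
    length by more than `tolerance` (a fraction of the median)."""
    lengths = {name: len(seq) for name, seq in records.items()}
    mid = median(lengths.values())
    return sorted(
        name for name, n in lengths.items() if abs(n - mid) > tolerance * mid
    )


# Toy sequences, made up for illustration (not real Ogg1 data).
example = """>seq_a
ATGGCTAAAGGTCTGA
>seq_b
ATGGCAAAAGGTCTTA
>seq_c
ATGGCTAAAGGTCTGAATGGCTAAAGGTCTGAATGG
"""

records = parse_fasta(example)
outliers = flag_length_outliers(records)
print(outliers)  # seq_c is more than twice the median length
```

Anything this flags is worth a second look (is it a fragment, a fusion, or the wrong gene?) before it goes into the tree.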

If you're wondering what all the numbers are, the scale bar at the bottom (0.4) shows the branch length that corresponds to 0.4 substitutions per sequence position (very roughly, a 40% difference in sequence). The red numbers on the tree branches are branch support values: roughly, how confident the method is that the sequences under that branch really do group together. Probably the most important thing to know is that the evolutionary distance between any two leaves in the tree is proportional to the sum of the branch lengths connecting them. (The branch lengths are not explicitly specified; you have to eyeball it.) At the top of the diagram, you can see that the branch lengths of the two Drosophila instances are very short. This means they're closely related. By contrast, the branch lengths for Saccharomyces and the ancestor to Drosophila are long, meaning that these organisms are distantly related.
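To make the "sum of branch lengths" idea concrete, here's a small Python sketch that adds up the branches connecting two leaves. The tree layout, the node names, and every branch length below are invented for illustration (remember, the real tree doesn't print its branch lengths), so treat the numbers as placeholders.

```python
def path_to_root(tree, node):
    """Return {ancestor: distance from `node`} for `node` and all its ancestors.

    `tree` maps each child to a (parent, branch_length) pair; the root
    has no entry of its own.
    """
    distances = {node: 0.0}
    total = 0.0
    while node in tree:
        parent, length = tree[node]
        total += length
        distances[parent] = total
        node = parent
    return distances


def leaf_distance(tree, a, b):
    """Sum the branch lengths on the path connecting nodes `a` and `b`."""
    up_from_a = path_to_root(tree, a)
    total = 0.0
    node = b
    # Climb from b until we reach a node that is also an ancestor of a.
    while node not in up_from_a:
        parent, length = tree[node]
        total += length
        node = parent
    return up_from_a[node] + total


# Hypothetical toy tree: child -> (parent, branch length).
toy_tree = {
    "fly_1": ("fly_ancestor", 0.02),
    "fly_2": ("fly_ancestor", 0.03),
    "fly_ancestor": ("root", 0.60),
    "yeast": ("root", 0.75),
}

print(round(leaf_distance(toy_tree, "fly_1", "fly_2"), 2))  # 0.05
print(round(leaf_distance(toy_tree, "fly_1", "yeast"), 2))  # 1.37
```

The two flies are separated by two short branches, while getting from a fly to yeast means walking up one long branch and down another, which is exactly what "distantly related" looks like on the diagram.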

Just to give you an idea of the relatedness, I checked the C. botulinum Ogg1 protein amino-acid sequence against C. tetani, and found 63% identity of amino acids. When I compared C. botulinum's enzyme against C. difficile's, there was 52% identity. With Drosophila there is only 32% identity, and even that applies only to a 46% coverage area (versus 90%+ for C. tetani and C. diff). Bottom line, the Blast-wise relatedness does appear to correspond, in sound fashion, to tree-wise relatedness.
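If you want to see where numbers like "63% identity" and "46% coverage" come from, here's a rough Python sketch that computes both from a pairwise alignment. I'm assuming the usual Blast-style conventions (identity is identical columns divided by alignment length, gaps included; coverage is the share of the query that falls inside the alignment). The function names and the toy peptide strings are mine, not real Ogg1 sequences.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percentage of alignment columns where both sequences have the
    same residue. Gap columns ('-') count toward the alignment length
    but never count as matches."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        return 0.0
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


def query_coverage(aligned_query, query_length):
    """Percentage of the full query sequence that appears in the alignment."""
    aligned_residues = sum(1 for q in aligned_query if q != "-")
    return 100.0 * aligned_residues / query_length


# Toy alignment, made up for illustration.
query_row = "MKT-LLVAG"
subject_row = "MRTALL-AG"
full_query_length = 16  # pretend the whole query protein is 16 residues long

print(round(percent_identity(query_row, subject_row), 1))      # 66.7
print(round(query_coverage(query_row, full_query_length), 1))  # 50.0
```

This is why the two numbers need to be read together: a decent identity figure means much less when it only applies to half of the protein.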

Two things stand out. One is that not all of the Clostridium species group together. (There's a small cluster of Clostridia near the salt-lovers, then a main branch near the methane-producing Archaea. The out-group of Clostridia near the salt-lovers happen to all have chromosomal G+C content of 50% or more, which makes them quite different from the rest of the Clostridia, whose G+C is under 30%.) The other thing that stands out is that it does appear as if Clostridial Ogg1 could be Archaeal in origin, based on the relationship of Methanoplanus and Methanobrevibacter to the main group of Clostridia. (Also, the C. leptum group's Ogg1 may share an ancestor with the halophilic Archaea.) One thing we can say for sure is that Ogg1 is ancient.
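G+C content, by the way, is about the simplest genome statistic there is: count the G's and C's, divide by the total number of bases. A quick Python sketch follows; the function name and the toy sequences are my own, and for a real chromosomal figure you'd run this over the whole genome sequence rather than a ten-base snippet.

```python
def gc_content(seq):
    """Return the percentage of G and C among the unambiguous bases
    (A, C, G, T) in `seq`. Other characters, such as N, are ignored."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


# Toy sequences, made up for illustration.
print(gc_content("ATGCGCGCAT"))  # 60.0
print(gc_content("ATATTAGCAT"))  # 20.0
```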

It's tempting to speculate that the eukaryotes obtained Ogg1 from early mitochondria, and that early mitochondria were actually Archaeal endosymbionts. The first part is easily plausible, because we know that early mitochondria quickly exported most of their DNA to the host nucleus. (Today's mitochondrial DNA is vestigial. Well over 90% of mitochondrial genes are actually in the host nucleus. Things like mitochondrial DNA polymerase have to be translated from nucleus-generated RNA.) Whether or not early mitochondria were Archaeal endosymbionts, no one knows.

Anyway, I hope this shows how easy it is to generate phylogenetic trees from the comfort of a living room sofa, using nothing more than a laptop with a wireless internet connection. Try making your own phylo-trees using CoGeBlast and Phylogeny.fr, and let me know what you find out.