Friday, May 16, 2014

Evolution of Viral Genes vs. Host Genes

Some viral genes show evidence of having an ancient origin, predating the divergence of host organisms into different species. The enzyme thymidine kinase (TK), for example, has followed a certain set of evolutionary paths in enteric bacteria: the E. coli version differs slightly from the Salmonella version, which differs from the Yersinia version or the Shigella version, but they all show evidence of common descent from ancestors that included Vibrio and Proteus. By contrast, the bacteriophages (viruses that attack enteric bacteria) have their own thymidine kinases that followed different evolutionary trajectories, resulting in genes with quite different G+C contents. The thymidine kinase in today's T4 phage shows a closer phylogenetic relation to the TK of Rhizobium and Agrobacterium than to that of E. coli or Salmonella. I presented a phylogenetic tree to this effect in an earlier post.
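
For anyone who wants to compare G+C contents directly, here is a minimal Python sketch. The function name `gc_content` is my own (hypothetical) helper, not part of any library, and it assumes you already have the gene's DNA sequence as a plain string.

```python
def gc_content(seq):
    """Return the G+C content of a DNA sequence as a percentage.

    Only unambiguous bases (A, C, G, T) are counted, so ambiguity
    codes such as N do not distort the result.
    """
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / acgt
```

Running this on a host gene and on the corresponding phage gene gives two percentages that can be compared side by side.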

But sometimes, viral genes follow host-gene evolution more closely. The thymidine kinases of eukaryotic organisms and their viruses provide a good example of this. Consider the following tree developed from thymidine kinase genes of guinea pig (Cavia porcellus), bull (Bos taurus), cowpox and swinepox viruses, amoeba species (Entamoeba), amoeba Mimivirus, two algal viruses, and finally, two strains of the alga Micromonas.

Thymidine kinase genes for cowpox and swinepox viruses occur in the same sub-branch with bull (Bos taurus) and guinea pig (Cavia porcellus) genes; see the top four lines. Likewise, the Mimivirus TK gene is not far from Entamoeba, and the TK genes for algal viruses cluster near the TK genes of the alga Micromonas. Branch confidence is high (after 500 bootstraps, no nodes separate with less than 67% confidence). The 0.1 scale marker (lower left) represents substitutions per site, as represented by node depth (horizontal line length). Tree generated using MEGA6 freeware.

If you're not familiar with interpreting these trees, node depth (horizontal line length) is proportional to the fraction of amino acid substitutions per site, with a line length of about a centimeter representing 10 substitutions per 100 amino acids (note the 0.1 marker, lower left). The numbers at the branch points (100, 91, 86, etc.) represent the confidence that nodes to the right were located in the proper branches. These numbers are percentages, representing the share of bootstrap trials that resulted in no "node-jumping." (Bootstrap testing is a way of introducing random noise into the data, by resampling the columns of the sequence alignment with replacement, to try to "trick" nodes into jumping to a new spot in the tree. If a little noise makes a node switch locations, the original location is suspect.) This tree was tested with 500 bootstrap trials. Overall, we can be fairly certain the node locations are correct.
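
To make the bootstrap idea concrete, here is a rough Python sketch. It is not what MEGA6 does internally: a real bootstrap rebuilds the whole tree for every replicate, whereas this simplified stand-in only asks whether two sequences remain each other's nearest neighbors after the alignment columns are resampled. All function names are my own, and the toy alignment is made up for illustration (these are not real TK sequences).

```python
import random

def p_distance(a, b):
    """Fraction of aligned sites at which two sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)

def resample_columns(alignment, rng):
    """One bootstrap replicate: draw columns at random, with replacement."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}

def nearest_neighbor(alignment, name):
    """Name of the sequence closest to `name` (ties go to the first listed)."""
    others = [n for n in alignment if n != name]
    return min(others, key=lambda n: p_distance(alignment[name], alignment[n]))

def bootstrap_support(alignment, a, b, replicates=500, seed=0):
    """Percentage of replicates in which a and b stay mutual nearest neighbors."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        rep = resample_columns(alignment, rng)
        if nearest_neighbor(rep, a) == b and nearest_neighbor(rep, b) == a:
            hits += 1
    return 100.0 * hits / replicates

# Made-up toy alignment, for illustration only.
toy = {
    "host_A":  "MKLVTGPAAQ",
    "virus_A": "MKLVTGPSAQ",
    "host_B":  "MRIVSGQAAE",
    "virus_B": "MRIVSGQATE",
}

support_pair = bootstrap_support(toy, "host_A", "virus_A")
support_wrong = bootstrap_support(toy, "host_A", "host_B")
```

A grouping that survives nearly all replicates (like `host_A` with `virus_A` here) is well supported; a grouping that rarely survives (like `host_A` with `host_B`) is not.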

What does the tree mean? Unlike the situation with bacterial and bacteriophage kinases (see earlier post), the various thymidine kinases of the DNA viruses shown here do tend to evolve in parallel with host equivalents. However, this doesn't automatically mean the viral genes originated with host genes, because (for example) the amoeba version of thymidine kinase shares only 39% amino-acid sequence identity with the Mimivirus version. That's a lot of divergence. It means the viral version of the gene (no doubt highly optimized for viral needs) could have come from a very-long-ago ancestor of the present-day host. Likewise, the Bos taurus version of the TK gene has only 69% similarity with the cowpox version. By comparison, the bovine gene is 88% similar to the human gene. The cowpox thymidine kinase gene is further away (evolutionarily) from the host gene than human TK is from rainbow trout TK.
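
For reference, a percent-identity figure like the ones above can be computed from two already-aligned protein sequences along these lines. This is a minimal sketch under my own assumptions: the name `percent_identity` is hypothetical, gaps are written as "-", and gapped positions are skipped (tools differ on that last point, so numbers can vary slightly between programs). Note that it measures identity (exact matches) only, not similarity in the looser sense that also counts conservative substitutions.

```python
def percent_identity(a, b, gap="-"):
    """Percent of aligned, ungapped positions where two sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip positions where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * matches / compared
```

For example, `percent_identity("MKLV", "MKIV")` compares four positions, finds three matches, and returns 75.0.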

It's prudent to conclude that viral genes are not simply orthologues of host-gene counterparts; they certainly don't represent recent gene transfer events (because there's way too much sequence divergence, even accounting for faster evolution in DNA viruses than in host DNA). The viral genes come from "somewhere else," probably a primordial past. Any resemblance to present-day host genes is largely incidental. 

The great majority of so-called "virus hallmark genes" (involving things like capsid proteins) have no counterpart at all in modern host cells, with very rare exceptions. Large DNA viruses may have gotten their start a long, long time ago, possibly in the pre-cellular world, where communities of free-ranging genes (some "selfish," some less so) coexisted in a common broth, with the only "cells" being micro-compartments in ocean-bed minerals.

RNA viruses are an entirely different matter. It's well known that RNA viruses evolve thousands of times faster than other viruses and often evolve in close harmony with hosts. Even so, faster evolution doesn't mean RNA virus genes came from host cells. RNA-virus genes, too, are probably of ancient provenance, maybe predating cellular life. As Koonin et al. said in 2006:
"The existence of several genes that are central to virus replication and structure, are shared by a broad variety of viruses but are missing from cellular genomes (virus hallmark genes) suggests the model of an ancient virus world, a flow of virus-specific genes that went uninterrupted from the precellular stage of life's evolution to this day. This concept is tightly linked to two key conjectures on evolution of cells: existence of a complex, precellular, compartmentalized but extensively mixing and recombining pool of genes, and origin of the eukaryotic cell by archaeo-bacterial fusion. The virus world concept and these models of major transitions in the evolution of cells provide complementary pieces of an emerging coherent picture of life's history."