Wednesday, May 14, 2014

Where Do Virus Genes Come From?

There's a memorable moment in War of the Worlds (Spielberg version) when the Tom Cruise character tries to explain to his son, while driving a minivan, that they have to leave town immediately because evil, marauding machines "from somewhere else" are on the rampage. Robbie (the son) says: "What, you mean, like, from Europe?"

That's the image that comes to mind when I try to explain where virus genes come from. They don't come from the mother ship (the host). Oh sure, in some cases they clearly do derive from host genes. But in most cases, they clearly don't. The overwhelming majority of viral genes have no counterparts in host cells, and even for those that do, the genes in question are rarely true host orthologues.

"No, Robbie, not like Europe."
Where do they come from, then?

Short answer: Somewhere else.

Using a program like the excellent (and free) Mega6 ("Molecular Evolutionary Genetic Analysis") you can easily create phylogenetic trees from genetic data (FASTA sequences, protein or DNA), and when you do this for genes that occur in both viruses and host cells, you can see how they separate in phylo-space.
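If you've never looked inside a FASTA file, the format is simple: a header line starting with ">" followed by one or more lines of sequence. Here's a minimal Python sketch of a reader, just to show what a tree-building program takes as input. The function name `parse_fasta` and the two toy records are my own inventions for illustration; they are not real thymidine kinase sequences, and this is not how Mega6 itself reads files.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # header line starts a new record
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


# Toy, made-up records (hypothetical names and residues).
example = """>host_toy
MAQLYFYYSA
MNAGKSTALL
>phage_toy
MASLIFTYAA
"""

sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq), seq)
```

For real work you'd want an established parser rather than a hand-rolled one, but the idea is the same: names on one side, sequences on the other.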

Example: I decided to look at the gene for thymidine kinase, which occurs in most living things plus a certain number of undead, maybe not-quite-living (in the usual sense) things known as viruses. Thymine (T) is, of course, an essential ingredient of DNA. Thymidine kinase converts thymidine (thymine bonded to deoxyribose, on the left, below) to the phosphorylated form (TMP, right) so it can participate in DNA synthesis.  

It's important to note that thymidine (above, left) is not a synthesis product but a breakdown product. The biosynthetic pathway for TMP actually starts with dUMP, which is converted to TMP via thymidylate synthetase, an entirely different enzyme. Where does thymidine come from, then, and why does a cell need thymidine kinase? The answer is that thymidine is a breakdown product of DNA. When you eat plant or animal cells, the DNA in those cells gets broken down, and thymidine is one of the breakdown products. For cells, thymidine is a valuable product to have, so thymidine kinase recovers free thymidine via the reaction shown above. This is a scavenging pathway, in other words. Most organisms have it, but some don't (e.g., Pseudomonas lacks it).

When a virus attacks a cell, there's a lot of DNA turnover as the virus prepares to manufacture its own DNA, so thymidine kinase is a handy enzyme to have around if you're a virus. Many viruses, as a result, bring their own copy of the TK enzyme gene. But is the viral TK gene derived (in some kind of ancestral way) from the host cell's own TK gene? Not necessarily.

When you gather up the amino acid sequence data for a bunch of TK genes from bacteria and the phages (viruses) associated with them, you find that, phylogenetically speaking, the phage/viral TK genes are not very similar to the host genes.
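What does "not very similar" mean in numbers? One of the simplest measures is the p-distance: the fraction of aligned positions at which two sequences differ. Below is a rough Python sketch, assuming the sequences have already been aligned to the same length (with "-" for gaps). The function name `p_distance` and the three toy sequences are hypothetical; real tree-building programs offer more sophisticated substitution models, so treat this as the bare idea only.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions that differ, ignoring any position with a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip gapped columns
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared


# Toy, made-up aligned protein fragments (not real TK sequences).
aligned = {
    "host_A": "MAQLYFYYSA",
    "host_B": "MAQLYFYYSG",
    "phage_X": "MSKLIF-YGA",
}

# Pairwise distances: the raw material for a distance-based tree.
distances = {
    (m, n): p_distance(aligned[m], aligned[n])
    for m, n in combinations(aligned, 2)
}
for pair, d in distances.items():
    print(pair, round(d, 3))
```

In this toy set the two "host" fragments sit close together and the "phage" fragment sits far from both, which is the kind of pattern a tree then makes visible.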

Thymidine kinase phylo tree for enteric bacteria and their phages. Note that the phage genes (top cluster) segregate from the host genes in the lowermost branch. Interestingly, TK genes of various alphaproteobacteria of the order Rhizobiales cluster near the phage genes (much nearer than the enteric bacteria). This suggests a primordial origin of the phage genes rather than recent acquisition of TK from host cells.

In the above tree, phage (viral) genes cluster at the top. Host-cell kinases cluster at the bottom. The two clusters may be related to a distant ancestor (not shown), but one thing is certain: the phage versions of this gene are not simply a slight modification of the host gene. We know that's true because, remarkably, the thyK genes of certain alphaproteobacteria (Agrobacterium and its relatives) cluster with the phage genes, even though the bacteriophages are adapted to E. coli and Salmonella (and closely related enterics). In theory, the phage genes should cluster with the enteric bacteria, not with Agrobacterium and Rhizobium.

Where do the phage genes come from? Some have speculated that these phages originated with escaped bacterial secretion-system cassettes. That may well be, but the escape event had to have occurred many hundreds of millions of years ago. The T4 thymidine kinase gene has only 62% sequence homology with the E. coli gene. T4's version of the gene has a G+C content of 34%, with GC3 (third codon base) of just 19%. The E. coli version of the gene has overall G+C of 42% and GC3 of 36%. While it's generally conceded that evolution of viral genes occurs faster than host genes (particularly for RNA viruses and single-stranded DNA viruses), DNA viruses like T4 can't evolve outside of the host cell, as far as we know, and although DNA viruses may evolve faster than host DNA, they don't evolve thousands of times faster. (RNA viruses do evolve thousands of times faster, but that's an entirely different matter.)
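For anyone who wants to check G+C and GC3 numbers on their own sequences, the arithmetic is easy. G+C is the fraction of all bases that are G or C; GC3 is the same fraction counted only at the third position of each codon. Here is a small Python sketch, assuming an in-frame coding sequence whose length is a multiple of three. The function names and the toy sequence are made up for illustration; the toy is not the T4 or E. coli gene.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def gc3_content(cds):
    """G+C fraction at third codon positions of an in-frame coding sequence."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    third_positions = cds[2::3]  # every third base, starting at codon position 3
    return gc_content(third_positions)


# Toy, made-up coding sequence: ATG GCA AAA GGT TTA TAA
toy_cds = "ATGGCAAAAGGTTTATAA"

gc = gc_content(toy_cds)
gc3 = gc3_content(toy_cds)
print(f"G+C = {gc:.1%}, GC3 = {gc3:.1%}")
```

GC3 is the interesting number because third-position changes are often silent at the protein level, so that position tends to drift toward whatever base composition the genome as a whole favors.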

To me, the above tree says that enteric phages diverged from the common ancestor of alphaproteobacteria and gammaproteobacteria (the latter include the enteric bacteria). We're talking hundreds of millions of years ago. (Note that E. coli and its closest relatives are thought to have diverged 140 million years ago.) A recent transfer of thymidine kinase genes from enteric bacteria to their phages is not credible. It's far more likely that an ancient precursor of today's T4 phage (and similar enteric phages) had the gene, and passed it down through the ages as the phages adapted to new hosts (first the alphaproteobacteria, then the phylogenetically newer enterics).

In tomorrow's post, I want to explain in detail how the above phylo tree was made and show you how you can make your own phylogenetic trees using the popular Mega6 program. If you've never used Mega6, you're in for a treat.  


  1. Hello and thank you again for questioning the origin of viral genes.

    May I assume that you have done the exercise for other viral genes and recovered a similar phylogeny? And what about all the different types of viruses and prokaryotes in the world? Is the oversimplification here for making the post accessible (it was a great read), and shouldn't one check many more examples prior to discussing evolutionary relationships between different forms of life?

    My own most recent blog post was (in Greek) on the upcoming European Elections and the evolution of political parties. I only mention this for Robbie's sake… but also because there was a viral expansion of participating parties this time around (43 instead of 27 the past 3 times). Perhaps a stressed bacterial population started shooting pieces of their genome at each other?

  2. It depends on the gene and the virus/host system. You can find counterexamples for almost any theory on viral gene origins. What I have found is that in viruses or phages that have a temperate cycle, the host and viral versions of genes tend to be similar. In viruses that have only a lytic cycle, the phylogenetic distance can be very great, particularly in large DNA viruses. For example, very few Mimivirus genes have host orthologs, and probably more of Mimivirus's genes came from bacteria or other horizontal sources than from the amoeba host itself.

