Saturday, May 17, 2014

Evolution of Prophage Genes

Viruses have two modes of reproductive existence. In the familiar lytic cycle, the virus infects a cell, replicates itself until the cell bursts, and hundreds (or thousands) of virions are produced. But there is also a lysogenic mode of viral existence, in which the virus inserts a copy of its DNA in the host's own DNA. The viral DNA thus inserted becomes known as a prophage, which can remain dormant for long periods of time. The prophage can often be induced to enter a lytic cycle by exposure of cells to hydrogen peroxide or Mitomycin C. (Induction of phages in this fashion is thought to occur when a phage repressor protein is cleaved by recA after the latter is upregulated in the SOS response.)

Prophage genes are seen in a wide variety of bacteria (a 2008 paper estimated that over 60% of bacterial genomes contain prophage genes), and in fact human DNA is thought to contain at least 8% retroviral gene remnants. There's reason to suspect that certain large DNA animal viruses (such as herpes and vaccinia) have a lysogenic cycle. Certainly, viruses like varicella zoster (which can produce shingles many years after a person's initial infection) can remain dormant for decades before suddenly undergoing induction to a lytic phase.

Viruses that live an exclusively lytic lifecycle have relatively few opportunities to co-evolve with the host, because they spend little time in the host. Such a virus might spend years "hanging around" in the environment before encountering a host cell; then the lytic reproductive cycle may last only minutes or hours, and it's back to "hanging around" in the environment.

The situation is much different for a temperate virus (i.e., one that has a lysogenic cycle). A lysogenic virus essentially becomes an integral, first-class component of the host DNA and undergoes the same replication and repair processes that apply to host DNA. Accordingly, we should expect to see a much different pattern of evolution in the genes of lysogenic viruses (or prophages). And indeed we do.

The phylogenetic tree below was prepared using viral (phage) and bacterial genes for DNA adenine methylase (dam), an enzyme involved in DNA repair and replication. What's interesting about this gene is that many bacteria have their own (native) copies of this gene plus a prophage copy. And they differ, but not as much as, say, lytic-phage thymidine kinase versus native bacterial TK. (I showed phylo-trees for viral and bacterial TK enzymes in a prior post. If you'll recall, these enzymes differ so drastically that it's not at all clear that one derives from the other, ancestrally.)

DNA adenine methylase genes from three enteric bacteria and two phages (marked with asterisks). The top branch shows very close homology between prophage genes and their bacterial paralogues. The bottom branch shows that the native bacterial isoform of the enzyme is not as closely related to the prophage version(s).
With the dam genes, we see an interesting segregation pattern. There are two main branches to the phylo-tree. In the upper branch are the phage dam genes along with bacterial paralogues of these genes. The bottom branch shows how the non-paralogous (non-prophage) dam genes segregate.

To make these relationships clearer, here's a chart showing the overall G+C content as well as the GC3 (G+C content at codon base 3) for the various genes. The entries shaded in grey represent prophage genes. Notice that the G+C percentages are significantly lower for the prophage genes, but are higher than in free-living lytic-cycle phages (where GC3, in particular, is often less than 20%).


DNA adenine methylase genes for enteric bacteria and their temperate phages. Base-composition stats for prophage isoforms are shown in grey.
Organism Gene G+C GC3
Shigella sp. strain D9 ZP_05434596.1 49.60% 55.90%
E. coli EHW52521.1 49.60% 55.90%
S. enterica AAL22346.1 49.10% 54.50%
E. coli EHW55384.1 47.20% 47.90%
Salmonella phage RE-2010 YP_007003503.1 46.50% 46.20%
Shigella sp. strain D9 EGJ07993.1 46.20% 46.30%
S. enterica ETB92379.1 46.20% 43.00%
Fels-2 phage YP_001718754.1 46.20% 43.00%

If you compare the phylo tree shown further above with the phylo tree in my earlier post about thymidine kinase genes, you'll note that the prophage dam genes cluster very tightly with bacterial versions of these genes. That's because, as a fully integrated part of the genome, the prophage genes benefit from the host's DNA repairosome. They evolve gradually over long periods of time by the usual mechanisms. The genes are notably host-like because they're continuously repaired and groomed in the same manner as host DNA.

The takeaway here is: If you create a phylo-tree for a set of genes from hosts and viruses, and the genes cluster tightly with host versions, you're probably looking at the result of longterm lysogeny. On the other hand, if the virus genes do not cluster with host genes (as they usually don't!), that means you're looking at viruses that have a predominantly lytic mode of existence; viruses that probably got their genes from a far-distant ancestor of the modern-day host, if not from a primordial precellular precursor of some kind.

Friday, May 16, 2014

Evolution of Viral Genes vs. Host Genes

Some viral genes show evidence of having an ancient origin, predating the divergence of host organisms into different species. The enzyme thymidine kinase (TK), for example, has followed a certain set of evolutionary paths in enteric bacteria: the E. coli version differs slightly from the Salmonella version, which differs from the Yersinia version or the Shigella version, but they all show show evidence of common descent from ancestors that included Vibrio and Proteus. By contrast, the bacteriophages (viruses that attack enteric bacteria) have their own thymidine kinases that followed different evolutionary trajectories, resulting in genes with quite different G+C contents. The thymdine kinase in today's T4 phage shows a closer phylogenetic relation to the TK of Rhizobium and Agrobacterium than to E. coli or Salmonella. I presented a phylogenetic tree to this effect in an earlier post.

But sometimes, viral genes follow host-gene evolution more closely. The thymidine kinases of eukaryotic organisms and their viruses provided a good example of this. Consider the following tree developed from thymidine kinase genes of Guinea pig (Cavia porcellus), bull (Bos taurus), Cowpox and Swinepox viruses, amoeba species (Entamoeba), amoeba Mimivirus, two algal viruses, and finally, two strains of the alga Micromonas.

Thymidine kinase genes for cowpox and swinepox viruses occur in the same sub-branch with bull (Bos taurus) and Guinea pig (Cavia porcellus) genes; see the top four lines. Likewise, the Mimivirus TK gene is not far from Entamoeba. and the TK genes for algal viruses cluster near the TK genes of the alga Micromonas. Branch confidence is high (after 500 bootstraps, no nodes separate with less than 67% confidence). The 0.1 scale marker (lower left) represents substitutions per site, as represented by leaf-node depth. Tree generated using Mega6 freeware.
If you're not familiar with interpreting these trees, node depth (horizontal line length) is proportional to the fraction of amino acid substitutions per site, with a line length of about a centimeter representing 10 substitutions per 100 amino acids (note the 0.1 marker, lower left). The numbers at the branch points (100, 91, 86, etc.) represent the confidence that nodes to the right were located in the proper branches. These numbers are perecntages, representing the number of bootstrap trials (out of 100) that resulted in no "node-jumping." (Bootstrap testing is a way of introducing systematic noise into sequences to try to "trick them" into jumping to a new spot in the tree. If a little noise makes a node switch locations, the original location is suspect.) Every node in this tree was tested with 500 bootstrap tests. Overall, we can be fairly certain the node locations are correct.

What does the tree mean? Unlike the situation with bacterial and bacteriophage kinases (see earlier post), the various thymidine kinases of the DNA viruses shown here do tend to evolve in parallel with host equivalents. However, this doesn't automatically mean the viral genes originated with host genes, because (for example) the amoeba version of thymidine kinase shares only 39% amino-acid sequence identity with the Mimivirus version. That's a lot of divergence. It means the viral version of the gene (no doubt highly optimized for viral needs) could have come from a very-long-ago ancestor of the present-day host. Likewise, the Bos taurus version of the TK gene has only 69% similarity with the cowpox version. By comparison, the bovine gene is 88% similar to the human gene. The cowpox thymidine kinase gene is further away (evolutionarily) from the host gene than human TK is from rainbow trout TK.

It's prudent to conclude that viral genes are not simply orthologues of host-gene counterparts; they certainly don't represent recent gene transfer events (because there's way too much sequence divergence, even accounting for faster evolution in DNA viruses than in host DNA). The viral genes come from "somewhere else," probably a primordial past. Any resemblance to present-day host genes is largely incidental. 

The great majority of so-called "virus hallmark genes" (involving things like capsid proteins) have no counterpart at all in modern host cells, with very rare exceptions. Large DNA viruses may have gotten their start a long, long time ago, possibly in the pre-cellular world, where communities of free-ranging genes (some "selfish," some less so) coexisted in a common broth, with the only "cells" being micro-compartments in ocean-bed minerals.

RNA viruses are an entirely different matter. It's well known that RNA viruses evolve thousands of times faster than other viruses and often evolve in close harmony with hosts. Even so, faster evolution doesn't mean RNA virus genes came from host cells. RNA-virus genes, too, are probably of ancient provenance, maybe predating cellular life. As Koonin et al. said in 2006:
The existence of several genes that are central to virus replication and structure, are shared by a broad variety of viruses but are missing from cellular genomes (virus hallmark genes) suggests the model of an ancient virus world, a flow of virus-specific genes that went uninterrupted from the precellular stage of life's evolution to this day. This concept is tightly linked to two key conjectures on evolution of cells: existence of a complex, precellular, compartmentalized but extensively mixing and recombining pool of genes, and origin of the eukaryotic cell by archaeo-bacterial fusion. The virus world concept and these models of major transitions in the evolution of cells provide complementary pieces of an emerging coherent picture of life's history.

Thursday, May 15, 2014

How to Build Your Own Phylo Trees with Mega6

Mega6 is a popular freeware program for creating phylogenetic trees and sequence alignments, analyzing gene sequence data, performing various calculations (such as Tajima's D), computing pairwise distances between sequences, and doing lots of other types of genetic analysis. This program is widely used and is based partly on the work of Masatoshi Nei, the famed Penn State molecular geneticist and author of the seminal book Mutation-Driven Evolution (which I highly recommend). You need this program if you're into molecular genetics.

In my last post, I showed a phylogenetic tree that I created in Mega6 for thymidine kinases of phages and hosts. You can make trees of this sort yourself in a matter of minutes using Mega6. Let me take you through a quick example.

Here's how to recreate the tree involving thymidine kinase. Further below, I've listed the FASTA sequences you'll need. Obviously, you can obtain your own sequences from Uniprot.org or other sources, if you want to make your own phylo trees based on other sequences.

All you have to do is Copy and Paste FASTA sequences into the Alignment Explorer window of Mega6, then generate a Maximum Likelihood tree (or a Parsimony Tree, or whatever kind of tree your prefer).

Here's the exact procedure:
  1. Fire up Mega6.
  2. Click the Align button in the main taskbar. (See screenshot below.)
  3. Choose Edit/Build Alignment from the menu that pops up. A new window will open. (Note: A dialog may ask you if you are creating a new data set; answer Yes. A dialog may also appear asking whether the alignment you're creating will involve DNA, or Proteins. If you're using the amino-acid sequences shown further below, click Proteins.)
  4. In Mega6, click the Align button on the left side of the taskbar. Select Edit/Build Alignment.
  5. Paste FASTA sequences into the Alignment Explorer window. 
  6. Select All (Control-A).
  7. Choose Align by ClustalW from the Alignment menu. An alignment options dialog will appear. Accept all defaults by clicking OK.
  8. After the alignment operation finishes (it takes 5 seconds), go to the menu and choose Data > Export Alignment > Mega Format. Save the *.meg file to disk.
  9. Now go back to the main Mega6 window and click the Phylogeny button in the taskbar. A menu will drop down.
  10. Select the first item: Construct/Test Maximum Likelihood Tree. A dialog will appear asking if you want to use the currently active data set. Click No. This will bring up a file-navigation dialog. Use that dialog to find the *.meg file you created in Step 7. Select that file and click Open.
  11. A tree options dialog will appear. If you just want to see a tree quickly, accept the defaults and click OK. The tree will take less than 10 seconds to generate. If you want to do a test of phylogeny (this part isn't obvious!), click the item to the right of "Test of Phylogeny" (see the graphic below) and choose "Bootstrap method," then set the number of bootstraps using the dropdown menu (see screenshot below).
  12. Click the Compute button to build the tree.
If you want to do bootstrap validation of tree nodes, you have to click the yellow area to the right of Test of Phylogeny. (See red arrow above.) Choose Bootstrap Method. Then set Number of Bootstraps to 500.

Note that if you choose to do a bootstrap test of phylogeny, the tree may take 5 minutes or more (possibly much more) to build, depending on how many nodes it contains. Bootstrapping is a compute-intensive operation. The idea behind bootstrap testing is that node assignments in phylo trees are often uncertain, and one way to check the robustness of a given assignment is to systematically introduce noise into the data to see how readily a node can be made to jump branches. A node that can easily be tricked into jumping to a different spot in the tree is untrustworthy. The bootstrap test attempts to quantify the degree of reliability of the node assignments. Usually, you want to do at least several hundred tests (500 is considered adequate). If a given node jumps branches in half the tests, the tree will carry "50" (meaning 50% confidence) at the top of the branch, meaning there's a only 50% certainty that that particular node assignment is correct as to branch location. A tree where all the branches have numbers greater than 50 can usually be considered reliable. 

Mega6 will output Maximum Likelihood trees, parsimony trees, and other types of trees, and the program will do an amazing variety of calculations and statistical tests. Most times, you can save analyses in Excel format, which of course is a godsend in case you need to do additional analysis that can't be done in Mega6. The documentation contains many helpful tutorials; be sure to give it a look.

If I had a wish-list for Mega6, it would be quite short. Mainly I'd like to be able to see quick graphic summaries of things like the ratio of synonymous to non-synonymous mutations in two DNA sequences. (The data for this is available via the HyPhy command under the Selection taskbar button, but only as raw spreadsheet data; you have to sum the columns yourself in Excel to get certain kinds of summaries, and if you want a graph, you have to create it yourself. This is hardly a major drawback. Still, it would be nice to see more summary data, quickly, in graphical form.) When you align two or more genes in the Alignment Editor, it shows asterisks above the identical nucleotides, but doesn't show percent identity anywhere, nor percent "positives" for amino acid data. The Alignment Editor also doesn't respond properly (worse: it responds inappropriately) to mouse-wheel actions, jumping to the end of a file horizontally when you wanted to wheel-scroll vertically.

Also mildly annoying: the tree renderings are bitmaps; I would much prefer to see SVG (Scalable Vector Graphics) format, which in addition to being infinite-resolution (vector format) also allows easy editing of line widths, colors, fonts, labels, etc. in a simple text editor. As it is now, to edit line widths or colors in phylo-trees, you have to drag out Photoshop.

But overall, I have few significant complaints (and much praise) for Mega6. It's an immensely powerful program, it's fast, it's quite intuitive, and the best part is, it's free. (For a more detailed commentary on the program's design philosophy and capabilities, see this excellent 2011 paper by Tamura, Nei, et al. It was written at the time of Mega5, but applies equally to Mega6.)

Below are the FASTA sequences used in making the phylo tree for yesterday's post. You can Cut and Paste these sequences directly into Mega6's Alignment Explorer:

>sp|P13300|KITH_BPT4 Enterobacteria phage T4 
MASLIFTYAAMNAGKSASLLIAAHNYKERGMSVLVLKPAIDTRDSVCEVVSRIGIKQEAN
IITDDMDIFEFYKWAEAQKDIHCVFVDEAQFLKTEQVHQLSRIVDTYNVPVMAYGLRTDF
AGKLFEGSKELLAIADKLIELKAVCHCGKKAIMTARLMEDGTPVKEGNQICIGDEIYVSL
CRKHWNELTKKLG
>tr|S5MKX8|S5MKX8_9CAUD Yersinia phage PST 
MASLIFTYAAMNAGKSASLLTAAHNYKERGMSVLVLKPAIDTRDSVCEVVSRIGIKQEAN
IITDDMDIFEFYKWAEAQKDIHCVFVDEAQFLKTEQVHQLSRIVDTYNVPVMAYGLRTDF
AGKLFEGSKELLAIADKLIELKAVCHCGKKAIMTARLMEDGTPVKEGNQICIGDEIYVSL
CRKHWNELTKKLG
>tr|I7KRQ7|I7KRQ7_9CAUD Yersinia phage phiD1
MASLIFTYAAMNAGKSASLLTAAHNYKERGMSVLVLKPAIDTRDSVCEVVSRIGIKQEAN
IITDDMDIFEFYKWAEAQKDIHCVFVDEAQFLKTEQVHQLSRIVDTYNVPVMAYGLRTDF
AGKLFEGSKELLAIADKLIELKAVCHCGKKAIMTARLMEDGTPVKEGNQICIGDEIYVSL
CRKHWNELTKKLG
>tr|F2VXC8|F2VXC8_9CAUD Shigella phage Shfl2 
MASLIFTYAAMNAGKSASLLTAAHNYKERGMSVLVLKPAIDTRDSVCEVVSRIGIKQEAN
IITDDMDIFEFYKWAEAQKDIHCVFVDEAQFLKTEQVHQLSRIVDTYNVPVMAYGLRTDF
AGKLFEGSKELLAIADKLIELKAVCHCGKKAIMTARLMEDGTPVKEGNQICIGDEIYVSL
CRKHWNELTKKLG
>tr|I7J3X5|I7J3X5_9CAUD Yersinia phage phiR1-RT 
MAQLYYNYAAMNSGKSTSLLSVAHNYKERGMGTLVMKPAVDTRDSSSEIVSRIGIKLEAN
VIHPGMNIVEFFKWAQTQRDIHCVLIDEAQFLEPAQVQDLCKIVDIYNVPVMAYGLRTDF
RGELFPGSKALLQCADKLVELKGVCHCGKKATMVARIDINGNAVKDGAQIELGGEDKYVS
LCRKHWCEMLELY
>sp|Q98HR4|KITH_RHILO Rhizobium loti (strain MAFF303099) 
MAKLYFNYATMNAGKTTMLLQASYNYRERGMTTMLFVAGHYRKGDSGLISSRIGLETEAE
MFRDGDDLFARVAEHHDHTTVHCVFVDEAQFLEEEQVWQLARIADRLNIPVMCYGLRTDF
QGKLFSGSRALLAIADDLREVRTICRCGRKATMVVRLGADGKVARQGEQVAIGKDVYVSL
CRRHWEEEMGRAAPDDFIGFMKS
>tr|F0LSI7|F0LSI7_VIBFN Vibrio furnissii (strain DSM 14383 / NCTC 11218)
MAQMYFYYSAMNAGKSTTLLQSSFNYQERGMTPVIFTAAIDDRFGVGKVSSRIGLEADAH
LFTSDTNLFDAIKQLHQNEKRHCVLVDECQFLTKEQVYQLTEVVDKLDIPVLCYGLRTDF
LGELFEGSKYLLSWADKLIELKTICHCGRKANMVIRTDEHGNAISEGDQVAIGGNDKYVS
VCRQHYKEALGR
>sp|Q5E4F2|KITH_VIBF1 Vibrio fischeri (strain ATCC 700601 / ES114) 
MAQMYFYYSAMNAGKSTTLLQSSFNYQERGMNPAIFTAAIDDRYGVGKVSSRIGLHAEAH
LFNKETNVFDAIKELHEAEKLHCVLIDECQFLTKEQVYQLTEVVDKLNIPALCYGLRTDF
LGELFEGSKYLLSWADKLVELKTICHCGRKANMVIRTDEHGVAIADGDQVAIGGNELYVS
VCRRHYKEALGK
>tr|V2ABB3|V2ABB3_SALET Salmonella enterica subsp. enterica serovar Gaminara str. ATCC BAA-711
MAQLYFYYSAMNAGKSTALLQSSYNYQERGMRTVVYTAEIDDRFGAGKVSSRIGLSSPAK
LFNQNTSLFEEIRAESARQTIHCVLVDESQFLTRQQVYQLSEVVDKLDIPVLCYGLRTDF
RGELFVGSQYLLAWSDKLVELKTICFCGRKASMVLRLDQDGRPYNEGEQVVIGGNERYVS
VCRKHYKDALEEGSLTAIQERLR
>tr|I6H5M2|I6H5M2_SHIFL Shigella flexneri 1235-66
MAQLYFYYSAMNAGKSTALLQSSYNYQERGMRAVVYTAEIDDRFGAGKVSSRIGLSSPAK
LFNQNSSLFEEIRAENAQQRIHCVLVDESQFLTRQQVYELSEVVDQLDIPVLCYGLRTDF
RGELFGGSEYLLAWSDKLVELKTICFCGRKASMVLRLDQAGRPYNEGEQVVIGGNERYVS
VCRKHYKEAQSEGSLTAIQERHSHD
>sp|P23331|KITH_ECOLI Escherichia coli (strain K12)
MAQLYFYYSAMNAGKSTALLQSSYNYQERGMRTVVYTAEIDDRFGAGKVSSRIGLSSPAK
LFNQNSSLFDEIRAEHEQQAIHCVLVDECQFLTRQQVYELSEVVDQLDIPVLCYGLRTDF
RGELFIGSQYLLAWSDKLVELKTICFCGRKASMVLRLDQAGRPYNEGEQVVIGGNERYVS
VCRKHYKEALQVDSLTAIQERHRHD
>sp|Q66AM8|KITH_YERPS Yersinia pseudotuberculosis serotype I (strain IP32953
MAQLYFYYSAMNAGKSTALLQSSYNYQERGMRTLVFTAEIDNRFGVGTVSSRIGLSSQAQ
LYNSGTSLLSIIAAEHQDTPIHCILLDECQFLTKEQVQELCQVVDELHLPVLCYGLRTDF
LGELFPGSKYLLAWADKLVELKTICHCGRKANMVLRLDEQGRAVHNGEQVVIGGNESYVS
VCRRHYKEAIKAACCS
>tr|B4EXS0|B4EXS0_PROMH Proteus mirabilis (strain HI4320)
MAQLYFYYSAMNAGKSTSLLQSSYNYNERGMRTLIFTAAIDTRFAKGKVTSRIGLSADAL
LFSDDMNIRDAILAENNKEPIHCVLIDECQFLTKAHVEQLCEITDSYDIPVLTYGLRTDF
RGELFTGSAYLLAWADKLVELKTVCYCGRKANKVLRLAANGKVLSDGAQVEIGGNEKYVS
VCRKHYTEATLKGRIEQL
>tr|G0GHM0|G0GHM0_KLEPN Klebsiella pneumoniae KCTC 2242
MAQLYFYYSAMNAGKSTALLQSSYNYQERGMRTVVYTAEIDDRFGAGKVSSRIGLSSPAR
LYNPQTSLFDDIAAEHQLKPIHCVLVDESQFLTREQVHELSEVVDTLDIPVLCYGLRTDF
RGELFTGSQYLLAWSDKLVELKTICFCGRKASMVLRLDQEGRPYNEGEQVVIGGNERYVS
VCRKHYKEALSVGSLTKVQNQHRPC
>tr|F7YDB7|F7YDB7_MESOW Mesorhizobium opportunistum (strain LMG 24607 / HAMBI 3007 / WSM2075)
MAKLYFHYATMNAGKTTMLLQASYNYRERGMTTMLFVAGHYRKGDSGLISSRIGLETEAE
MFRDGDDLFARVAEHHQRSAVHCVFVDEAQFLEEEQVWQLARIADRLNIPVMCYGLRTDF
QGKLFSGSRALLAIADDLREVRTICRCGRKATMVVRLGPDGKVARQGEQVAIGKDVYVSL
CRRHWEEEMGRAAPDDFIGFVRN
>tr|H0H7T7|H0H7T7_RHIRD Agrobacterium tumefaciens 5A
MAKLYFNYAAMNAGKSTMLLQASYNYHERGMRTLIFTAAFDDRAGFGRVASRIGLSSDAR
TFDANTDIFSEVEALHAEAPVACVFIDEANFLSEHHVWQLAGIADRLNIPVMAYGLRTDF
QGKLFPASRELLAIADELREIRTICHCGRKATMVARFDNEGNVVKEGAQIDVGGNEKYVS
FCRRHWVETVKGD

Wednesday, May 14, 2014

Where Do Virus Genes Come From?

There's a memorable moment in War of the Worlds (Spielberg version) when the Tom Cruise character tries to explain to his son, while driving a minivan, that they have to leave town immediately because evil, marauding machines "from somewhere else" are on the rampage. Robbie (the son) says: "What, you mean, like, from Europe?"

That's the image that comes to mind when I try to explain where virus genes come from. They don't come from the mother ship (the host). Oh sure, in some cases they clearly do derive from host genes. But in most cases, they clearly don't. The overwhelming majority of viral genes have no counterparts in host cells, and even for those that do, the genes in question are rarely true host orthologues.

"No, Robbie, not like Europe."
Where do they come from, then?

Short answer: Somewhere else.

Using a program like the excellent (and free) Mega6 ("Molecular Evolutionary Genetic Analysis") you can easily create phylogenetic trees from genetic data (FASTA sequences, protein or DNA), and when you do this for genes that occur in both viruses and host cells, you can see how they separate in phylo-space.

Example: I decided to look at the gene for thymidine kinase, which occurs in most living things plus a certain number of undead, maybe not-quite-living (in the usual sense) things known as viruses. Thymine (T) is, of course, an essential ingredient of DNA. Thymidine kinase converts thymidine (thymine bonded to deoxyribose, on the left, below) to the phosphorylated form (TMP, right) so it can participate in DNA synthesis.  

It's important to note that thymidine (above, left) is not a synthesis product but a breakdown product. The biosynthetic pathway for TMP actually starts with dUMP, which is converted to TMP via thymidylate synthetase, an entirely different enzyme. Where does thymidine come from, then, and why does a cell need thymidine kinase? The answer is that thymidine is a breakdown product of DNA. When you eat plant or animals cells, the DNA in those cells gets broken down, and thymidine is one of the breakdown products. For cells, thymidine is a valuable product to have, so thymidine kinase recovers free thymidine via the reaction shown above. This is a scavenging pathway, in other words. Most organisms have it, but some don't (e.g., Pseudomonas lacks it).

When a virus attacks a cell, there's a lot of DNA turnover as the virus prepares to manufacture its own DNA, so thymidine kinase is a handy enzyme to have around if you're a virus. Many viruses, as a result, bring their own copy of the TK enzyme gene. But is the viral TK gene derived (in some kind of ancestral way) from the host cell's own TK gene? Not necessarily.

When you gather up the amino acid sequence data for a bunch of TK genes from bacteria and the phages (viruses) associated with them, you find that, phylogenetically speaking, the phage/viral TK genes are not very similar to the host genes.

Thymidine kinase phylo tree for enteric bacteria and their phages. (Click to enlarge.) Note that the phage genes (top cluster) segregate from the host genes in the lowermost branch. Interestingly, TK genes of various alphaproteobacteria of the Rhizobiales class cluster near the phage genes (much nearer than the enteric bacteria). This suggests a primordial origin of the phage genes rather than recent acquisition of TK from host cells.

In the above tree, phage (viral) genes cluster at the top. Host-cell kinases cluster at the bottom. The two clusters may be related to a distant ancestor (not shown), but one thing is certain: the phage versions of this gene are not simply a slight modification of the host gene. We know that's true because, remarkably, the thyK genes of certain alphaproteobacteria (Agrobacterium and its relatives) cluster with the phage genes, even though the bacteriophages are adapted to E. coli and Salmonella (and closely related enterics). In theory, the phage genes should cluster with the enteric bacteria, not with Agrobacterium and Rhizobium.

Where do the phage genes come from? Some have speculated that these phages originated with escaped bacterial secretion-system cassettes. That may well be, but the escape event had to have occurred many hundreds of millions of years ago. The T4 thymidine kinase gene has only 62% sequence homology with the E. coli gene. T4's version of the gene has a G+C content of 34%, with GC3 (third codon base) of just 19%. The E. coli version of the gene has overall G+C of 42% and GC3 of 36%. While it's generally conceded that evolution of viral genes occurs faster than host genes (particularly for RNA viruses and single-stranded DNA viruses), DNA viruses like T4 can't evolve outside of the host cell, as far as we know, and although DNA viruses may evolve faster than host DNA, they don't evolve thousands of times faster. (RNA viruses do evolve thousands of times faster, but that's an entirely different matter.)

To me, the above tree says that enteric phages diverged from the common ancestor of alphaproteobacteria and gammaproteobacteria (the latter include the enteric bacteria). We're talking hundreds of millions of years ago. (Note that E. coli and its closest relatives are thought to have diverged 140 million years ago.) A recent transfer of thymidine kinase genes from enteric bacteria to their phages is not credible. It's far more likely that an ancient precursor of today's T4 phage (and similar enteric phages) had the gene, and passed it down through the ages as the phages adapted to new hosts (first the alphaproteobacteria, then the phylogenetically newer enterics).

In tomorrow's post, I want to explain in detail how the above phylo tree was made and show you how you can make your own phylogenetic trees using the popular Mega6 program. If you've never used Mega6, you're in for a treat.  

Tuesday, May 13, 2014

Anaerobic Catalases and Codon Pseudodegeneracy

Today I want to demolish two myths, one being that strict anaerobes (such as members of the Clostridia, a class of bacteria that includes the botulism organism as well as the C. diff bug that sickens half a million people a year in the U.S.) lack a catalase enzyme, and the other being that codons have very little degeneracy in base one. Don't worry, I'll parse all this for you so that even if you're not a bio-geek, you'll get it.

Bacteriology students are taught from Day One that strict anaerobes are killed by oxygen because they lack a catalase enzyme. Catalase is the nearly universal enzyme that breaks down peroxides into oxygen and water. The catechism about anaerobes being catalase-negative is true for some anaerobes, but by no means all. In fact, if you go to UniProt.org and do a search on "catalase clostridium" you'll get a long list of hits representing anaerobes that do, in fact, have a perfectly normal catalase enzyme. Some anaerobes have the classic katG (E. coli) version of catalase while others (like C. difficile) have a special manganese-containing catalase. Pathogenic anaerobes probably use the enzyme to combat the respiratory burst that accompanies engulfment by white blood cells. I've written about anaerobic  catalases before; for an overview of the enzymology and phylogenetic breakdown by enzyme type, start here.

Methanospirillum (a strict anaerobe found
in sewage) produces colonies with
characteristic striations spaced two
cell-lengths apart.
I thought it would be interesting to compare the katG genes of two anaerobes to see how they differ. I decided to focus on C. botulinum B (the nasty food-poisoning organism) and the archeon Methanospirillum hungatei strain JF-1, which is named after a professor I had in grad school, Bob Hungate, a true pioneer in the study of anaerobic bacteria.

C. botulinum and M. hungatei are worlds apart, taxonomically. The genome of the former has a G+C (guanine, cytosine) content of just 27%, while the latter has G+C of 45%. Comparing their catalases, I found 717 nucleotide differences between the two genes (which were 2193 bases long in one case and 2157 in the other case). On the surface, the genes seem (from a DNA sequence point of view) quite far apart phylogenetically. But after performing an alignment in Mega6, I was able to study the codons base by base and found that of 717 differences, 500 changes were synonoymous (resulting in no amino acid substitution) while only 217 changes were non-synonymous. Not unexpectedly, most of the synonymous changes were due to differences in the third codon base (where the genetic code is highly degenerate; see below).

Codon Usage in Clostridium botulinum B
Codon Usage: The Standard Code (transl_table=1)
CCA(P) 1.45%
CCG(P) 0.10%
CCT(P) 1.00%
CCC(P) 0.06%

CGA(R) 0.08%
CGG(R) 0.01%
CGT(R) 0.18%
CGC(R) 0.02%

CAA(Q) 2.01%
CAG(Q) 0.25%
CAT(H) 1.10%
CAC(H) 0.16%

CTA(L) 0.77%
CTG(L) 0.10%
CTT(L) 1.67%
CTC(L) 0.06%
GCA(A) 2.66%
GCG(A) 0.20%
GCT(A) 2.18%
GCC(A) 0.19%

GGA(G) 3.30%
GGG(G) 0.47%
GGT(G) 2.08%
GGC(G) 0.40%

GAA(E) 6.40%
GAG(E) 1.11%
GAT(D) 5.12%
GAC(D) 0.62%

GTA(V) 2.79%
GTG(V) 0.45%
GTT(V) 2.81%
GTC(V) 0.14%
ACA(T) 2.36%
ACG(T) 0.16%
ACT(T) 2.22%
ACC(T) 0.19%

AGA(R) 2.43%
AGG(R) 0.31%
AGT(S) 1.89%
AGC(S) 0.48%

AAA(K) 7.10%
AAG(K) 2.17%
AAT(N) 6.06%
AAC(N) 0.89%

ATA(I) 5.68%
ATG(M) 2.50%
ATT(I) 4.01%
ATC(I) 0.42%
TCA(S) 2.10%
TCG(S) 0.12%
TCT(S) 1.68%
TCC(S) 0.12%

TGA(*) 0.02%
TGG(W) 0.69%
TGT(C) 1.02%
TGC(C) 0.25%

TAA(*) 0.24%
TAG(*) 0.07%
TAT(Y) 3.61%
TAC(Y) 0.55%

TTA(L) 5.68%
TTG(L) 0.74%
TTT(F) 3.74%
TTC(F) 0.57%
Pink-highlighted codons are examples of where the genetic code is degenerate for base one. 
A change of CGA to AGA (for example) results in no amino acid change: the 
coded-for amino acid is arginine in either case.


"Degeneracy" means a change in a particular base results in no change of amino acid. An example is the codon AAA (see above), which is the code for lysine (K). Changing the last base to G (resulting in a codon of AAG) still produces lysine in the resulting protein. Codons that begin with 'AA' are two-fold degenerate, because two versions (AAA and AAG) produce the same amino acid. A codon that begins with 'CC' is said to be four-fold degenerate, because any of four different endings (CCA, CCT, CCG, CCC) produce the same amino acid (P, proline).

What's less obvious is that a change in the first (rather than the third) base of a codon can produce synonyms as well. For example, changing CGA to AGA results in the same amino acid: arginine (R). Likewise, CTA changing to TTA still results in leucine (L).

But there's more degeneracy to  the first codon base than just the classic cases outlined in pink in the above table. A change of CTA to ATA produces a switch from leucine to isoleucine, which is tantamount to no change at all, because leucine and isoleucine are isomers (both have the same chemical formula; they differ slightly in arrangement).


Likewise, a change from CTx to GTx produces a change from leucine to valine, which are workalikes. One could even argue that a change from CTT or CTC to TTT or TTC (leucine to phenylalanine) is functionally a degenerate change, because leucine and phenylalanine are nonpolar, hydrophobic amino acids with similar chemical properties.

So in fact, many changes to the first base of a codon can be considered pseudodegenerate, if you will, by virtue of producing non-synonymous changes that are tantamount to very little actual chemical change. We would expect to see a significant number of such changes in mutations affecting base one of codons. And we do.

In the two catalase genes, I looked at changes involving just the first base of a codon and found the following changes:

Synonymous:
AGA<-->CGA (R:R)
AGA<-->CGA (R:R)
AGA<-->CGA (R:R)
TTA<-->CTA (L:L)
TTG<-->CTG (L:L)
TTG<-->CTG (L:L)
TTG<-->CTG (L:L)
TTG<-->CTG (L:L)

Non-Synonymous:
AAA<-->CAA (K:Q)
AAA<-->GAA (K:E)
AAA<-->GAA (K:E)
AAG<-->CAG (K:Q)
AAG<-->CAG (K:Q)
AAT<-->CAT (N:H)
AAT<-->GAT (N:D)
AAT<-->GAT (N:D)
ATT<-->CTT (I:L) *
ATT<-->GTT (I:V) *
CAT<-->GAT (H:D)
CTG<-->ATG (L:M)
CTT<-->ATT (L:I) *
GCA<-->TCA (A:S)
TAT<-->CAT (Y:H)
TCA<-->GCA (S:A)
TCT<-->CCT (S:P)
TTT<-->CTT (F:L) *


Notice that 4 out of 18 non-synonymous changes (see asterisks) fall in the category of pseudodegeneracies. That's slightly higher than the 17% we would expect by chance. So instead of 8 out of 26 first-base changes being degenerate (in the catalase genes), more like 12 out of 26 are either synonymous or near-synonymous. Bottom line, roughly half of first-base mutations are effectively synonymous, at least in this example.

Monday, May 12, 2014

Problems with the Tuberculosis Genome

These days, most genes, in most sequenced genomes, are machine-annotated with a minimum of human intervention, and as a result around 30% of gene annotations are inaccurate, either as to functional assignment or as to reading frame.

It's not hard to find serious errors in published genomes. In the genome of Mycobacterium tuberculosis ATCC 35801, for example, there's a 17-kilobase-pair section of the genome that's full of errors (see below).

A view of two tuberculosis bacterial genomes showing a 17,000-base region (denoted by pink) of 100% sequence similarity. Notice that even though the genomes are 100% sequence-identical in the region, the number, strand orientation, and sizes of genes differ. The yellow gene in the top panel is identified by GeneMarkS+ as "secretion protein EspK," while the yellow gene in the lower panel is identified as DNA Ligase (ligA) and is much smaller (and occurs on the opposite strand).

In this graphic, the same region is shown for two strains of M. tuberculosis. On top is M. tuberculosis Strain EAI5 and on the bottom is M. tuberculosis ATCC 35801. To browse the genomes in your own browser, go to this link and click the pink Run GEvo Analysis! button. When the panels appear, you'll be able to click on individual genes to see what they (supposedly) are.

The yellow-colored gene in the top panel is annotated as "secretion protein EspK," while the corresponding region in the bottom genome has two genes (green and yellow) on opposing strands of DNA, with one gene (in green) given as a "hypothetical protein" and the other (in yellow) as "DNA ligase." Note that the entire region covered by the pink bands is 100% identical from organism to organism in terms of DNA sequence. And yet one annotation program found 13 genes while the other found 17.

Sadly, this is not an unusual situation.

How can you tell which annotations are correct? Unfortunately, it takes some investigation. Consider the two yellow genes. They can't both be correct as shown, because one is twice as long as the other and each is on a different strand of the DNA! And yet, if you obtain the DNA sequence of each, and use it as a query sequence in a BLAST search, you'll come up with "good" hits for each gene (because there are other incorrectly identified genes in public databases, matching each query).

It turns out the "secretion protein EspK" (the large yellow gene in the top panel) is correctly identified. To determine this, I downloaded the FASTA sequence for the large gene, plus the sequence for the same gene in the same location in the M. canetti CIPT genome (identified as "hypothetical alanine and proline rich protein"), plus the same gene in M. canetti CIPT 140070010, with the idea of identifying the differences in the (aligned) DNA sequences with respect to their codon positions.

When I had a script identify mutations by codon location, the number of differences at the three possible codon locations, were:

base 1: 4 changes
base 2: 2 changes
base 3: 19 changes

This is exactly the pattern you would expect if the reading frame is correct. Base 3 (the wobble base) tends to accumulate the majority of mutations, because changes in a codon's third base usually result in no amino acid change (due to the genetic code's degeneracy). Base 1 has a small amount of degeneracy and thus has the next-highest number of mutations. Base 2 has no degeneracy; all changes to this base result in a different amino acid. (All changes are non-synonymous, in other words.) 

I repeated this check using the so-called "DNA ligase" gene of Mycobacterium tuberculosis ATCC 35801, but using sequence data from the bottom strand of DNA, with 1329 bases' worth of additional 3'-end sequence data in order to cover the same amount of DNA as the other sequences. I got difference data of 3, 1, and 8 for bases one, two, and three. This verified that the active gene is actually on the bottom strand.  ERDMAN_4254 is incorrectly annotated as a top-strand DNA ligase. Instead of two genes, ERDMAN_4253 and ERDMAN_4254 (on two opposing strands), Mycobacterium tuberculosis ATCC 35801 should be shown as having a single large gene on the minus-one strand. The correct identification (based on more than thirty E=0, 100% identity hits in other strains) is, in fact, "secretion protein EspK."

Unfortunately, at least 3 other tuberculosis strains, namely M. tuberculosis GuangZ0019, plus Strain CCDC5180 and Strain CCDC5079, also harbor an incorrect ligA annotation, and Mycobacterium isn't the only organism with a "bogus ligA" problem. Micromonospora sp. M42 also has a fake ligA gene and I'm certain there are others.

Bottom line, don't believe everything you read in genome annotations. The annotation may say "DNA ligase," but you could actually be looking at something else entirely.




Sunday, May 11, 2014

Making Sense of Antisense

One reason genomes are so poorly annotated is that annotation software (of the Glimmer variety) gets easily confused by high-GC-content genome data. When a genome is high in guanine and cytosine, relatively few stop codons are present in alternate reading frames. (Recall that DNA is read in triplets of letters, called codons: AAA, AAC, AAT, AGT, etc. There are 6 possible reading frames for any given segment of DNA, representing 3 forward reading frames and 3 backward frames.) Stop codons (TGA, TAG, TAA) are mostly composed of A and T, not G and C. But also, it so happens that most protein-coding genes follow a certain pattern of codon construction. The first base in a 3-letter codon is usually A or G (about 60% of the time, in most genes, in most organisms). The second base is highly variable in all respects. The third base is usually reflective of overall genome composition: If the genome is high in G and C, the third base of each codon will tend to be G or C. (This happens almost all the time in Streptomyces, for example, where the third base G+C content is 97%.) If the genome is high in A and T, the third codon base will be high in A and T.

What's perhaps unexpected is that the same compositional pattern can sometimes work in reverse, on the complementing DNA strand. When you look at a protein's codons and see that 60% use a purine in the first base and but only 40% use a purine in the third base, this means that the reverse complement of the codon also has 60% purine content in base one and 40% purine content in base three. Example: the codon GCT (alanine) has a purine (G) followed by a pyrimidine (C) followed by a pyrimidine. The reverse complement codon, AGC (serine) also begins with a purine (A) and ends with a pyrimidine (C). This type of symmetry tends to be a confounding factor for programs like Glimmer that try to distinguish sense from antisense strands and coding from non-coding regions, and normal reading frames from nonsense frames.

Perhaps some real-world data will make this clearer. Below is a plot of AG1 (purine content at base one) versus GC1 (guanine plus cytosine, base one) for all codons of all genes of the soil bacterium Pseudomonas fluorescens PF0-1. Each point represents one gene's worth of data. For each data point, I simply went through all of that gene's codons and tallied up the A, G, C, and T at each base position, then found the average AG1 and GC1 for the gene in question. I did this for all 5,722 protein-coding genes in the genome. (Don't worry. Scripts do the whole thing in the blink of an eye. It takes less than 10 milliseconds to process one gene's worth of data.) Notice how the points cluster at y=0.6, meaning most genes have an average AG1 (first-base purine) content of around 60%.

AG1 (purine content, base one) versus overall G+C content, for N=5722 protein-coding genes of Pseudomonas fluorescens PF0-1. Each dot represents statistics for one gene. Notice that the dots tend to cluster at about y=0.6.

Now have a look at the graph below, which is the same kind of plot except we're looking at data for the third codon base. Here, the median y-value is only 0.453, meaning that purine content averages around 45% in the third base. That means on the opposite strand, in the same position, there's a purine ~55% of the time.
AG3 (purines, base three) versus G+C for all protein-coding genes (N=5722) in P. fluorescens PF0-1. The median y-value is 0.453. The long comet tail in the direction of low G+C could be due to "AT drift." It could also partly be due to incorrect annotation of genes as to "sense" versus "antisense" strands.

I ran some numbers and found that in P. fluorescens, 74% of genes have codons that are purine-heavy on the front (AG1 greater than 55%), in the normal reading direction, but most genes (53%) are also purine-heavy in base one when read in the reverse-complement sense. In fact, for every three protein genes that have AG1 greater than 55% and GC3 above 60% in the normal reading frame, there are two genes that meet the same criteria when translated in the reverse-complement frame. This means that for quite a few genes in Pseudomonas, the normal reading frame has similar compositional statistics to the reverse reading frame. (Note that codon base two tends to average 48% purines in the forward direction and 52% in the back direction.) To a program like Glimmer, many protein genes look surprisingly similar whether read from the sense strand of DNA or the antisense strand. Distinguishing sense from antisense is not trivial, in other words, although in an upcoming post I'll talk about a way to do it via codon bias.

To get a better idea of how universal this bidirectional codon symmetry might be, I obtained the codon statistics for 109 organisms and calculated the average base-3 composition stats, and came up with the following graph that plots AG3 (purines, base three) against overall genome A+T content:
Codon base three purine content versus genome A+T content for N=109 bacterial species. Every point represents aggregate codon stats for one organism. The size of each circle is proportional to genome size of the organism.

The size of each circle is proportional to the genome size of the organism in question. (As you can see, organisms with high genomic A+T content often tend to have smaller genomes.) Notice that base three tends to be a pyrimidine (AG3 less than 50%), in the normal reading frame, for almost all organisms. (That means it's a purine in the reverse reading frame.) A few circles appear above y=0.5, but not many, and not by much.

Bottom line: Distinguishing sense from antisense DNA is not a straightforward matter, since many codons have similar composition statistics in forward and backward frames. Breaking the deadlock might require finding a Shine Dalgarno sequence for one frame but not the other, or it could mean running a homology check (via BLAST) against similar genes in a database, but in that case you have to hope the strand assignment was correct in the database genes (which it often is not).

Incidentally, I know of no a priori reason why the wobble base should accumulate pyrimidines preferentially (even though that's what the data say). In theory, base three is (for most codons) a degenerate position and should be neutral (free to accumulate bases of any type). We know that base one tends to be a purine 60% of the time. The fact that base three is a pyrimidine 54.7% of the time is suspicious (arguably) and tends to imply that some genes are annotated backwards. As we'll see in a future post, the backwards-annotation problem isn't a huge issue in most organisms, but it's not negligible, either.

Saturday, May 10, 2014

The "Hypothetical Protein" Problem

Once a genome has been sequenced, identification of "coding" regions of DNA (and assingment of functions to those regions) is almost always done by software. The Glimmer (Gene Locator and Interpolated Markov Modeler) freeware package, available from Johns Hopkins University, is a widely used annotation system. It finds putative genes, identifies which strand they're on, locates start and stop codons, etc.

The technology is suprisingly good (Glimmer reputedly finds 99% of genes) but it also has serious limitations. Programs like Glimmer have a relatively easy time locating open reading frames in low-GC (low guanine, cytosine) genomes but have a more difficult time identifying the "sense" strand in high-GC genomes. Also, gene size matters: Glimmer is great at finding sizable genes but is less accurate with small genes. These shortcomings are not unique to Glimmer but apply to all gene-detection software that I'm aware of.

In almost every bacterial genome, 20% to 40% of genes cannot be identified as to function and are tagged "hypothetical protein." These genes tend to differ from other genes in various ways. Size, for example.

Gene size distribution for N=3412 genes of assigned function (blue) and N=793 "hypothetical protein" genes (red) in E. coli B.
The above graph shows the size distribution of protein-coding genes in E. coli B. Hypothetical protein genes (N=793) are shown in red; genes with an assigned function (N=3412) are in blue. You can see that at gene sizes below ~500 base pairs, genes tend (more often than not) to be labeled "hypothetical protein." In E. coli, hypothetical-protein genes also tend to have lower GC content in the third codon base (the so-called "wobble" base). Whereas GC3 has a median value of 56.8% in genes with assigned functions, GC3 has a median value of 51.6% in hypothetical proteins.

In most genes, codons tend to begin with a purine base (A or G) 60% of the time. In E. coli, it's 59.8% of the time. Genes in which AG1 is less than 50% are rare. In E coli, only 3.7% of function-assigned genes have AG1 under 50%, but 6.9% of hypothetical protein genes have AG1 under 50%, and hypotheticals are three times as likely to have AG1 (purine content, codon base one) less than AG2 (purine content, base two). The latter is an important "sanity check" of reading frame. One would only expect AG1 to be less than AG2 in genes with a frameshift error or incorrect reading-frame assignment.

If 6.9% of hypothetical protein genes in E. coli are in the wrong reading reading frame, that's over 100 genes. (Hypothetical genes comprise 19% of E. coli's 4205 CDS genes.)

When I made a cursory spot check of "hypothetical protein" genes in E. coli B, I quickly found a number of genes (ECB_00841, ECB_01484, ECB_01676, ECB_02804, ECB_03339, and others) that were incorrectly designated as to the sense versus antisense strand. These genes produced more and better BLAST hits when reverse-translated than when forward-translated. Some of them had stop codons in the middle, but that's okay. That just means they're pseudogenes. It's vitally important to correctly identify pseudogenes. (In most bacterial genomes, pseudogenes are vastly underreported, because they're hard to distinguish from non-coding regions.)

Bottom line, there's ample reason to believe a substantial fraction of the 793 "hypothetical protein" genes in E. coli are in fact either in the wrong reading frame, are pseudogenes, or have an improperly located start codon (or harbor other serious errors).

I don't think E. coli is exceptional. I'm convinced misannotation is a serious issue affecting 20% or more of all machine-annotated genes. Some annotation programs do a better job than others of hiding problems like overlap resolution and small-gene detection, but the problems are there, bigtime, and aren't likely to go away any time soon.

Friday, May 09, 2014

A JavaScript Tip for Biohackers

Today I want to give a short bio-hacking code lesson, and I want to keep it as simple as possible in case you're just starting to use JavaScript and want to see what kinds of things it can do. This lesson could prove handy to you even if you're not a gene hacker, because it shows how easy it is to manipulate text in the browser.

For this example, I am going to assume that you have a text file open in the browser. In particular, I'm going to assume the text file consists of a listing of two genes sequences in FASTA format, which is actually very simple: FASTA consists of a header line that starts with the > (greater than) angle bracket, followed by one or more lines of AGTC base sequence data (if it's a nucleotide sequence) or a bunch of letters (KPRMIV etc.) if it's a protein sequence. For example:

>M. leprae MBLr_2224...    <-- this is the header
ATGGCGGTGCTGGATGTC...      <-- this is the data

Your file might have several genes in FASTA format, one after the other. The question is: How can you read this data with JavaScript?

Actually, it's extremely easy to read text data. Open any text file with your browser, then open a JavaScript window: With Firefox, use Shift-F4 to bring up the Scratchpad, or with Chrome use Control-Shift-J. To read the text into a JavaScript variable called text, just type the following line into the script editor:

text = document.body.textContent;

Execute this code with Control-L in Firefox or by simply entering a carriage return in Chrome. (If Firefox fills your Scratchpad with text, bracketed by /* and */,  that's good; it means the code worked. Delete it and proceed.)

Suppose you have several FASTA records in a row and you want to have an array of gene data. The easy thing to do is "split" the records at each header, discarding the header:

r = />[^\n]+\n/g;
genes = text.split( r );

NOTE: To enter more than one line of code in the Chrome console, you have to hold the Shift key down before hitting Enter. Otherwise, Enter executes the code.

The first line defines a regular expression (r) for the pattern: "greater-than symbol followed by one or more non-newline symbols, followed by a newline." (The caret symbol ^ means to negate whatever's in between the square brackets.) When this code executes, genes will be an array of data, but because of the way split() works, the first item (item zero) in the array will be empty, so get rid of it with:

genes.shift();

Now the genes array will contain gene data. The data for gene No. 1 will be in gene[0], the data for gene No. 2 will be in gene[1], etc.

Incidentially, if you want an array of headers, just do:

headers = text.match( r );

No need to do headers.shift(). The match() operation creates an exact array.

If your genes are aligned, you can compare them, base to base, in a loop. (If your genes are not aligned, create an alignment using an online ClustalW alignment tool or using the popular Mega6 program.) The following loop construct compares the first 300 bases in two genes, and tallies the differences according to whether the difference occurred in codon base one, base two, or base three:

snp=[0,0,0]; // array to hold base 1,2,3 results
gene1 = genes[0];
gene2 = genes[1];
 
for (var i=0; i < 300; i++)
   snp[ i % 3 ] += gene1[i] != gene2[i];

You can display the results in the console simply by adding (on its own line) snp; or console.log(snp). Or if you want to see it in a dialog, execute alert(snp).

The final line of code deserves explanation. Results are placed in the snp (single nucleotide polymorphism) array according to whether the "hit" occurred in base 1, base 2, or base 3 of a codon. The i % 3 construct (i modulo 3) means divide i by 3 and throw away the "answer" but keep the remainder. (So for example, 5 % 3 equals 2, 6 % 3 equals zero, 7 % 3 equals 1, etc.) As i increments, i % 3 simply takes on values of 0, 1, 2, 0, 1, 2, etc.

On the right side of the equals sign we have gene1[i] != gene2[i]. This means we want to compare the two genes at an offset of i, and if they are not equal (!=), tally the resulting value (true, or 1) numerically. The numeric result is added to the appropriate slot in snp. Note that JavaScript treats true as having a numeric value of one and false as having a numeric value of zero. It knows to cast the boolean value to a number because of the += symbol (which means numerically add the following value to the existing value of the lefthand variable).

I hope this short tutorial wasn't too painful. If you're new to JavaScript, my advice is: Experiment in the console (Firefox's Scratchpad or Chrome's JS console) a lot, and get familiar with the string functions split() and match(), because they can be incredibly useful, not to mention speedy. Either of those two functions can take a string argument or a regex. But remember to put a 'g' after the regex if you want the operation to occur globally throughout the string. For example:

"abcdefabcdef".match( /abc/ )

will only match the first occurrence of abc, whereas

"abcdefabcdef".match( /abc/g )


will match both occurrences of abc and give you an array of matches.

Thursday, May 08, 2014

Ancient Drug Resistance Genes

Antibiotic-resistant bacteria have been in the news lately, with the usual scary talk about drug-resistant bacteria taking over the world, with a return to the Dark Ages if we don't Do Something.

Certain facts about drug resistance tend to get lost in these sorts of news stories. There's a common misconception that somehow the introduction of antibiotics (first in medicine, then in agriculture) caused bacteria to invent new drug-resistance genes out of nowhere. That's nonsense, of course. The Lederbergs showed in 1951 that drug resistance genes were/are preexisting, and did not come into being because of antibiotics. Rather, bacteria already equipped with such genes were able to survive treatment with antibiotics. Around 1968 it also became clear that bacteria can share drug resistance genes by exchange of extrachromosomal DNA (plasmids). Bacteria that don't have the magic genes can get them from their buddies.

But, so. How is it that the genes for antibiotic resistance already existed prior to the introduction of antibiotics into the food chain and into medicine? The answer is simple: Antibiotics are natural products produced by common soil and water microorganisms. Penicillin is produced by a mold; streptomycin is produced by the soil bacterium Streptomyces. Certainly hundreds (maybe thousands) of different kinds of antibiotics exist in the natural environment. They've been there for millions of years.

Mycobacterium tuberculosis ATCC 35801 (Erdman strain) has two "multidrug  resistance" genes, and they reside on the main chromosome (not on a plasmid). They're shared by other members of the Mycobacterium genus, including the leprosy bacterium, M. leprae. In the latter, one of the genes in question (MLBr_2224) is a pseudogene. It's interesting to compare this pseudogene with its counterpart, Erdman_0866, in Mycobacterium tuberculosis ATCC 35801. The two genes are nearly the same size: 1543 base pairs versus 1638. This is quite interesting in itself, inasmuch as most pseudogenes in M. leprae are truncated (averaging just 795 base pairs). But the comparison gets even more interesting when one does a side-by-side analysis of single nucleotide polymorphisms (changes to individual bases).

First I did a SNP comparison of the tuberculosis drug-resistance gene to its counterpart in M. indicus pranii, where there were 325 total base-pair differences, distributed amongst the 1st, 2nd, and 3rd base pairs of codons as:

98, 72, 155

This is the expected pattern: The largest number of mutations tends to accumulate in the third codon base (the so-called "wobble base"), because mutations in this base give a large percentage of synonymous codon changes owing to codon degeneracy. The next-highest number of mutations is expected in the first base, because there is some (but not much) degeneracy in that position. All mutations in the second base are non-synonymous and likely to affect enzyme function. Hence, the lowest number of mutations occurs in the second base.

When I ran the same check between the M. tuberculosis gene and its counterpart (the pseudogene) in M. leprae, I found that the mutational differences totaled 418 changes and segregated by base as:

119, 103, 196

Unexpectedly, the same pattern emerges. The reason this is unexpected is that one expects a pseudogene to contain frameshifts that would destroy the reading frame, mooting 1st/2nd/3rd-base comparisons. More generally, one expects massive random mutations all along a pseudogene's length (of the kind that would tend to make all three of the above numbers equal, even without frameshifts). So the fact that we still see the familiar pattern of 1st/2nd/3rd-base mutations is interesting.

The genes in question were unequal in length, so to do these comparisons I had to disregard a 112-base leader portion of the aligned genes. (I aligned the genes via ClustalW using Mega6.) I also ignored the unequal-length trailer portions of each gene, analyzing only the interior (aligned) portions from base 112 to base 1418. Also, to facilitate the analysis, I removed one adenine (representing a one-base insertion) in the M. leprae gene, at position 407, to restore the reading frame. But it's interesting that at that position there's a run of six adenines, indicative of a slippery site, of the kind associated with frameshift signalling.

The M. leprae pseudogene MLBr_2224 behaves as if it is still a normal gene, accumulating mutations in the "right places." An alternative explanation is that the M. leprae and M. tuberculosis genes were already in pretty much this orientation relative to each other before the two species diverged, and M. leprae simply accumulated very few additional mutations in the pseudogene after it became a pseudogene. (M. leprae is assumed to have experienced a massive pseudogenization event between 9 and 20 million years ago.)

For more about M. leprae's strange assortment of pseudogenes, see this post and also this post.