Showing posts with label DNA. Show all posts
Showing posts with label DNA. Show all posts

Tuesday, May 20, 2014

More Secrets of the Virus World

It's generally conceded that viruses evolve more rapidly than host cells, but the rates vary tremendously depending on the type of virus. Generally, large DNA viruses that infect algae (the phycodnavirus family) are considered to have some of the slowest rates of change, whereas the fastest-to-change viruses tend to be small RNA viruses that infect animal cells (e.g., HIV). In terms of substitutions per nucleotide per cell infection (s/n/c), one recent study found rates of 10−8 to 10−6 s/n/c for DNA viruses and 10−6 to 10−4 s/n/c for RNA viruses, which means the fastest-mutating viruses change 10,000 times faster than the slowest-mutating viruses.

Given the ultra-rapid rate of change of RNA viruses and their generally impressive level of adaptation to host-cell environments, one might expect a virus like HIV-2 to show a codon usage bias similar to that of the host. And that's approximately true.

HIV-2 codon usage (left), in DNA format (T for U), versus overall human-cell codon usage (right).

The above graph shows codon usage for HIV-2 on the left and codon usage for human cells on the right. (HIV is an RNA virus, but codons are shown here in DNA format, with T in place of U.) R-squared/adjusted comes to 0.2204, so we can't very well say confidently that the codon values are highly correlated. But if you look at the smaller bars (not the "peaky" ones), they tend to taper down on the left, just as on the right.

It might be instructive to go from one of the fastest-changing viruses in the biosphere (HIV) to one of the slowest, and see how its codon usage compares to that of its host. This time, we're looking at the large DNA virus known as PBCV-1 (left) versus its Chlorella host (an alga, right):

Codon usage in Paramecium bursaria Chlorella virus 1 (PBCV-1), left, and Chlorella variabilis strain NC64A, right.
These two data sets are not only not correlated, they appear to be anticorrelated, which is quite unexpected. Bear in mind, PBCV-1 is relatively large, with a genome of 330,601 base pairs encoding hundreds of proteins (and ten tRNAs). Thus the pattern shown here isn't likely to be random noise. Note that PBCV-1 has a genomic G+C content of 40%, versus 61% for the host, which is a pretty sizable separation. It's almost as if PBCV-1 has spent part of its life coexisting with an entirely different host.

Which brings me to the final and most intriguing (I might even say shocking) graphic, which compares codon usage in PBCV-1 virus with codon usage in Chlorella's own host, Paramecium.

Codon usage in PBCV-1 virus (left) and Paramecium (right).
Recall that when it is not free-living on its own, the tiny unicellular Chlorella alga has an endosymbiotic relationship with the comparatively much larger unicellular ciliate protist, Paramecium. That is to say, Chlorella can live inside Paramecium. Chlorella allows Paramecium to thrive in high-sunlight/low-nutrient conditions, whereas Paramecium, in return, gives the non-motile Chlorella free transportation and protection against viruses. (PBCV-1 can infect free-living Chlorella, but does not infect Chlorella living inside Paramecium.) As far as I know, no one has ever reported that PBCV-1 virus can infect Paramecium. Supposedly, it infects only free-living Chlorella And yet, we find that the pattern of codon usage in PBCV-1 is very strongly correlated with the pattern of codon usage in Paramecium. (R-squared/adjusted: 0.527.)

Paramecium filled with Chlorella cells.
This chart is a real shocker from a couple of standpoints. First, as I say, PBCV-1 virus is not known to infect Paramecium. And yet codon usage patterns in the virus are much more closely aligned to Paramecium's patterns than to Chlorella's. Notice that AAA is the No. 1 most-used codon in PBCV-1 as well as Paramecium. Seven of Paramecium's top ten codons are in PBCV-1's top ten.

Secondly, Paramecium doesn't use the standard genetic code! It uses the Ciliate Code (Translation Table 6), in which TAA and TAG encode glutamine instead of serving as stop codons. (TGA is the one and only stop codon in Table 6.) If Paramecium used the standard genetic code, the alignment of the two organisms would be even stronger.

Also interesting is that PBCV-1 and Paramecium are quite far apart in G+C content (the former is 40%, the latter is 28%).

Perhaps at some point in its past, PBCV-1 had a wider host range, one that included Paramecium. It's possible that even today, it has hosts other than Chlorella that have yet to be observed experimentally. Certainly, the pattern of codon usage is consistent with such an idea.

Sunday, May 18, 2014

Virus Genes Don't Come from Host Genes

There's a school of thought that says that viruses originated as escaped constellations of host genes. Virus genes have to originate from somewhere. One theory is that they started with host genes.

Trouble is, there's precious little evidence that viral genes originated from host genes, and plenty of evidence to the contrary. It may actually be that host genes came from viruses.

To say viral genes derive from host genes is like saying hemorrhoids derive from earlobes. Any resemblance is, at best, superficial.

In a previous post, I showed data for the relatively large phylogenetic distance between thymidine kinase genes in phages (viruses that attack bacteria) and their hosts. In one case, I showed that prophage genes (genes from viruses that have succeeded in integrating into the host DNA) are more similar to host genes than lytic-lifestyle phages, but even in the case of temperate phages, I think we have to be honest and say that a prophage is still an example of foreign DNA integrating into a host. (Prophage genes can usually be easily identified by their base composition, which differs noticeably from the base composition of host genes.) Once a prophage becomes fully integrated into the host, its DNA (under the influence of the host repairosome) will tend to ameliorate, taking on the base composition and other characteristics of the host DNA, making it superficially similar to host DNA.

What "other characteristics" does ameliorated DNA take on? Consider codon usage patterns. Recall that the genetic code is set up in such a way that most amino acids correspond to more than one codon (three-letter pattern) in the DNA. Leucine, for example, can be encoded six different ways (namely by base patterns CTA, CTG, CTT, CTC, TTA, and TTG). Likewise, alanine can be encoded four ways (GCA, GCG, GCT, or GCC). But specific organisms develop specific patterns of codon use, preferring certain synonyms over others. For example, Clostridium botulinum (the food-poisoning bug) overwhelmingly prefers to use TTA for leucine (rarely using the other 5 synonyms), whereas E. coli strongly prefers to use CTG (choosing it four-to-one over the next-most-used leucine codon). These codon preference patterns are highly specific to a given species and are thought to be related to the numbers and types of available transfer RNAs (tRNA) in the cell, although frankly it's still an open question whether codon usage adapted to tRNA availability or the reverse.

The idea that viruses mutate rapidly and evolve in close harmony with the hosts on which their reproduction depends suggests that virus codon preferences should match those of the host. (This would be particularly true if virus genes come from host genes.) Remember that a virus has no ribosomal machinery and must rely on the host's protein-making equipment in order to survive. Therefore it would make sense for a virus to adapt its codon usage patterns to the patterns most favored by the host equipment.

That's not what we find. When we look at the codon usage patterns of phage T4 (a classic enterobacterial phage) versus E. coli's codon usage, we find that they differ substantially:

Codon usage frequencies for T4 phage (left) and E. coli B (right).
In this graphic, host-cell codon usage frequencies are on the right while corresponding T4 virus frequencies are on the left. Note that the T4 chromosome encodes 274 protein genes, encompassing over 50,000 codons, so the graphic is based on fairly solid numbers; variations from E. coli can't be accounted for simply by statistical noise.

T4's codon preferences are so different from the host cell's, the T4 phage brings with it genes for 8 types of tRNA. But the differences in codon usage go well beyond 8 codons, so the presence of tRNA genes in T4 DNA doesn't, by itself, explain the divergence of the data.

But what about temperate phages, like Fels-2 (a prophage in the Salmonella genome)? Since prophage genes are, in effect, a permanent part of the host genome, we would expect to see some amelioration of codon usage. And in fact, that is what we do see:

Codon usage in Fels-2 phage (left) and Salmonella typhimurium LT2 (right).
Here, we see that the codon usage patterns of Fels-2 and its host are quite similar. The differences are easily accounted for by the fact that Fels-2 has only 47 protein-coding genes, and the amino acid composition of those genes is probably different enough from "average" host genes to sway the usage stats to the degree shown here. Nevertheless, codon usage patterns aren't sufficient to tell us where Fels-2 genes came from originally. That's still an open question. Like the Martians in War of the Worlds, Fels-2 genes probably came from "somewhere else."

Robbie: "What, you mean, like Europe?"

Tom Cruise character: "No, Robbie. Not like Europe."

Saturday, May 17, 2014

Evolution of Prophage Genes

Viruses have two modes of reproductive existence. In the familiar lytic cycle, the virus infects a cell, replicates itself until the cell bursts, and hundreds (or thousands) of virions are produced. But there is also a lysogenic mode of viral existence, in which the virus inserts a copy of its DNA in the host's own DNA. The viral DNA thus inserted becomes known as a prophage, which can remain dormant for long periods of time. The prophage can often be induced to enter a lytic cycle by exposure of cells to hydrogen peroxide or Mitomycin C. (Induction of phages in this fashion is thought to occur when a phage repressor protein is cleaved by recA after the latter is upregulated in the SOS response.)

Prophage genes are seen in a wide variety of bacteria (a 2008 paper estimated that over 60% of bacterial genomes contain prophage genes), and in fact human DNA is thought to contain at least 8% retroviral gene remnants. There's reason to suspect that certain large DNA animal viruses (such as herpes and vaccinia) have a lysogenic cycle. Certainly, viruses like varicella zoster (which can produce shingles many years after a person's initial infection) can remain dormant for decades before suddenly undergoing induction to a lytic phase.

Viruses that live an exclusively lytic lifecycle have relatively few opportunities to co-evolve with the host, because they spend little time in the host. Such a virus might spend years "hanging around" in the environment before encountering a host cell; then the lytic reproductive cycle may last only minutes or hours, and it's back to "hanging around" in the environment.

The situation is much different for a temperate virus (i.e., one that has a lysogenic cycle). A lysogenic virus essentially becomes an integral, first-class component of the host DNA and undergoes the same replication and repair processes that apply to host DNA. Accordingly, we should expect to see a much different pattern of evolution in the genes of lysogenic viruses (or prophages). And indeed we do.

The phylogenetic tree below was prepared using viral (phage) and bacterial genes for DNA adenine methylase (dam), an enzyme involved in DNA repair and replication. What's interesting about this gene is that many bacteria have their own (native) copies of this gene plus a prophage copy. And they differ, but not as much as, say, lytic-phage thymidine kinase versus native bacterial TK. (I showed phylo-trees for viral and bacterial TK enzymes in a prior post. If you'll recall, these enzymes differ so drastically that it's not at all clear that one derives from the other, ancestrally.)

DNA adenine methylase genes from three enteric bacteria and two phages (marked with asterisks). The top branch shows very close homology between prophage genes and their bacterial paralogues. The bottom branch shows that the native bacterial isoform of the enzyme is not as closely related to the prophage version(s).
With the dam genes, we see an interesting segregation pattern. There are two main branches to the phylo-tree. In the upper branch are the phage dam genes along with bacterial paralogues of these genes. The bottom branch shows how the non-paralogous (non-prophage) dam genes segregate.

To make these relationships clearer, here's a chart showing the overall G+C content as well as the GC3 (G+C content at codon base 3) for the various genes. The entries shaded in grey represent prophage genes. Notice that the G+C percentages are significantly lower for the prophage genes, but are higher than in free-living lytic-cycle phages (where GC3, in particular, is often less than 20%).


DNA adenine methylase genes for enteric bacteria and their temperate phages. Base-composition stats for prophage isoforms are shown in grey.
Organism Gene G+C GC3
Shigella sp. strain D9 ZP_05434596.1 49.60% 55.90%
E. coli EHW52521.1 49.60% 55.90%
S. enterica AAL22346.1 49.10% 54.50%
E. coli EHW55384.1 47.20% 47.90%
Salmonella phage RE-2010 YP_007003503.1 46.50% 46.20%
Shigella sp. strain D9 EGJ07993.1 46.20% 46.30%
S. enterica ETB92379.1 46.20% 43.00%
Fels-2 phage YP_001718754.1 46.20% 43.00%

If you compare the phylo tree shown further above with the phylo tree in my earlier post about thymidine kinase genes, you'll note that the prophage dam genes cluster very tightly with bacterial versions of these genes. That's because, as a fully integrated part of the genome, the prophage genes benefit from the host's DNA repairosome. They evolve gradually over long periods of time by the usual mechanisms. The genes are notably host-like because they're continuously repaired and groomed in the same manner as host DNA.

The takeaway here is: If you create a phylo-tree for a set of genes from hosts and viruses, and the genes cluster tightly with host versions, you're probably looking at the result of longterm lysogeny. On the other hand, if the virus genes do not cluster with host genes (as they usually don't!), that means you're looking at viruses that have a predominantly lytic mode of existence; viruses that probably got their genes from a far-distant ancestor of the modern-day host, if not from a primordial precellular precursor of some kind.

Wednesday, May 14, 2014

Where Do Virus Genes Come From?

There's a memorable moment in War of the Worlds (Spielberg version) when the Tom Cruise character tries to explain to his son, while driving a minivan, that they have to leave town immediately because evil, marauding machines "from somewhere else" are on the rampage. Robbie (the son) says: "What, you mean, like, from Europe?"

That's the image that comes to mind when I try to explain where virus genes come from. They don't come from the mother ship (the host). Oh sure, in some cases they clearly do derive from host genes. But in most cases, they clearly don't. The overwhelming majority of viral genes have no counterparts in host cells, and even for those that do, the genes in question are rarely true host orthologues.

"No, Robbie, not like Europe."
Where do they come from, then?

Short answer: Somewhere else.

Using a program like the excellent (and free) Mega6 ("Molecular Evolutionary Genetic Analysis") you can easily create phylogenetic trees from genetic data (FASTA sequences, protein or DNA), and when you do this for genes that occur in both viruses and host cells, you can see how they separate in phylo-space.

Example: I decided to look at the gene for thymidine kinase, which occurs in most living things plus a certain number of undead, maybe not-quite-living (in the usual sense) things known as viruses. Thymine (T) is, of course, an essential ingredient of DNA. Thymidine kinase converts thymidine (thymine bonded to deoxyribose, on the left, below) to the phosphorylated form (TMP, right) so it can participate in DNA synthesis.  

It's important to note that thymidine (above, left) is not a synthesis product but a breakdown product. The biosynthetic pathway for TMP actually starts with dUMP, which is converted to TMP via thymidylate synthetase, an entirely different enzyme. Where does thymidine come from, then, and why does a cell need thymidine kinase? The answer is that thymidine is a breakdown product of DNA. When you eat plant or animals cells, the DNA in those cells gets broken down, and thymidine is one of the breakdown products. For cells, thymidine is a valuable product to have, so thymidine kinase recovers free thymidine via the reaction shown above. This is a scavenging pathway, in other words. Most organisms have it, but some don't (e.g., Pseudomonas lacks it).

When a virus attacks a cell, there's a lot of DNA turnover as the virus prepares to manufacture its own DNA, so thymidine kinase is a handy enzyme to have around if you're a virus. Many viruses, as a result, bring their own copy of the TK enzyme gene. But is the viral TK gene derived (in some kind of ancestral way) from the host cell's own TK gene? Not necessarily.

When you gather up the amino acid sequence data for a bunch of TK genes from bacteria and the phages (viruses) associated with them, you find that, phylogenetically speaking, the phage/viral TK genes are not very similar to the host genes.

Thymidine kinase phylo tree for enteric bacteria and their phages. (Click to enlarge.) Note that the phage genes (top cluster) segregate from the host genes in the lowermost branch. Interestingly, TK genes of various alphaproteobacteria of the Rhizobiales class cluster near the phage genes (much nearer than the enteric bacteria). This suggests a primordial origin of the phage genes rather than recent acquisition of TK from host cells.

In the above tree, phage (viral) genes cluster at the top. Host-cell kinases cluster at the bottom. The two clusters may be related to a distant ancestor (not shown), but one thing is certain: the phage versions of this gene are not simply a slight modification of the host gene. We know that's true because, remarkably, the thyK genes of certain alphaproteobacteria (Agrobacterium and its relatives) cluster with the phage genes, even though the bacteriophages are adapted to E. coli and Salmonella (and closely related enterics). In theory, the phage genes should cluster with the enteric bacteria, not with Agrobacterium and Rhizobium.

Where do the phage genes come from? Some have speculated that these phages originated with escaped bacterial secretion-system cassettes. That may well be, but the escape event had to have occurred many hundreds of millions of years ago. The T4 thymidine kinase gene has only 62% sequence homology with the E. coli gene. T4's version of the gene has a G+C content of 34%, with GC3 (third codon base) of just 19%. The E. coli version of the gene has overall G+C of 42% and GC3 of 36%. While it's generally conceded that evolution of viral genes occurs faster than host genes (particularly for RNA viruses and single-stranded DNA viruses), DNA viruses like T4 can't evolve outside of the host cell, as far as we know, and although DNA viruses may evolve faster than host DNA, they don't evolve thousands of times faster. (RNA viruses do evolve thousands of times faster, but that's an entirely different matter.)

To me, the above tree says that enteric phages diverged from the common ancestor of alphaproteobacteria and gammaproteobacteria (the latter include the enteric bacteria). We're talking hundreds of millions of years ago. (Note that E. coli and its closest relatives are thought to have diverged 140 million years ago.) A recent transfer of thymidine kinase genes from enteric bacteria to their phages is not credible. It's far more likely that an ancient precursor of today's T4 phage (and similar enteric phages) had the gene, and passed it down through the ages as the phages adapted to new hosts (first the alphaproteobacteria, then the phylogenetically newer enterics).

In tomorrow's post, I want to explain in detail how the above phylo tree was made and show you how you can make your own phylogenetic trees using the popular Mega6 program. If you've never used Mega6, you're in for a treat.  

Tuesday, May 13, 2014

Anaerobic Catalases and Codon Pseudodegeneracy

Today I want to demolish two myths, one being that strict anaerobes (such as members of the Clostridia, a class of bacteria that includes the botulism organism as well as the C. diff bug that sickens half a million people a year in the U.S.) lack a catalase enzyme, and the other being that codons have very little degeneracy in base one. Don't worry, I'll parse all this for you so that even if you're not a bio-geek, you'll get it.

Bacteriology students are taught from Day One that strict anaerobes are killed by oxygen because they lack a catalase enzyme. Catalase is the nearly universal enzyme that breaks down peroxides into oxygen and water. The catechism about anaerobes being catalase-negative is true for some anaerobes, but by no means all. In fact, if you go to UniProt.org and do a search on "catalase clostridium" you'll get a long list of hits representing anaerobes that do, in fact, have a perfectly normal catalase enzyme. Some anaerobes have the classic katG (E. coli) version of catalase while others (like C. difficile) have a special manganese-containing catalase. Pathogenic anaerobes probably use the enzyme to combat the respiratory burst that accompanies engulfment by white blood cells. I've written about anaerobic  catalases before; for an overview of the enzymology and phylogenetic breakdown by enzyme type, start here.

Methanospirillum (a strict anaerobe found
in sewage) produces colonies with
characteristic striations spaced two
cell-lengths apart.
I thought it would be interesting to compare the katG genes of two anaerobes to see how they differ. I decided to focus on C. botulinum B (the nasty food-poisoning organism) and the archeon Methanospirillum hungatei strain JF-1, which is named after a professor I had in grad school, Bob Hungate, a true pioneer in the study of anaerobic bacteria.

C. botulinum and M. hungatei are worlds apart, taxonomically. The genome of the former has a G+C (guanine, cytosine) content of just 27%, while the latter has G+C of 45%. Comparing their catalases, I found 717 nucleotide differences between the two genes (which were 2193 bases long in one case and 2157 in the other case). On the surface, the genes seem (from a DNA sequence point of view) quite far apart phylogenetically. But after performing an alignment in Mega6, I was able to study the codons base by base and found that of 717 differences, 500 changes were synonoymous (resulting in no amino acid substitution) while only 217 changes were non-synonymous. Not unexpectedly, most of the synonymous changes were due to differences in the third codon base (where the genetic code is highly degenerate; see below).

Codon Usage in Clostridium botulinum B
Codon Usage: The Standard Code (transl_table=1)
CCA(P) 1.45%
CCG(P) 0.10%
CCT(P) 1.00%
CCC(P) 0.06%

CGA(R) 0.08%
CGG(R) 0.01%
CGT(R) 0.18%
CGC(R) 0.02%

CAA(Q) 2.01%
CAG(Q) 0.25%
CAT(H) 1.10%
CAC(H) 0.16%

CTA(L) 0.77%
CTG(L) 0.10%
CTT(L) 1.67%
CTC(L) 0.06%
GCA(A) 2.66%
GCG(A) 0.20%
GCT(A) 2.18%
GCC(A) 0.19%

GGA(G) 3.30%
GGG(G) 0.47%
GGT(G) 2.08%
GGC(G) 0.40%

GAA(E) 6.40%
GAG(E) 1.11%
GAT(D) 5.12%
GAC(D) 0.62%

GTA(V) 2.79%
GTG(V) 0.45%
GTT(V) 2.81%
GTC(V) 0.14%
ACA(T) 2.36%
ACG(T) 0.16%
ACT(T) 2.22%
ACC(T) 0.19%

AGA(R) 2.43%
AGG(R) 0.31%
AGT(S) 1.89%
AGC(S) 0.48%

AAA(K) 7.10%
AAG(K) 2.17%
AAT(N) 6.06%
AAC(N) 0.89%

ATA(I) 5.68%
ATG(M) 2.50%
ATT(I) 4.01%
ATC(I) 0.42%
TCA(S) 2.10%
TCG(S) 0.12%
TCT(S) 1.68%
TCC(S) 0.12%

TGA(*) 0.02%
TGG(W) 0.69%
TGT(C) 1.02%
TGC(C) 0.25%

TAA(*) 0.24%
TAG(*) 0.07%
TAT(Y) 3.61%
TAC(Y) 0.55%

TTA(L) 5.68%
TTG(L) 0.74%
TTT(F) 3.74%
TTC(F) 0.57%
Pink-highlighted codons are examples of where the genetic code is degenerate for base one. 
A change of CGA to AGA (for example) results in no amino acid change: the 
coded-for amino acid is arginine in either case.


"Degeneracy" means a change in a particular base results in no change of amino acid. An example is the codon AAA (see above), which is the code for lysine (K). Changing the last base to G (resulting in a codon of AAG) still produces lysine in the resulting protein. Codons that begin with 'AA' are two-fold degenerate, because two versions (AAA and AAG) produce the same amino acid. A codon that begins with 'CC' is said to be four-fold degenerate, because any of four different endings (CCA, CCT, CCG, CCC) produce the same amino acid (P, proline).

What's less obvious is that a change in the first (rather than the third) base of a codon can produce synonyms as well. For example, changing CGA to AGA results in the same amino acid: arginine (R). Likewise, CTA changing to TTA still results in leucine (L).

But there's more degeneracy to  the first codon base than just the classic cases outlined in pink in the above table. A change of CTA to ATA produces a switch from leucine to isoleucine, which is tantamount to no change at all, because leucine and isoleucine are isomers (both have the same chemical formula; they differ slightly in arrangement).


Likewise, a change from CTx to GTx produces a change from leucine to valine, which are workalikes. One could even argue that a change from CTT or CTC to TTT or TTC (leucine to phenylalanine) is functionally a degenerate change, because leucine and phenylalanine are nonpolar, hydrophobic amino acids with similar chemical properties.

So in fact, many changes to the first base of a codon can be considered pseudodegenerate, if you will, by virtue of producing non-synonymous changes that are tantamount to very little actual chemical change. We would expect to see a significant number of such changes in mutations affecting base one of codons. And we do.

In the two catalase genes, I looked at changes involving just the first base of a codon and found the following changes:

Synonymous:
AGA<-->CGA (R:R)
AGA<-->CGA (R:R)
AGA<-->CGA (R:R)
TTA<-->CTA (L:L)
TTG<-->CTG (L:L)
TTG<-->CTG (L:L)
TTG<-->CTG (L:L)
TTG<-->CTG (L:L)

Non-Synonymous:
AAA<-->CAA (K:Q)
AAA<-->GAA (K:E)
AAA<-->GAA (K:E)
AAG<-->CAG (K:Q)
AAG<-->CAG (K:Q)
AAT<-->CAT (N:H)
AAT<-->GAT (N:D)
AAT<-->GAT (N:D)
ATT<-->CTT (I:L) *
ATT<-->GTT (I:V) *
CAT<-->GAT (H:D)
CTG<-->ATG (L:M)
CTT<-->ATT (L:I) *
GCA<-->TCA (A:S)
TAT<-->CAT (Y:H)
TCA<-->GCA (S:A)
TCT<-->CCT (S:P)
TTT<-->CTT (F:L) *


Notice that 4 out of 18 non-synonymous changes (see asterisks) fall in the category of pseudodegeneracies. That's slightly higher than the 17% we would expect by chance. So instead of 8 out of 26 first-base changes being degenerate (in the catalase genes), more like 12 out of 26 are either synonymous or near-synonymous. Bottom line, roughly half of first-base mutations are effectively synonymous, at least in this example.

Monday, May 12, 2014

Problems with the Tuberculosis Genome

These days, most genes, in most sequenced genomes, are machine-annotated with a minimum of human intervention, and as a result around 30% of gene annotations are inaccurate, either as to functional assignment or as to reading frame.

It's not hard to find serious errors in published genomes. In the genome of Mycobacterium tuberculosis ATCC 35801, for example, there's a 17-kilobase-pair section of the genome that's full of errors (see below).

A view of two tuberculosis bacterial genomes showing a 17,000-base region (denoted by pink) of 100% sequence similarity. Notice that even though the genomes are 100% sequence-identical in the region, the number, strand orientation, and sizes of genes differ. The yellow gene in the top panel is identified by GeneMarkS+ as "secretion protein EspK," while the yellow gene in the lower panel is identified as DNA Ligase (ligA) and is much smaller (and occurs on the opposite strand).

In this graphic, the same region is shown for two strains of M. tuberculosis. On top is M. tuberculosis Strain EAI5 and on the bottom is M. tuberculosis ATCC 35801. To browse the genomes in your own browser, go to this link and click the pink Run GEvo Analysis! button. When the panels appear, you'll be able to click on individual genes to see what they (supposedly) are.

The yellow-colored gene in the top panel is annotated as "secretion protein EspK," while the corresponding region in the bottom genome has two genes (green and yellow) on opposing strands of DNA, with one gene (in green) given as a "hypothetical protein" and the other (in yellow) as "DNA ligase." Note that the entire region covered by the pink bands is 100% identical from organism to organism in terms of DNA sequence. And yet one annotation program found 13 genes while the other found 17.

Sadly, this is not an unusual situation.

How can you tell which annotations are correct? Unfortunately, it takes some investigation. Consider the two yellow genes. They can't both be correct as shown, because one is twice as long as the other and each is on a different strand of the DNA! And yet, if you obtain the DNA sequence of each, and use it as a query sequence in a BLAST search, you'll come up with "good" hits for each gene (because there are other incorrectly identified genes in public databases, matching each query).

It turns out the "secretion protein EspK" (the large yellow gene in the top panel) is correctly identified. To determine this, I downloaded the FASTA sequence for the large gene, plus the sequence for the same gene in the same location in the M. canetti CIPT genome (identified as "hypothetical alanine and proline rich protein"), plus the same gene in M. canetti CIPT 140070010, with the idea of identifying the differences in the (aligned) DNA sequences with respect to their codon positions.

When I had a script identify mutations by codon location, the number of differences at the three possible codon locations, were:

base 1: 4 changes
base 2: 2 changes
base 3: 19 changes

This is exactly the pattern you would expect if the reading frame is correct. Base 3 (the wobble base) tends to accumulate the majority of mutations, because changes in a codon's third base usually result in no amino acid change (due to the genetic code's degeneracy). Base 1 has a small amount of degeneracy and thus has the next-highest number of mutations. Base 2 has no degeneracy; all changes to this base result in a different amino acid. (All changes are non-synonymous, in other words.) 

I repeated this check using the so-called "DNA ligase" gene of Mycobacterium tuberculosis ATCC 35801, but using sequence data from the bottom strand of DNA, with 1329 bases' worth of additional 3'-end sequence data in order to cover the same amount of DNA as the other sequences. I got difference data of 3, 1, and 8 for bases one, two, and three. This verified that the active gene is actually on the bottom strand.  ERDMAN_4254 is incorrectly annotated as a top-strand DNA ligase. Instead of two genes, ERDMAN_4253 and ERDMAN_4254 (on two opposing strands), Mycobacterium tuberculosis ATCC 35801 should be shown as having a single large gene on the minus-one strand. The correct identification (based on more than thirty E=0, 100% identity hits in other strains) is, in fact, "secretion protein EspK."

Unfortunately, at least 3 other tuberculosis strains, namely M. tuberculosis GuangZ0019, plus Strain CCDC5180 and Strain CCDC5079, also harbor an incorrect ligA annotation, and Mycobacterium isn't the only organism with a "bogus ligA" problem. Micromonospora sp. M42 also has a fake ligA gene and I'm certain there are others.

Bottom line, don't believe everything you read in genome annotations. The annotation may say "DNA ligase," but you could actually be looking at something else entirely.




Sunday, May 11, 2014

Making Sense of Antisense

One reason genomes are so poorly annotated is that annotation software (of the Glimmer variety) gets easily confused by high-GC-content genome data. When a genome is high in guanine and cytosine, relatively few stop codons are present in alternate reading frames. (Recall that DNA is read in triplets of letters, called codons: AAA, AAC, AAT, AGT, etc. There are 6 possible reading frames for any given segment of DNA, representing 3 forward reading frames and 3 backward frames.) Stop codons (TGA, TAG, TAA) are mostly composed of A and T, not G and C. But also, it so happens that most protein-coding genes follow a certain pattern of codon construction. The first base in a 3-letter codon is usually A or G (about 60% of the time, in most genes, in most organisms). The second base is highly variable in all respects. The third base is usually reflective of overall genome composition: If the genome is high in G and C, the third base of each codon will tend to be G or C. (This happens almost all the time in Streptomyces, for example, where the third base G+C content is 97%.) If the genome is high in A and T, the third codon base will be high in A and T.

What's perhaps unexpected is that the same compositional pattern can sometimes work in reverse, on the complementing DNA strand. When you look at a protein's codons and see that 60% use a purine in the first base and but only 40% use a purine in the third base, this means that the reverse complement of the codon also has 60% purine content in base one and 40% purine content in base three. Example: the codon GCT (alanine) has a purine (G) followed by a pyrimidine (C) followed by a pyrimidine. The reverse complement codon, AGC (serine) also begins with a purine (A) and ends with a pyrimidine (C). This type of symmetry tends to be a confounding factor for programs like Glimmer that try to distinguish sense from antisense strands and coding from non-coding regions, and normal reading frames from nonsense frames.

Perhaps some real-world data will make this clearer. Below is a plot of AG1 (purine content at base one) versus GC1 (guanine plus cytosine, base one) for all codons of all genes of the soil bacterium Pseudomonas fluorescens PF0-1. Each point represents one gene's worth of data. For each data point, I simply went through all of that gene's codons and tallied up the A, G, C, and T at each base position, then found the average AG1 and GC1 for the gene in question. I did this for all 5,722 protein-coding genes in the genome. (Don't worry. Scripts do the whole thing in the blink of an eye. It takes less than 10 milliseconds to process one gene's worth of data.) Notice how the points cluster at y=0.6, meaning most genes have an average AG1 (first-base purine) content of around 60%.

AG1 (purine content, base one) versus overall G+C content, for N=5722 protein-coding genes of Pseudomonas fluorescens PF0-1. Each dot represents statistics for one gene. Notice that the dots tend to cluster at about y=0.6.

Now have a look at the graph below, which is the same kind of plot except we're looking at data for the third codon base. Here, the median y-value is only 0.453, meaning that purine content averages around 45% in the third base. That means on the opposite strand, in the same position, there's a purine ~55% of the time.
AG3 (purines, base three) versus G+C for all protein-coding genes (N=5722) in P. fluorescens PF0-1. The median y-value is 0.453. The long comet tail in the direction of low G+C could be due to "AT drift." It could also partly be due to incorrect annotation of genes as to "sense" versus "antisense" strands.

I ran some numbers and found that in P. fluorescens, 74% of genes have codons that are purine-heavy on the front (AG1 greater than 55%), in the normal reading direction, but most genes (53%) are also purine-heavy in base one when read in the reverse-complement sense. In fact, for every three protein genes that have AG1 greater than 55% and GC3 above 60% in the normal reading frame, there are two genes that meet the same criteria when translated in the reverse-complement frame. This means that for quite a few genes in Pseudomonas, the normal reading frame has similar compositional statistics to the reverse reading frame. (Note that codon base two tends to average 48% purines in the forward direction and 52% in the back direction.) To a program like Glimmer, many protein genes look surprisingly similar whether read from the sense strand of DNA or the antisense strand. Distinguishing sense from antisense is not trivial, in other words, although in an upcoming post I'll talk about a way to do it via codon bias.

To get a better idea of how universal this bidirectional codon symmetry might be, I obtained the codon statistics for 109 organisms and calculated the average base-3 composition stats, and came up with the following graph that plots AG3 (purines, base three) against overall genome A+T content:
Codon base three purine content versus genome A+T content for N=109 bacterial species. Every point represents aggregate codon stats for one organism. The size of each circle is proportional to genome size of the organism.

The size of each circle is proportional to the genome size of the organism in question. (As you can see, organisms with high genomic A+T content often tend to have smaller genomes.) Notice that base three tends to be a pyrimidine (AG3 less than 50%), in the normal reading frame, for almost all organisms. (That means it's a purine in the reverse reading frame.) A few circles appear above y=0.5, but not many, and not by much.

Bottom line: Distinguishing sense from antisense DNA is not a straightforward matter, since many codons have similar composition statistics in forward and backward frames. Breaking the deadlock might require finding a Shine Dalgarno sequence for one frame but not the other, or it could mean running a homology check (via BLAST) against similar genes in a database, but in that case you have to hope the strand assignment was correct in the database genes (which it often is not).

Incidentally, I know of no a priori reason why the wobble base should accumulate pyrimidines preferentially (even though that's what the data say). In theory, base three is (for most codons) a degenerate position and should be neutral (free to accumulate bases of any type). We know that base one tends to be a purine 60% of the time. The fact that base three is a pyrimidine 54.7% of the time is suspicious (arguably) and tends to imply that some genes are annotated backwards. As we'll see in a future post, the backwards-annotation problem isn't a huge issue in most organisms, but it's not negligible, either.

Saturday, May 10, 2014

The "Hypothetical Protein" Problem

Once a genome has been sequenced, identification of "coding" regions of DNA (and assingment of functions to those regions) is almost always done by software. The Glimmer (Gene Locator and Interpolated Markov Modeler) freeware package, available from Johns Hopkins University, is a widely used annotation system. It finds putative genes, identifies which strand they're on, locates start and stop codons, etc.

The technology is suprisingly good (Glimmer reputedly finds 99% of genes) but it also has serious limitations. Programs like Glimmer have a relatively easy time locating open reading frames in low-GC (low guanine, cytosine) genomes but have a more difficult time identifying the "sense" strand in high-GC genomes. Also, gene size matters: Glimmer is great at finding sizable genes but is less accurate with small genes. These shortcomings are not unique to Glimmer but apply to all gene-detection software that I'm aware of.

In almost every bacterial genome, 20% to 40% of genes cannot be identified as to function and are tagged "hypothetical protein." These genes tend to differ from other genes in various ways. Size, for example.

Gene size distribution for N=3412 genes of assigned function (blue) and N=793 "hypothetical protein" genes (red) in E. coli B.
The above graph shows the size distribution of protein-coding genes in E. coli B. Hypothetical protein genes (N=793) are shown in red; genes with an assigned function (N=3412) are in blue. You can see that at gene sizes below ~500 base pairs, genes tend (more often than not) to be labeled "hypothetical protein." In E. coli, hypothetical-protein genes also tend to have lower GC content in the third codon base (the so-called "wobble" base). Whereas GC3 has a median value of 56.8% in genes with assigned functions, GC3 has a median value of 51.6% in hypothetical proteins.

In most genes, codons tend to begin with a purine base (A or G) 60% of the time. In E. coli, it's 59.8% of the time. Genes in which AG1 is less than 50% are rare. In E coli, only 3.7% of function-assigned genes have AG1 under 50%, but 6.9% of hypothetical protein genes have AG1 under 50%, and hypotheticals are three times as likely to have AG1 (purine content, codon base one) less than AG2 (purine content, base two). The latter is an important "sanity check" of reading frame. One would only expect AG1 to be less than AG2 in genes with a frameshift error or incorrect reading-frame assignment.

If 6.9% of hypothetical protein genes in E. coli are in the wrong reading reading frame, that's over 100 genes. (Hypothetical genes comprise 19% of E. coli's 4205 CDS genes.)

When I made a cursory spot check of "hypothetical protein" genes in E. coli B, I quickly found a number of genes (ECB_00841, ECB_01484, ECB_01676, ECB_02804, ECB_03339, and others) that were incorrectly designated as to the sense versus antisense strand. These genes produced more and better BLAST hits when reverse-translated than when forward-translated. Some of them had stop codons in the middle, but that's okay. That just means they're pseudogenes. It's vitally important to correctly identify pseudogenes. (In most bacterial genomes, pseudogenes are vastly underreported, because they're hard to distinguish from non-coding regions.)

Bottom line, there's ample reason to believe a substantial fraction of the 793 "hypothetical protein" genes in E. coli are in fact either in the wrong reading frame, are pseudogenes, or have an improperly located start codon (or harbor other serious errors).

I don't think E. coli is exceptional. I'm convinced misannotation is a serious issue affecting 20% or more of all machine-annotated genes. Some annotation programs do a better job than others of hiding problems like overlap resolution and small-gene detection, but the problems are there, bigtime, and aren't likely to go away any time soon.

Friday, May 09, 2014

A JavaScript Tip for Biohackers

Today I want to give a short bio-hacking code lesson, and I want to keep it as simple as possible in case you're just starting to use JavaScript and want to see what kinds of things it can do. This lesson could prove handy to you even if you're not a gene hacker, because it shows how easy it is to manipulate text in the browser.

For this example, I am going to assume that you have a text file open in the browser. In particular, I'm going to assume the text file consists of a listing of two genes sequences in FASTA format, which is actually very simple: FASTA consists of a header line that starts with the > (greater than) angle bracket, followed by one or more lines of AGTC base sequence data (if it's a nucleotide sequence) or a bunch of letters (KPRMIV etc.) if it's a protein sequence. For example:

>M. leprae MBLr_2224...    <-- this is the header
ATGGCGGTGCTGGATGTC...      <-- this is the data

Your file might have several genes in FASTA format, one after the other. The question is: How can you read this data with JavaScript?

Actually, it's extremely easy to read text data. Open any text file with your browser, then open a JavaScript window: With Firefox, use Shift-F4 to bring up the Scratchpad, or with Chrome use Control-Shift-J. To read the text into a JavaScript variable called text, just type the following line into the script editor:

text = document.body.textContent;

Execute this code with Control-L in Firefox or by simply entering a carriage return in Chrome. (If Firefox fills your Scratchpad with text, bracketed by /* and */,  that's good; it means the code worked. Delete it and proceed.)

Suppose you have several FASTA records in a row and you want to have an array of gene data. The easy thing to do is "split" the records at each header, discarding the header:

r = />[^\n]+\n/g;
genes = text.split( r );

NOTE: To enter more than one line of code in the Chrome console, you have to hold the Shift key down before hitting Enter. Otherwise, Enter executes the code.

The first line defines a regular expression (r) for the pattern: "greater-than symbol followed by one or more non-newline symbols, followed by a newline." (The caret symbol ^ means to negate whatever's in between the square brackets.) When this code executes, genes will be an array of data, but because of the way split() works, the first item (item zero) in the array will be empty, so get rid of it with:

genes.shift();

Now the genes array will contain gene data. The data for gene No. 1 will be in gene[0], the data for gene No. 2 will be in gene[1], etc.

Incidentially, if you want an array of headers, just do:

headers = text.match( r );

No need to do headers.shift(). The match() operation creates an exact array.

If your genes are aligned, you can compare them, base to base, in a loop. (If your genes are not aligned, create an alignment using an online ClustalW alignment tool or using the popular Mega6 program.) The following loop construct compares the first 300 bases in two genes, and tallies the differences according to whether the difference occurred in codon base one, base two, or base three:

snp=[0,0,0]; // array to hold base 1,2,3 results
gene1 = genes[0];
gene2 = genes[1];
 
for (var i=0; i < 300; i++)
   snp[ i % 3 ] += gene1[i] != gene2[i];

You can display the results in the console simply by adding (on its own line) snp; or console.log(snp). Or if you want to see it in a dialog, execute alert(snp).

The final line of code deserves explanation. Results are placed in the snp (single nucleotide polymorphism) array according to whether the "hit" occurred in base 1, base 2, or base 3 of a codon. The i % 3 construct (i modulo 3) means divide i by 3 and throw away the "answer" but keep the remainder. (So for example, 5 % 3 equals 2, 6 % 3 equals zero, 7 % 3 equals 1, etc.) As i increments, i % 3 simply takes on values of 0, 1, 2, 0, 1, 2, etc.

On the right side of the equals sign we have gene1[i] != gene2[i]. This means we want to compare the two genes at an offset of i, and if they are not equal (!=), tally the resulting value (true, or 1) numerically. The numeric result is added to the appropriate slot in snp. Note that JavaScript treats true as having a numeric value of one and false as having a numeric value of zero. It knows to cast the boolean value to a number because of the += symbol (which means numerically add the following value to the existing value of the lefthand variable).

I hope this short tutorial wasn't too painful. If you're new to JavaScript, my advice is: Experiment in the console (Firefox's Scratchpad or Chrome's JS console) a lot, and get familiar with the string functions split() and match(), because they can be incredibly useful, not to mention speedy. Either of those two functions can take a string argument or a regex. But remember to put a 'g' after the regex if you want the operation to occur globally throughout the string. For example:

"abcdefabcdef".match( /abc/ )

will only match the first occurrence of abc, whereas

"abcdefabcdef".match( /abc/g )


will match both occurrences of abc and give you an array of matches.

Thursday, May 08, 2014

Ancient Drug Resistance Genes

Antibiotic-resistant bacteria have been in the news lately, with the usual scary talk about drug-resistant bacteria taking over the world, with a return to the Dark Ages if we don't Do Something.

Certain facts about drug resistance tend to get lost in these sorts of news stories. There's a common misconception that somehow the introduction of antibiotics (first in medicine, then in agriculture) caused bacteria to invent new drug-resistance genes out of nowhere. That's nonsense, of course. The Lederbergs showed in 1951 that drug resistance genes were/are preexisting, and did not come into being because of antibiotics. Rather, bacteria already equipped with such genes were able to survive treatment with antibiotics. Around 1968 it also became clear that bacteria can share drug resistance genes by exchange of extrachromosomal DNA (plasmids). Bacteria that don't have the magic genes can get them from their buddies.

But, so. How is it that the genes for antibiotic resistance already existed prior to the introduction of antibiotics into the food chain and into medicine? The answer is simple: Antibiotics are natural products produced by common soil and water microorganisms. Penicillin is produced by a mold; streptomycin is produced by the soil bacterium Streptomyces. Certainly hundreds (maybe thousands) of different kinds of antibiotics exist in the natural environment. They've been there for millions of years.

Mycobacterium tuberculosis ATCC 35801 (Erdman strain) has two "multidrug  resistance" genes, and they reside on the main chromosome (not on a plasmid). They're shared by other members of the Mycobacterium genus, including the leprosy bacterium, M. leprae. In the latter, one of the genes in question (MLBr_2224) is a pseudogene. It's interesting to compare this pseudogene with its counterpart, Erdman_0866, in Mycobacterium tuberculosis ATCC 35801. The two genes are nearly the same size: 1543 base pairs versus 1638. This is quite interesting in itself, inasmuch as most pseudogenes in M. leprae are truncated (averaging just 795 base pairs). But the comparison gets even more interesting when one does a side-by-side analysis of single nucleotide polymorphisms (changes to individual bases).

First I did a SNP comparison of the tuberculosis drug-resistance gene to its counterpart in M. indicus pranii, where there were 325 total base-pair differences, distributed amongst the 1st, 2nd, and 3rd base pairs of codons as:

98, 72, 155

This is the expected pattern: The largest number of mutations tends to accumulate in the third codon base (the so-called "wobble base"), because mutations in this base give a large percentage of synonymous codon changes owing to codon degeneracy. The next-highest number of mutations is expected in the first base, because there is some (but not much) degeneracy in that position. All mutations in the second base are non-synonymous and likely to affect enzyme function. Hence, the lowest number of mutations occurs in the second base.

When I ran the same check between the M. tuberculosis gene and its counterpart (the pseudogene) in M. leprae, I found that the mutational differences totaled 418 changes and segregated by base as:

119, 103, 196

Unexpectedly, the same pattern emerges. The reason this is unexpected is that one expects a pseudogene to contain frameshifts that would destroy the reading frame, mooting 1st/2nd/3rd-base comparisons. More generally, one expects massive random mutations all along a pseudogene's length (of the kind that would tend to make all three of the above numbers equal, even without frameshifts). So the fact that we still see the familiar pattern of 1st/2nd/3rd-base mutations is interesting.

The genes in question were unequal in length, so to do these comparisons I had to disregard a 112-base leader portion of the aligned genes. (I aligned the genes via ClustalW using Mega6.) I also ignored the unequal-length trailer portions of each gene, analyzing only the interior (aligned) portions from base 112 to base 1418. Also, to facilitate the analysis, I removed one adenine (representing a one-base insertion) in the M. leprae gene, at position 407, to restore the reading frame. But it's interesting that at that position there's a run of six adenines, indicative of a slippery site, of the kind associated with frameshift signalling.

The M. leprae pseudogene MLBr_2224 behaves as if it is still a normal gene, accumulating mutations in the "right places." An alternative explanation is that the M. leprae and M. tuberculosis genes were already in pretty much this orientation relative to each other before the two species diverged, and M. leprae simply accumulated very few additional mutations in the pseudogene after it became a pseudogene. (M. leprae is assumed to have experienced a massive pseudogenization event between 9 and 20 million years ago.)

For more about M. leprae's strange assortment of pseudogenes, see this post and also this post.

Wednesday, May 07, 2014

Fixing the Genome Annotation Nightmare

Gene misannotation is a huge ongoing problem in biology and is not getting any better. Not long ago (December 2009), Schnoes et al., in a study called "Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies," remarked:
The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%–63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation has increased from 1993 to 2005. The types of misannotation that were found fall into several categories, most associated with “overprediction” of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized.
There are several important aspects to the problem:
  1. Several types of gene discovery software are in common use (each with its own strengths and weaknesses) and there is little standardization of output.
  2. Most gene discovery programs (such as the widely used Glimmer) must be trained against high quality training-set data in order to produce reliable output.
  3. Faulty data, once it enters the gene databases, becomes part of the training set for other researchers trying to train their software. Hence, error propagation is a major ongoing problem.
  4. Most gene discovery programs are not good at resolving gene-overlap situations, sometimes relying on unsophisticated heuristics to determine sense/antisense strand assignment.
  5. Crosschecking of results is a timeconsuming, manual task and is rarely done for an entire genome.
  6. Public gene databases do not usually include scoring info (a rating system that rates the confidence level of a given gene's function assignment).
Until these issues are addressed, gene databases will continue to acquire data of dubious quality, leaving it to the user to double-check the data as best as he or she can.

A confidence-scoring system is very much needed, so that a person can tell at a glance whether a given annotation is trustworthy, and to what extent.

Ironically, inability to find genes (underreporting of results) is not a problem. False positives, on the other hand, are a big problem. Glimmer 3 reportedly produces many fewer false positives than the previous version of the software, but probably two thirds of existing public genomes were annotated with Glimmer 2 and have yet to be revised.

BLAST searches (to crosscheck a gene against other genes) are of limited utility since there is already so much corrupt data in public databases. It's not uncommon to check a gene that has a questionable functional assignment and find that a BLAST search brings back genes with equally questionable functional assignments (because the same annotation software has choked before on orthologous genes). Nevertheless, sometimes careful examination of a gene in relation to nearby genes (which often share similar functionalities), inspection of its Shine Dalgarno region (if it exists), inspection of GC3 values, etc. can help verify that a gene really is what the annotation says it is. Sometimes it's even possible to assign a function to a gene that's marked as "hypothetical protein." This sort of work is tedious, but it's exactly the sort of work that could and should be crowdsourced to spare-time biohackers. It would help if a system were in place for volunteers to help with basic "sanity checking" of machine-annotated genes. It would take quite an army of volunteers to go through the 20,000 or so bacterial genomes (each with several thousand genes, on average) already on the books, but who knows if maybe a Craig Venter couldn't organize such an effort and make it work?

One thing is for sure. If we don't do something to improve genome annotation quality, we'll soon be drowning in junk annotations; never mind junk DNA.