Saturday, January 10, 2015
Schizophrenia is another one of those conditions (like alcoholism) that we're constantly being told is highly heritable, based on the results of twin studies and various other findings (from studies that pre-date the DNA-sequencing era). Studies "proving" the heritability of schizophrenia are, in general, rarely questioned, or even reviewed critically, due to the fact that a genetic basis for schizophrenia makes so much intuitive sense and agrees so well with ordinary experience. Everybody knows someone who "has schizophrenia in their family," and it very much does seem to travel in families.
The fact that something travels in families is not automatically indicative of a genetic connection, but most people aren't ready to believe such a thing, since it goes against common sense. Nevertheless, in the past few decades, a great deal of work (much of it quite compelling) has gone into the study of things like intergenerational trauma, with researchers finding, for example, that the course of PTSD in modern-day Israeli military veterans is markedly different for soldiers whose parents or grandparents were Holocaust survivors. (I talk about intergenerational trauma in Of Two Minds. In fact, there's a whole chapter on trauma.)
Social workers and criminologists have known for years that childhood physical and sexual abuse tend to be intergenerational. Yet no one seriously suggests there's a "child abuse gene" or a "child molestation gene." A lot of things that aren't purely genetic are nonetheless transgenerational; they run in families. Schizophrenia itself may be in this category. The links between childhood trauma and schizophrenia are strong and well documented. Indeed, the literature shows that the associations between childhood trauma and schizophrenia are as strong as or stronger than the associations between childhood trauma and other disorders known to correspond closely to childhood trauma (including depression, PTSD, anxiety disorders, dissociative disorders, eating disorders, personality disorders, substance abuse, and sexual dysfunction, among others).
Still, the search for "schizophrenia genes" goes on, and every so often we hear claims that this or that group of scientists has discovered some new genetic feature that confers high risk for schizophrenia. The latest of these reports comes from the Schizophrenia Working Group of the Psychiatric Genomics Consortium, which published a letter in Nature in July 2014 called "Biological insights from 108 schizophrenia-associated genetic loci." The report trumpets the finding of "108 conservatively defined loci that meet genome-wide significance, 83 of which have not been previously reported." This is a genome-wide association study, subject to the usual GWAS caveats.
Of the 108 loci, 75% were associated with protein-coding genes (a high proportion for a GWAS), and quite a few of the implicated genes have physiologically relevant functions.
On the surface, it sounds promising.
Where things start to fall down is in determining how much of schizophrenia's heritability the 108 loci can account for. The authors said: "Assuming a liability-threshold model, a lifetime risk of 1%, independent SNP effects, and adjusting for case-control ascertainment, RPS" (risk profile scores) "now explain about 7% of variation on the liability scale to schizophrenia across the samples, about half of which (3.4%) is explained by genome-wide significant loci." To parse that: the lifetime risk of schizophrenia is about 1%. If we total up the risk attributable to all of the SNPs used to build the risk profile scores, we can explain about 7% of the variation in liability to schizophrenia. But if we limit ourselves to the 108 loci that met criteria for genome-wide significance (a stringent test intended to reduce the risk of false positives), we can explain only 3.4% of it.
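To make the "risk profile score" idea concrete, here's a minimal sketch of how a polygenic score is computed; this is not the consortium's actual pipeline, and the SNP IDs, effect sizes, and genotypes below are hypothetical. Each SNP contributes its estimated effect size (log odds ratio) multiplied by the number of risk alleles a person carries.

```python
# Minimal polygenic risk-score sketch. SNP IDs, weights (log odds ratios),
# and genotypes are hypothetical; real scores use thousands of SNPs whose
# weights come from a GWAS discovery sample.
effect_sizes = {
    "rs0000001": 0.05,
    "rs0000002": 0.03,
    "rs0000003": -0.02,   # negative weight: the counted allele is protective
}

def risk_profile_score(genotype):
    """genotype maps SNP ID -> risk-allele count (0, 1, or 2)."""
    return sum(beta * genotype.get(snp, 0) for snp, beta in effect_sizes.items())

person = {"rs0000001": 2, "rs0000002": 1, "rs0000003": 0}
print(risk_profile_score(person))  # 0.13; higher score = higher modeled liability
```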
So once again, it's a mixed bag: a genome-wide association study (GWAS) that's filled with interesting findings (some of which may prove to be important), but that fails, quite miserably, to explain heritability.
Many people have commented on the possible reasons why genome-wide association studies have been hit-or-miss in their ability to explain heritability, why they so often find SNPs (single nucleotide polymorphisms) that don't map to genes, why the results often don't replicate well, and so on. I think if we're honest, we have to admit that GWAS isn't always the right tool for the job. It's a great tool, but you have to remember how it works. A GWAS looks for associations between a trait and fairly common single nucleotide polymorphisms (which you can think of as point mutations occurring in 1% or more of the population), using genotyping arrays built around the most common polymorphisms. In a GWAS, you aren't doing deep sequencing of people's genes; you're looking for markers that cover a rather tiny percentage of the genome. These markers occur commonly, which almost by definition prevents them from being very serious defects, because serious genetic defects are rendered rare by evolution (natural selection removes them from the gene pool over time). Therefore, in a technique that, by design, looks mainly at commonly occurring SNPs, which are overwhelmingly neutral (from an evolutionary standpoint), we should not be surprised to find that the markers that get spotted in GWAS (as genome-wide significant) are essentially neutral mutations without much phenotypic effect.
Unfortunately, we're at an awkward point in the history of science, because technology has given us some powerful DNA analysis tools (and the computers needed to crunch the data), but we're not quite yet at the point where detailed, deep genetic sequencing of whole human genomes is economically feasible for a study that includes, say, a thousand cases and a thousand controls. (Right now, it costs about $10,000 to do the necessary sequencing.) When a human genome can be reliably sequenced for under $2000, we'll have the tools to do really serious, deep genetic studies of hard-to-pin-down diseases. Whether we'll have the computer power needed to do the necessary statistical analysis on the scale of large (N > 1000) case-control studies is another matter; remember that in GWAS we're dealing with perhaps 500,000 data points per genome whereas in deep sequencing we're generating billions of data points per genome. It'll probably be doable using Amazon AWS/EC2 infrastructure (or Google's compute-cloud). Another reason to invest in Amazon, probably.
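To get a feel for the scale gap, here's a back-of-envelope comparison; the chip size, genome length, and 30x coverage are round-number assumptions, not figures from any particular study.

```python
GENOME_BP = 3.2e9   # approximate length of the human genome, in base pairs
CHIP_MARKERS = 5e5  # markers on a typical GWAS genotyping chip
COVERAGE = 30       # a common target depth for whole-genome sequencing

raw_bases = GENOME_BP * COVERAGE
print(f"chip data points per genome: {CHIP_MARKERS:.0e}")
print(f"raw sequenced bases (30x):   {raw_bases:.1e}")
print(f"scale gap: {raw_bases / CHIP_MARKERS:,.0f}x")  # roughly 190,000x
```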
I'm in the process of writing a 100,000-word book on mental illness. It's evidence-based; part science, part memoir, with over 200 footnotes (so you can refer to the scientific literature yourself). To follow the progress of the book, check back here often, and also consider adding your name to our mailing list. Thanks!
Wednesday, January 07, 2015
Winner's Curse
Before saying anything more about genetic studies, I want to talk about a problem that's quite common in scientific studies. It's colloquially (and somewhat inappropriately, since the analogy doesn't hold 100%) known as the "winner's curse."
In economics, the idea of the "winner's curse" comes from the observation that the winner of an auction often overpays. Maybe you've noticed this phenomenon yourself, if you've participated in auctions. It's not just that people get into a frenzy of bidding and drive prices too high; that's not really the core idea. The core idea is more like this: Suppose a laptop goes on auction and there are ten people in the room, all of whom will bid on it. Each of the ten has in mind an estimate (a top bid) of the laptop's true value. The average of those ten estimates might be, let's say, $300. But an average bid doesn't win an auction, does it? The highest bid does. And that might be considerably in excess of $300.
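A quick simulation shows the effect; the $300 true value, the noise level, and the bidder count are made-up numbers. If each bidder's estimate is unbiased noise around the true value, the average estimate is about right, but the winning (highest) estimate is systematically too high.

```python
import random

random.seed(1)
TRUE_VALUE = 300.0   # hypothetical true worth of the laptop
NOISE = 60.0         # standard deviation of each bidder's estimation error
BIDDERS = 10

winning_bids = []
for _ in range(10_000):  # simulate many auctions
    estimates = [random.gauss(TRUE_VALUE, NOISE) for _ in range(BIDDERS)]
    winning_bids.append(max(estimates))  # the highest estimate wins

print(sum(winning_bids) / len(winning_bids))  # ~392: winners overpay by ~30%
```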
The "winner's curse" phenomenon comes into play in science a lot, because (for example) when a systematic review is done for various studies that found the "effect size" for a given type of medical treatment, the effect size often turns out to be larger in smaller studies. In theory, a treatment should give roughly the same effect size in all different sizes of study. If aspirin is a successful treatment for headaches 50% of the time, a study involving 30 people should show that, and a study involving 1000 people should show that. Instead, what you often see is (here I'm making numbers up) aspirin works 60% of the time in a small study and 50% of the time in a large study.
You can see how this might happen. Suppose four different teams of scientists (who don't talk to each other) decide to investigate the effectiveness of a vitamin milkshake as a hangover cure. And suppose that the "true effectiveness" of this "cure" (over a large enough number of studies and subjects) is actually about 20%. However, our four teams, working with very small study populations (hence, a high potential for statistical noise) find effectiveness of 11%, 13%, 22%, and 30%. The teams that got the lowest numbers probably won't publish their results. The team that got 30% probably will.
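That selective-publication dynamic is easy to simulate too; the 20% true rate, the 25-subject studies, and the publish-only-the-best rule below are all assumptions for illustration.

```python
import random

random.seed(2)
TRUE_RATE = 0.20   # hypothetical true effectiveness of the "cure"
N_SUBJECTS = 25    # small studies mean noisy estimates
N_TEAMS = 4

def run_study():
    cured = sum(random.random() < TRUE_RATE for _ in range(N_SUBJECTS))
    return cured / N_SUBJECTS

published = []
for _ in range(5_000):  # many rounds of four competing teams
    results = [run_study() for _ in range(N_TEAMS)]
    published.append(max(results))  # only the best-looking result is published

print(sum(published) / len(published))  # ~0.28: the literature inflates a 0.20 effect
```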
This sort of thing happens in science a lot, and it helps explain why, for example, some of the early twin studies on schizophrenia, which were relatively small, found high concordance rates for schizophrenia in identical twins (over 60%, far above the rates in fraternal twins), whereas later, much larger studies have found rates as low as 11%.
"Winner's curse" is a well-known ascertainment problem in science that affects studies of many kinds, including some of the recent large genome-wide associataion studies that have produced findings that failed to replicate when other teams decided to do similar sorts of investigations. I'll be talking more about that about soon.
Note: For a more technical discussion of "winner's curse" in genetic studies and what can be done about it, see Sebastian Zöllner and Jonathan K. Pritchard, "Overcoming the Winner’s Curse: Estimating Penetrance Parameters from Case-Control Data," Am J Hum Genet. Apr 2007; 80(4): 605–615.
I'm working on a no-nonsense, "skewer the sacred cows" book about mental illness. If you'd like to follow the progress of the book, sign up for the mailing list, and be sure to check back here regularly. Thanks!
Tuesday, January 06, 2015
The Problem of "Missing Heritability"
When the human genome was sequenced in 2003, the expectation was that scientists, armed with a powerful array of new genetic techniques, would very quickly identify the genetic correlates of things like schizophrenia, depression, autism, and alcoholism, which we supposedly "know" have a large genetic component. Extremely large, well-funded genome-wide association studies (GWAS) were begun. (See this PLoS paper to learn how such studies are conducted.) Surprisingly, the studies failed, for the most part, to converge on genes, gene copy number variants, SNPs (single nucleotide polymorphisms), gene rearrangements, mutations, or other peculiarities of allelic architecture whose presence could predict important diseases. When genetic loci of interest were found, the findings often didn't replicate in followup studies; or if they did, the relative odds ratios (a measure of the ability of genetic features to predict a trait) fell far short of explaining the known heritabilities of the traits in question.
If things like schizophrenia and alcoholism truly do have a strong genetic basis (as we've been told), and if they were to involve tractable numbers of genes (say, dozens or scores of genes, rather than hundreds or thousands), DNA studies of the GWAS kind should produce immediate, strong, recognizable genetic signatures of disease. The results should leap off the page. But they don't. Time and again, the very few candidate alleles that are found are either found in low numbers, or have low "penetrance" (low capacity to predict disease), or both, and quite often the candidate alleles that are found are not confirmed by followup studies.
This has given rise to a major crisis in genetics, summed up in a paper called "Finding the missing heritability of complex diseases" that appeared in Nature in 2009. Scientists are desperate to explain the "missing heritability" of disorders we "know" are genetic.
It's assumed, by most scientists, that the failure of genome-wide association studies to find genetic explanations of complex diseases can be attributed to such studies' low power and resolution for genetic variations of modest effect. (So the quest has begun to increase the study size of future trials, in hopes of seeing more robust results.) In addition, there's a reasonable expectation that many complex diseases will doubtless be found to involve numerous genes, each contributing only a small effect to the total. It's also been suggested that some traits are dictated by extremely rare genetic features with high "penetrance." There's also an awareness that current laboratory techniques have low power to detect gene-gene interactions. What no scientist wants to say, however—and what none of the 24 co-authors of the Nature article (above) would say—is that maybe the genetic component(s) of schizophrenia, alcoholism, depression, etc. are simply vanishingly small to begin with. Yet that's exactly what the data are telling us. But we won't listen to the data because it doesn't fit our preconception of how the world should work.
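The low-power explanation is easy to quantify with a standard back-of-envelope power calculation; the 0.04%-of-variance effect size and the sample sizes below are assumptions chosen for illustration, not values from the Nature paper.

```python
from scipy.stats import ncx2, norm

ALPHA = 5e-8  # conventional genome-wide significance threshold
CRIT = norm.ppf(1 - ALPHA / 2) ** 2  # ~29.7, 1-df chi-square critical value

def gwas_power(n_samples, var_explained):
    """Approximate power to detect a SNP explaining `var_explained` of trait
    variance: 1-df chi-square test with noncentrality ~ N * variance explained."""
    return ncx2.sf(CRIT, df=1, nc=n_samples * var_explained)

# A hypothetical small-effect SNP explaining 0.04% of liability variance:
for n in (10_000, 50_000, 150_000, 500_000):
    print(n, round(gwas_power(n, 0.0004), 3))  # power climbs from ~0 toward 1
```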
We "know" that a person's height is largely controlled by genes. Studies going back almost a century have determined that body height is 80% to 90% heritable; no one seriously questions this fact. (Height is heritable—it "runs in families.") However, at least three large, modern genetic studies have been done to find "height genes"; the largest involved over 180,000 study subjects (and 291 co-authors). In all, some 180 genetic loci were identified that play a role in determining a person's height. But the 180 genomic features, put together, accounted for only 10% of observed variations in height. The rest appears to be environment.
Are we now supposed to enlarge our "study population" from 180,000 to several million, in order to find the genetic explanation for body height, just because we "know" one exists?
At some point, don't we have to just admit "the data's the data"?
Shouldn't we be willing (at least provisionally) to entertain the idea that maybe the twin studies, and the data showing that certain things "run in families" (pre-GWAS-era data, mostly), are in need of reevaluation? Shouldn't we at least consider the idea that prior studies misjudged the importance of uncontrolled-for environmental variables? Now that we have powerful DNA-analytic techniques for investigating heritability, and the techniques aren't giving us the results we want, must we "increase the size of the microscope" to make the results look bigger? Or shouldn't we just accept what the data are telling us (as painful as that may be)?
In a future post, I'll talk about some of the GWAS results for alcoholism. Stay tuned.
This post is, in part, derived from material for a forthcoming book on mental illness I'm writing. Please come back often to find out how to get free sample chapters.
Sunday, March 23, 2014
Nucleus-like viruses and their enzymes
Recent findings in virology have forced biologists to consider many notions that just a few years ago would have seemed heretical and/or science-fiction-like. For example, there is now serious discussion of the possibility that cellular life descended from viruses (the Virus World theory; see also this paper). A growing (but still minority) viewpoint is that viruses should be considered symbionts rather than simply parasites (see the review by Villarreal). Some have dared to propose that the eukaryotic cell nucleus actually stemmed from a virus. Others have speculated the reverse: that the large DNA viruses are actually escaped, spore-like nuclei. Meanwhile, some say that during an earlier RNA World, viruses became the original inventors of DNA.
There's no question that large viruses of the NCLDV class have nucleus-like properties. Within a short time of infection, these viruses set up a complex structure inside the cell known as the virus factory, and the factory looks a lot like a cell nucleus. The authors of a recent paper on Mimivirus (the famously huge virus that infects freshwater amoebae) admitted that in previous work, they did, in fact, mistake the virus factory for the nucleus. (See photo.)
Morphology aside, the large "nucleocytoplasmic" viruses (some of which infect animals and marine life, not just amoebae) bring with them many genes for enzymes that are normally found in a cell nucleus. I'm not talking about genes for DNA polymerases, topoisomerases, etc., but genes that act on small molecules. In a previous post, I mentioned the example of PBCV-1 (a virus that infects the alga Chlorella) having its own gene for aspartate transcarbamylase (ATCase), an enzyme that catalyzes the first committed step in pyrimidine synthesis. This enzyme (common to most living things) is predominantly found in the cell nucleus of higher organisms.
There are other examples. Many NCLDV-group viruses have a gene for deoxy-UTP pyrophosphatase, an enzyme that breaks the high-energy phosphates off dUTP so that uracil isn't accidentally incorporated into DNA. One can imagine that after a virus invades a cell and unleashes its nucleases on the cell's own RNA, many ribonucleotides (breakdown products of RNA) will be liberated; and many of these will then be reduced to deoxy-nucleotides (by ribonucleoside-diphosphate reductase) in preparation for viral DNA synthesis. As it happens, dUTP is quite easily incorporated into DNA (and is promiscuous in its Watson-Crick pairing with other nucleobases); the resulting malformed DNA can trigger apoptosis in some cells. The virus takes no chances. It brings its own dUTPase to make sure uracil never gets into its DNA by mistake.
Some viruses bring their own gene for thymidylate synthase, to bring about the conversion of dUMP to dTMP (in other words, methylation of uracil, in its deoxy-ribonucleoside-monophosphate form, to give thymidine monophosphate). Some also have a gene for thymidylate kinase, which converts dTMP (often just called TMP) to dTDP (or TDP).
Yet another "small-molecule" enzyme encoded by large DNA viruses is ribonucleoside-diphosphate reductase (RDPR). This enzyme is fundamental to the whole DNA synthesis enterprise. Its job is to convert ordinary ribonucleotides to the deoxy form that DNA needs. Without this enzyme, you can make RNA but not DNA. So it's typically found in the cell nucleus (in higher organisms).
It turns out a gene for RDPR is contained in a great many viral genomes. When I did a BLAST search of the protein sequence for Chlorella virus ribonucleoside reductase against the UniProt database of virus sequences, the search came back with 863 hits, spanning viruses belonging not only to the NCLDV class (pox, mimivirus, phycodnaviruses, etc.) but also the Herpesviridae, plus many bacteriophage groups as well. In terms of the sheer variety of virus groups involved, it's hard to think of another "small-molecule-processing" enzyme that spans as many viral taxa. We're talking about everything from relatively small bacteriophages to mimivirus, and lots in between.
The reductase gene is so widespread, it made me wonder what its phylogenetic distribution might look like. In other words: Are viral RDPRs related to each other? Are they related to the host's own RDPR? Does the enzyme's evolution follow the viral path, or the host path?
Just for fun, I obtained a number of ribonucleoside reductase (small subunit) protein sequences for viruses, plants, animals, bacteria, fungi, and various eukaryotic parasites (using the tools at UniProt.org), then fed the results to the tree-maker at http://www.phylogeny.fr. What I got was the following "maximum likelihood" phylogenetic tree. (See this paper for details on the tree algorithm. Also, be sure to check out this nifty paper to learn more about how to read this sort of tree.)
For convenience, names of viruses are depicted in blue. Notice how, except for the Vaccinia-Variola group, which is deeply nested, most of the viral nodes are ancestral to most of the higher-organism nodes; you have to go through many levels of viral ancestors to get from the original, universal ancestor (presuming there was one) to the reductase gene of the pig, say. From this diagram, it would appear that the Pox-family reductase gene is derived, in some way, from a highly evolved host. But that's the exception, not the rule. All of the other viral genes are outgroups and/or, more usually, ancestors of one another.
Mimivirus is fairly high up the chain and shows relatedness to two very common freshwater and soil bacteria (Pseudomonas and Burkholderia).
It would be fun to go back and remake the tree, adding more organisms. (If you end up trying this, let me know the results.) For now, I'm comfortable concluding that, except for pox-family viruses, the ribonucleoside reductases produced by major DNA viruses and phages are not derived from current-day hosts. A parsimonious (but not necessarily correct!) explanation is that the phage reductases are ancestral to host orthologs; but it is also possible that the phage reductases derive from very ancient hosts (not depicted in the tree), with current-day hosts appearing to derive from phage genes when in fact the similarity is to a long-ago host ortholog. In any case, the tree shows that organismal RDPRs tend to be related to organismal RDPRs, and viral versions to viral versions. What we don't see anywhere is a viral sub-tree growing out of a host sub-tree (as would be the case if the viral enzymes simply derived from modern host enzymes).
The UniProt identifiers of the protein sequences used in this study are given below, in case you want to replicate these results (or perhaps extend them). To retrieve the protein sequences in question, go to http://www.uniprot.org/ and click the Retrieve tab, then copy and paste the following accession numbers (one to a line) exactly as shown:
O57175 P33799 M1I7H3 E5ERR7 Q6GZQ8 Q77MS0 P28847 M1I8A4 W0TWG5 Q7T6Y9 Q9HMU4 T0MT29 201403222BWOVN08AD B3ERT4 F2II86 F2L908 U7RFH3 Q4KLN6 I3LUY0 B9RBH6 Q9LSD0 S8GD97 W4I9N3 Q4DFS6 A4HFY2 G3XP91 S8B144
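If you'd rather script the retrieval than use the web form, here's a minimal sketch. It assumes UniProt's current REST FASTA endpoint (rest.uniprot.org), which postdates this post; note too that at least one entry above (the long date-like string) looks like a stray job ID rather than a UniProt accession, so expect some lookups to fail.

```python
import urllib.request

# First few accessions from the list above, for brevity; paste in the full set.
# Some entries may be obsolete or invalid, so failures are caught and reported.
accessions = "O57175 P33799 M1I7H3 E5ERR7 Q6GZQ8 Q77MS0 P28847".split()

with open("rdpr_sequences.fasta", "w") as out:
    for acc in accessions:
        url = f"https://rest.uniprot.org/uniprotkb/{acc}.fasta"  # assumed endpoint
        try:
            out.write(urllib.request.urlopen(url).read().decode())
        except Exception as err:
            print(f"skipping {acc}: {err}")

# The combined FASTA can then go to an aligner and tree-builder
# (e.g., the one-stop pipeline at phylogeny.fr, as in the post).
```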
Friday, May 24, 2013
Decrypting DNA
In a previous post ("Information Theory in Three Minutes"), I hinted at the power of information theory to gauge redundancy in a language. A fundamental finding of information theory is that when a language uses symbols in such a way that some symbols appear more often than others (for example, when vowels turn up more often than consonants in English), it's a tipoff to redundancy.
DNA is a language with many hidden redundancies. It's a four-letter language, with symbol choices of A, G, C, and T (adenine, guanine, cytosine, and thymine), which means any given symbol should be able to convey two bits' worth of information, since log2(4) is two. But it turns out different organisms speak different "dialects" of this language. Some organisms use G and C far more often than A and T; at a GC content of around 73%, if you do the math, each symbol carries a maximum of about 1.837 bits (not 2 bits) of information.
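The per-symbol arithmetic is a one-liner; in this sketch, the 73% GC content is just an example of a strongly skewed genome.

```python
from math import log2

def symbol_entropy(freqs):
    """Shannon entropy, in bits per symbol, of a frequency distribution."""
    return -sum(f * log2(f) for f in freqs if f > 0)

print(symbol_entropy([0.25] * 4))  # 2.0 bits: A, G, C, T used equally
gc = 0.733                         # hypothetical GC-rich genome
print(symbol_entropy([gc / 2, gc / 2, (1 - gc) / 2, (1 - gc) / 2]))  # ~1.837 bits
```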
Consider how an alien visitor to earth might be able to use information theory to figure out terrestrial molecular biology.
The first thing an alien visitor might notice is that there are four "symbols" in DNA (A, G, C, T).
By analyzing the frequencies of various naturally occurring combinations of these letters, the alien would quickly determine that the natural "word length" of DNA is three.
There are 64 possible 3-letter words that can be spelled with a 4-letter alphabet. So in theory, a 3-letter "word" in DNA should convey 6 bits worth of information (since 2 to the 6th power is 64). But an alien would look at many samples of earthly DNA, from many creatures, and do a summation of -F * log2(F) for every 3-letter "word" used by a given creature's DNA (where F is simply the frequency of usage of the 3-letter combo). From this sort of analysis, the alien would find that even though 64 different codons (3-letter words) are, in fact, being used in earthly DNA, in actuality the entropy per codon in some cases is as little as 4.524 bits. (Or at least, it approaches that value asymptotically.)
Since 2 to the 4.524 power is 23, and since proteins (the predominant macromolecule in earthly biology) are made of amino acids, a canny alien would surmise that there must be around 23 different amino acids, and that earthly DNA is a language for mapping 3-letter words to those 23 amino acids.
As it turns out, the genetic code does use 3-letter "words" (codons) to specify amino acids, but there are 20 amino acids (not 23), with 3 "stop codons" reserved for telling the cell's protein-making machinery "this is the end of this protein; stop here."
[Figure: E. coli codon usage table.]
The chart above shows the actual codon usage pattern for E. coli. Note that (with minor exceptions) all organisms use the same 3-letter codes for the same amino acids, and most organisms use all 64 possible codons, but the codons are used with vastly unequal frequencies. If you look in the upper right corner of the chart, for example, you'll see that E. coli uses CTG (one of the six codons for leucine) far more often than CTA (another codon for leucine). One of the open questions in biology is why organisms favor certain synonymous codons over others (a phenomenon called codon usage bias).
While DNA's 6-bit codon bandwidth permits 64 different codons, and while organisms do generally make use of all 64 codons, the uneven usage pattern means fewer than 6 bits of information are used per codon. To get the actual codon entropy, all you have to do is take each usage frequency and calculate -F * log2(F) for each codon, then sum. If you do that for E. coli, you get 5.679 bits per codon. As it happens, E. coli actually does make use of almost all the available bandwidth (of 6 bits) in its codons. This turns out not to be true for all organisms, however.
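Given a codon-usage table, the computation is the same summation; the four-codon table below is a made-up fragment for illustration (a real table covers all 64 codons, and with a full E. coli table the sum comes out near the 5.679 bits quoted above).

```python
from math import log2

def codon_entropy(usage):
    """Bits per codon, given {codon: count or relative frequency}."""
    total = sum(usage.values())
    return -sum((n / total) * log2(n / total) for n in usage.values() if n)

# Hypothetical fragment of a usage table (real tables list all 64 codons).
usage = {"CTG": 529, "CTA": 39, "CTT": 110, "CTC": 103}
print(codon_entropy(usage))  # entropy of this 4-codon toy distribution
```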
Monday, May 06, 2013
Hydrogen Peroxide Powers Evolution
I'm about to offer a conjecture that is a bit preposterous-sounding but could well hold true. I actually think it does.
I propose that evolution, at the level of bacteria (though probably not at higher levels), is driven by hydrogen peroxide.
This theory rests on three assumptions: first, that the creation of new bacterial species happens almost entirely via lateral gene transfer, not heritable point mutations; second, that bacteria (marine and terrestrial) are regularly exposed to challenges by hydrogen peroxide in the environment; and third, that those challenges drive lateral gene transfer.
Evidence for the first assumption is embarrassingly abundant. If you're not up to speed on the subject, I suggest you read the excellent paper, "Lateral Gene Transfer," by Olga Zhaxybayeva and W. Ford Doolittle in Current Biology, April 2011, 21:7, pp. R242-246 (unlocked copy here). It's now common to find that any given bacterial species can trace a good percentage of its protein base to "ancestors" that are too far removed horizontally to be ancestors in the conventional sense.
Consider E. coli. There are hundreds of strains of E. coli, with genes ranging in number from 4,100 to about 5,300 per strain. The problem is, the various strains of E. coli have only about 900 genes in common (and that's far too few genes to render a fully functional E. coli). The E. coli pan-genome actually takes in more than 15,000 gene families, total. Certainly, you can draw a family tree of E. coli based on 16S ribosomal polymorphisms, but that doesn't explain where the 15,000 pan-genome genes came from. The "family tree" metaphor quickly breaks down if you start drawing trees based on proteins. You get many conflicting trees—all of them correct.
[Figure: a classic branching phylogenetic tree. Caption: Trees like this are fiction where bacteria are concerned. The tree of life is more like a net of life or web of life than a directed acyclic graph.]
Where are all of the genes coming from? Other species, of course. They arrive by way of mechanisms like transformation, transduction, and conjugation, all of which allow direct entry of foreign DNA into a bacterial cell. At one time it was thought that conjugation could occur only between bacteria of the same species, but it is now known that cross-species conjugation also occurs (as, for example, between E. coli and Streptomyces or Mycobacterium).
Transduction, which is where viruses package up an infected host's genes in virus capsules that are then taken up by another cell, occurs naturally in bacterial populations in response to environmental factors like ultraviolet light and hydrogen peroxide. Exposure of a virus-carrying (lysogenic) cell to UV light or peroxide can induce runaway production of virus, and in fact this mechanism is used by Streptococcus to kill competitive Staphylococcus cells, in a clever bit of chemical warfare. It's been known for years that hydrogen peroxide can cause many types of bacteria to shed DNA. Now we know why: Hydrogen peroxide is a signalling molecule. It signals (among other things) lysogenic bacteria to go into a lytic cycle. It also signals cells to mount what's known as the SOS response, which is a global response to oxidative challenge. Years ago, Bruce Ames and his colleagues showed that exposing Salmonella to very dilute (60 micromolar) hydrogen peroxide caused the cells to differentially express 30 "SOS" proteins, including heat-shock proteins and low-fidelity DNA-repair systems. We know that hydrogen peroxide as dilute as 0.1 micromolar can induce phage (virus) production in up to 11% of marine bacteria. This is significant, because rainwater contains hydrogen peroxide in concentrations of 2 to 40 micromolar, and ocean water has been known to reach millimolar levels of H2O2 after a rain storm.
If you're wondering why rain contains hydrogen peroxide, the peroxide gets there in two ways. One is UV-frequency photochemistry (where water is cleaved to H and OH, then reforms as H2 and H2O2); the other is via ionization reactions caused by lightning. (Lightning is energetic enough to bring airborne oxygen and water to a plasma state. The resulting ionization and rearrangement of free atoms yields a certain amount of hydrogen peroxide.) The presence of H2O2 in rainwater has been confirmed many times, and in fact there's a well-preserved "fossil record" of it in polar icepacks, going back centuries. (Polar snowpacks contain from 10 to 900 ppb of H2O2; it varies seasonally, the max coming in summer.)
Bottom line: every rain event (over land or sea) constitutes a hydrogen peroxide challenge for microbes, which induces viral transduction (and a release of whole-cell DNA through lysis, some of which will inevitably be used in transformation). It also induces low-fidelity DNA repair (which is guaranteed to help evolution along). Every rain event, in other words, is a chance for evolution to do its thing. For bacteria, that means gene-sharing within and across species lines.
W. Ford Doolittle (who wrote a classic book chapter about lateral gene transfer called "If the Tree of Life Fell, Would We Recognize the Sound?") estimates that if a horizontal gene transfer occurs once every ten billion vertical replications, "it would be enough to ensure that no gene in any modern genome has an unbroken history of vertical descent back to some hypothetical last universal common ancestor." (See this article.)
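To see how easily that threshold is cleared, here's a crude back-of-envelope calculation; every number in it is a rough assumption (on the order of 10^30 prokaryotic cells on Earth, an average of one division per day, and Doolittle's one transfer per ten billion replications).

```python
CELLS = 1e30             # rough order-of-magnitude count of prokaryotes on Earth
DIVISIONS_PER_DAY = 1.0  # assumed average replication rate per cell
HGT_RATE = 1e-10         # Doolittle's one lateral transfer per 10^10 replications

transfers_per_day = CELLS * DIVISIONS_PER_DAY * HGT_RATE
print(f"{transfers_per_day:.0e} lateral gene transfers per day")  # ~1e20
```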
[Figure caption: Darwin's theory of a tree-like ancestor basis for all living things is dead wrong, at least for bacteria.]
It's obvious (to me, at least) that every rain event carries with it the potential to cause far more gene transfers than are necessary (according to Doolittle) to make vertical inheritance fade into insignificance as an evolutionary bringer of change. The hydrogen peroxide in rain has been driving lateral gene transfer in bacteria for eons. In fact, it is arguably the dominant driver of evolution in bacteria.
Sorry, Mr. Darwin. Point mutations handed down to sons and daughters just aren't cutting it.