Friday, May 23, 2014

Looking for LUCA

In 1964, Emile Zuckerkandl and Linus Pauling wrote a paper (published the following year) for the Journal of Theoretical Biology suggesting the use of amino-acid and nucleic-acid sequences for deducing phylogenetic relationships. Ever since then, biologists have been trying to use sequence data to get to the root of the tree of life. Darwinian logic says that at some point, all cells had to have diverged from a Last Universal Common Ancestor (LUCA). Unfortunately, as pointed out by Doolittle and others, the quest for LUCA is greatly complicated by mutational saturation effects, reductive genome loss in important members of the most ancient taxa, convergent evolution, and non-negligible (yet difficult to estimate) amounts of horizontal gene transfer, among other serious problems.

An evolutionary tree of life based on analysis of N=420 genomes of free-living organisms. Proteomes are taxa and protein fold superfamilies are character data. Adapted from Kim and Caetano-Anollés, BMC Evolutionary Biology (2011), 11:140. See text for discussion.

The difficulty (I won't say folly) of trying to construct a well-rooted tree of life is made evident in various failed attempts to trace common descent via protein sequences. In January 2010, a few months after the sequencing of the 1000th bacterial genome, Karin Lagesen, Dave W. Ussery, and Trudy M. Wassenaar published a paper in which they expressed surprise over the fact that when they looked at all 1,000 then-existing genomes (the number is now more than 10 times that), they could not find a single protein that was conserved across all bacteria. (Here, "conserved" means >50% amino-acid sequence identity.) Harris et al. took a slightly different approach, using the Clusters of Orthologous Groups (COG) database to search for universally conserved genes that follow the same phylogenetic patterns as ribosomal RNA (and therefore might constitute the ancestral genetic core of today's cells). The upshot:
Of the roughly 3100 COGs analyzed, only 80 were found to occur in all organisms. Fifty of these universally present genes showed the same phylogenetic relationships as rRNA.
Harris et al. found that the majority of universally conserved three-domain COG genes (37 of 50) are physically associated with the ribosome. Surprisingly, they found that "relatively few genes encoding proteins involved in DNA replication or transcription from DNA to RNA proved to be three-domain." In particular, RNA polymerases (except for certain subunits) did not follow rRNA distribution patterns and are not conserved across the three domains of life (archaea, bacteria, and eukaryotes). Moreover, the only component of the replicative DNA polymerase in modern cells that was found to be conserved across domains was DnaN (COG0592), the gene for the “sliding clamp.”
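
For readers who like to see the idea in code, the two screening criteria mentioned above (a >50% identity cutoff, and presence of a gene family in every genome) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of either paper: the function names, the gap-handling rule, the toy sequences, and the placeholder family and genome names are all invented for the example.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity between two PRE-ALIGNED sequences of equal length.

    Assumption: columns where either sequence has a gap ('-') are skipped.
    Other conventions exist and give different numbers.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def universal_families(genome_to_families: dict[str, set[str]]) -> set[str]:
    """Return the gene families present in every genome (a set intersection)."""
    if not genome_to_families:
        return set()
    return set.intersection(*genome_to_families.values())


# Toy data: genome and family names are placeholders, not real annotations.
toy_genomes = {
    "bacterium_1": {"fam_a", "fam_b", "fam_c"},
    "archaeon_1": {"fam_a", "fam_c"},
    "eukaryote_1": {"fam_a", "fam_d"},
}

shared = universal_families(toy_genomes)          # only "fam_a" is in all three
identity = percent_identity("MKTAYIAK", "MKTAYLAK")  # 7 of 8 columns match
is_conserved = identity > 50.0                    # the cutoff quoted above
```

The point of the sketch is how quickly the intersection shrinks: every added genome can only remove families from the shared set, never add them, which is part of why a thousand genomes can leave nothing behind.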

These disappointing results are understandable and perhaps expected, given the huge amount of deck-reshuffling that's happened in three billion years. It might well be that genome sequence data, with its constant churn, represents the wrong level of granularity for deep-phylogenetic studies. What matters for organisms, after all, is function, and function is an outcome of protein tertiary structure, not just primary structure.

With that in mind, Kyung Mo Kim and Gustavo Caetano-Anollés in 2011 published a brilliant study in BMC Evolutionary Biology called "The proteomic complexity and rise of the primordial ancestor of diversified life," relying on major structural motifs as the unit of phylogenetic discrimination. Defining protein domains at the highly conserved fold superfamily (FSF) level of structure, Kim and Caetano-Anollés used an iterative, parsimony-based phylogenomic approach to reconstructing FSF repertoires as upper and lower bounds of a presumed urancestral proteome ("ur" here meaning universal). Their conclusion:
The minimum urancestral FSF set reveals the urancestor had advanced metabolic capabilities, was especially rich in nucleotide metabolism enzymes, had pathways for the biosynthesis of membrane sn1,2 glycerol ester and ether lipids, and had crucial elements of translation, including a primordial ribosome with protein synthesis capabilities. It lacked however fundamental functions, including transcription, processes for extracellular communication, and enzymes for deoxyribonucleotide synthesis. Proteomic history reveals the urancestor is closer to a simple progenote organism but harbors a rather complex set of modern molecular functions.
The paper is quite long (14,700 words) and often relentlessly technical, but convincingly restores the quest for LUCA to the firm empirical grounding that such a quest seemed (for a while) to have been robbed of after Doolittle's "Uprooting the Tree of Life" and Dagan and Martin's "The Tree of One Percent." 

While parasitic microorganisms were found to occupy some of the most ancient branches of the superkingdom tree, Kim and Caetano-Anollés nevertheless decided to omit such organisms from their study since reductive evolution (wholesale loss of entire families of enzymes and their control systems) might otherwise skew the results. The final set of free-living organisms included 48 archaeal, 239 bacterial, and 133 eukaryotic members. To avoid potential problems with long-branch attraction, the researchers wisely sampled (at random) equal numbers of proteomes per superkingdom and replicated trees of proteomes, so that bacterial data (which of course predominated) wouldn't swamp archaea or eukaryota.
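
The balanced-sampling idea is easy to picture in code. The following is a minimal sketch, assuming Python's standard `random` module and made-up proteome labels; the sample size of 48 (the size of the smallest group) and the five replicates are illustrative choices of mine, not figures taken from the paper.

```python
import random
from typing import Optional


def sample_equal(groups: dict[str, list[str]], n: int,
                 seed: Optional[int] = None) -> dict[str, list[str]]:
    """Draw n members at random, without replacement, from every group."""
    smallest = min(len(members) for members in groups.values())
    if n > smallest:
        raise ValueError(f"n={n} exceeds the smallest group size ({smallest})")
    rng = random.Random(seed)
    return {name: rng.sample(members, n) for name, members in groups.items()}


# Group sizes come from the post; the proteome labels are placeholders.
groups = {
    "Archaea": [f"archaeon_{i}" for i in range(48)],
    "Bacteria": [f"bacterium_{i}" for i in range(239)],
    "Eukarya": [f"eukaryote_{i}" for i in range(133)],
}

# Each replicate is an independent balanced draw; a tree would then be built
# from each one (tree building itself is out of scope for this sketch).
replicates = [sample_equal(groups, n=48, seed=rep) for rep in range(5)]
```

Because every replicate contains the same number of proteomes from each superkingdom, the 239 bacterial proteomes cannot outvote the 48 archaeal ones, and agreement across replicates gives some reassurance that a result doesn't hinge on one lucky draw.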

Among the many fascinating findings in the study:
  • The earliest start of organismal diversification occurred sometime between 2.91 and 2.03 billion years ago.
  • Translation had metabolic origins. It appeared only after the emergence "of a large number of metabolic functions, but before enzymes necessary for the synthesis of DNA."
  • Proteomic analysis of extant fold superfamilies (FSFs) showed that "over 200 additional FSFs are necessary in urancestral FSF sets to account for the complexity of the simplest organism in existence today."
  • None of the domains present in ribonucleotide reductase (RDR) enzymes was present in the min_set (representing the LUCA lower bound of complexity). Further, "We note that the reduction of ribonucleotides to deoxyribonucleotides involves the production of an active site thiyl radical that requires contacts with cysteines in all protein domains of the catalytic subunit of the oligomeric enzymatic complex, suggesting modern ribonucleotide reductase functions is [sic] indeed derived."
  • Commenting on the known active-site domain homology between class III ribonucleotide reductase and pyruvate formate lyase (a link proposed to have mediated the RNA-to-DNA biological transition), Kim and Caetano-Anollés point out that phylogenomic analysis at the fold-family level suggests the pyruvate formate-lyase domain emerged later than its ribonucleotide reductase counterpart. Therefore it's likely that the urancestor stored genetic information as RNA and not DNA.
Kim and Caetano-Anollés note: "The urancestor had an advanced metabolic network, especially rich in nucleotide metabolism enzymes, had primordial pathways for the biosynthesis of membrane glycerol ether and ester lipids, crucial elements of translation, including amino-acyl tRNA synthases, regulatory factors, and a primordial ribosome with protein synthesis capabilities. It lacked however transcription and in advanced evolutionary stages stored genetic information in RNA (not DNA) molecules."

The authors have many interesting things to say about the evolution of archaeal and bacterial membrane-lipid chemistry (and much else). If you're a biologist and you haven't yet read the Kim and Caetano-Anollés paper, do yourself a favor and take a look at it now. It's a fascinating read, no matter what side of the LUCA fence you're on.