Sunday, June 29, 2014

A Large Genome with Lots of Structure

In contrast to higher life forms, bacteria usually have compact genomes, with few duplicate genes and very little non-coding DNA. But some bacteria, for reasons not entirely understood, accumulate relatively large genomes. A good example is Sorangium cellulosum, a soil-dweller of the Myxococcales group, whose 14-million-base-pair genome (comprising 10,400 protein-coding genes) dwarfs that of E. coli B (with 4.6 million base pairs and 4,205 genes) and makes the 476-gene genome of Mycoplasma genitalium look puny. Bear in mind that the fruit fly genome contains about 14,000 protein genes (although several times that number of proteins may be produced through alternative splicing of exons).

Exactly why S. cellulosum needs more genes than, for example, baker's yeast (with 12 million base pairs and around 6,700 genes) is anybody's guess. It does have many accessory genes for producing secondary metabolites with interesting antifungal, antibacterial, and other properties (including anti-tumor properties). As a result, many labs are busy mining the Sorangium genome for genes of possible commercial importance.
Sorangium cellulosum

I recently decided to poke around inside the genome of S. cellulosum myself, looking for evidence of latent secondary structure (internal complementarity regions) in its genes. I was stunned at what I found. When I had my scripts look for complementing intragenic 11-mers (pairs of complementary sequences of length 11), I found over 36,000 such pairs in Sorangium's genes.

Next, I went a step further and checked each gene for internal complementing sequences of length 14.

Based on Sorangium's actual A, G, C, and T composition stats, and considering all the kinds of 14-mers that actually exist in the coding regions of the genome, I expected to find 991 matching (complementing) pairs of 14-mers in 10,400 genes. What I actually found were 2,942 matching pairs inside 1,928 genes.
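To give a feel for how an expectation like this can be built from base composition and gene length, here is a minimal sketch under a deliberately simple null model: every base is drawn independently at the gene's overall A, C, G, T frequencies. This is not the procedure behind the 991 figure (which also took into account which 14-mers actually occur in the coding regions), and the function name and the composition in the usage lines are hypothetical.

```python
def expected_complementary_pairs(length, k, freqs):
    """Expected number of non-overlapping reverse-complement k-mer pairs
    in a random sequence of `length` bases, assuming independent bases
    drawn at the frequencies in `freqs` (a simplifying assumption)."""
    # Chance that two independently drawn bases can pair (A-T or C-G).
    p_base_pair = 2 * freqs["A"] * freqs["T"] + 2 * freqs["C"] * freqs["G"]
    windows = length - k + 1          # number of k-mer start positions
    if windows <= k:
        return 0.0                    # too short to hold two separate k-mers
    # Count window pairs (i, j) with j - i >= k, i.e. sharing no bases.
    n_pairs = (windows - k) * (windows - k + 1) // 2
    # All k positions must pair for the two windows to be reverse complements.
    return n_pairs * p_base_pair ** k


# Illustrative GC-rich composition (not Sorangium's measured values).
gc_rich = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}
print(expected_complementary_pairs(1500, 14, gc_rich))
```

Summing this kind of per-gene expectation over all genes gives a genome-wide figure to compare against the observed count. Skewed composition raises the expectation, because two randomly chosen bases are more likely to pair.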

To make this clearer, I plotted the expected number of complementary 14-mers per gene versus the actual number per gene, in a graph:

Expected vs. actual complementing intragenic 14-mers for Sorangium. Expectation statistics were calculated individually, for each gene, based on actual A,G,C,T composition stats and gene length.
The points are arrayed in horizontal lines because while expectations can be calculated to several decimal points, actual occurrences are discrete (whole numbers).

It's fairly evident that length-14 complementary pairs tend to occur at higher than the expected rate(s). In fact, that's true for 98% of occurrences. It might not be obvious (from the above plot) that 98% of points lie above the 1:1 slope line, but that's only because so many points overlap each other.

The bottom line? These results strongly suggest that substantial amounts of secondary structure exist in a significant fraction of Sorangium's genes. The secondary structure could be tied to thermal regulation of gene expression (via RNA thermometers), or some mRNAs could incorporate metallo-sensitive riboswitches; or maybe secondary structure in mRNA is important for certain translocon-targeted genes. There could be other explanations as well. (If you have one, leave a comment.)

Why is this important? For one thing, we need a better understanding of how an organism with 10,400 protein genes regulates and coordinates gene expression. Secondary structure of regulatory and coding-region RNA might well hold important clues. But also, if secondary structure is conserved in large numbers of genes (as I believe it is), it has to affect codon bias. "Complementing" codons would be preferred (at least for certain regions) over non-complementing codons, and this would affect codon choice. It's a factor that has not been considered, to date, in arguments over why codon bias exists.

Friday, June 27, 2014

A Tiny Genome with Lots of Structure

Mycoplasma genitalium is a super-tiny bacterial parasite of the human urinary system, causing about 15% of urinary infections in men. Its genome, comprising just 476 protein-coding genes, is often cited as the smallest genome of any organism that can be grown in pure culture. The genome is so small that scientists have for years used M. genitalium as a kind of litmus test for the essentiality of genes: If a particular kind of gene exists in M. genitalium (the reasoning goes), it must be truly essential for life.
Mycoplasma genitalium

Recently, I looked at the genome of M. genitalium from the point of view of latent nucleic-acid secondary structure. (Secondary structure refers to the ability of a single strand of DNA or RNA to base-pair with itself.) I probed each gene with a script designed to detect intragenic self-complementing regions of length eleven (so-called 11-mers), the idea being that complementary runs of bases shorter than that might occur at a high rate by chance. Given M. genitalium's base composition stats (adenine being most prevalent, at 36.24% of all bases in protein-coding regions, thymine being next-most-abundant, at 32.21%), the most likely 11-mer, 'AAAAAAAAAAA', could be expected to occur by chance once every 70,672 bases. In actuality, that particular 11-mer doesn't occur in the M. genitalium genome, but at the chance rate we would expect around 8 occurrences of it in 528,500 base pairs. By contrast, what we actually find are 749 occurrences of complementary 11-mers in 283 genes.
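The chance-rate arithmetic is easy to reproduce. The sketch below assumes bases occur independently at the quoted adenine frequency; the variable names are mine, and the result differs slightly from the 70,672 figure because the percentage is rounded.

```python
FREQ_A = 0.3624             # adenine fraction of protein-coding bases (quoted above)
K = 11                      # probe length
SEQUENCE_LENGTH = 528_500   # base pairs, as quoted above

# Probability that a given 11-base window is all adenine,
# assuming each base is drawn independently.
p_poly_a = FREQ_A ** K

bases_per_hit = 1 / p_poly_a                  # roughly 70,600 bases per chance hit
expected_hits = SEQUENCE_LENGTH * p_poly_a    # between 7 and 8 chance hits

print(round(bases_per_hit), round(expected_hits, 1))
```

Any other 11-mer is less probable than the all-A one under this model, so its chance rate would be lower still.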

The procedure used to check for 11-mers is as follows (in pseudocode):

for ( i = 0; i < genes.length; i++ )  // for each gene
    for ( k = 0; k <= genes[i].length - 11; k++ ) // for each 11-base window
        sequence = genes[i].substring( k, k + 11 ) // get 11 bases
        complement = getReverseComplement( sequence ) // the partner to look for
        if ( genes[i].match( complement ) ) // a match exists
        if ( matchNotWithinMask( mask ) ) // not inside a previous match
        if ( matchNotInsideSequence( sequence ) ) // not overlapping the probe itself
             matches++    // count it as a hit
             updateMask( )  // update mask

Note that with even-length sequences it would be important to guard against self-matching sequences, since a sequence like AGCT is its own reverse complement. An odd-length sequence can never be its own reverse complement (its middle base would have to pair with itself), so with length-11 sequences this is not an issue. Nevertheless, it's possible for successive matches to overlap, and I wrote a check to guard against that. That's what matchNotWithinMask( mask ) and updateMask() are all about.
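For readers who want something runnable, here is a minimal Python sketch of the same idea for a single gene. It is one interpretation of the pseudocode above, not the original script. The masking rule (skip any probe or partner that touches bases already counted) and the choice to search only downstream of the probe are assumptions, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_complementary_kmers(gene, k=11):
    """Count pairs of k-mers in `gene` that are reverse complements of
    each other, never reusing a base that belongs to an earlier hit."""
    masked = [False] * len(gene)      # bases already used by a counted pair
    hits = 0
    for start in range(len(gene) - k + 1):
        if any(masked[start:start + k]):
            continue                  # probe overlaps a previous match
        target = reverse_complement(gene[start:start + k])
        # Search downstream only, so the partner cannot overlap the probe
        # and each pair is counted once.
        pos = gene.find(target, start + k)
        while pos != -1 and any(masked[pos:pos + k]):
            pos = gene.find(target, pos + 1)
        if pos == -1:
            continue
        hits += 1
        for i in range(k):            # mask both members of the pair
            masked[start + i] = True
            masked[pos + i] = True
    return hits


probe = "AAAACCCCGGG"
toy_gene = probe + "TTTTT" + reverse_complement(probe)
print(count_complementary_kmers(toy_gene))
```

The toy gene holds exactly one probe and its partner, separated by a short spacer, so it makes a convenient sanity check before running the function over real sequences.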

It should go without saying, but I'll say it anyway: 749 complementing 11-mers in 283 genes is a very substantial (and quite unexpected) amount of internal complementarity. The question is: What's the purpose of all that (putative) secondary structure?

One possibility (which I've talked about in a previous post) is that the secondary structure allows for thermostatic control of gene expression. RNA thermometers are a well-established phenomenon, and it could be that M. genitalium needs sensitive thermal control over the expression of certain genes.

Another possibility is that many genes incorporate magnesium-, manganese-, or calcium-sensitive riboswitches. RNA is a potent chelator of metal ions, particularly doubly charged ions like Mg2+, which is smaller than sodium or potassium (with twice the charge) and thus has unusually high charge density. It's possible that under conditions of osmotic stress (with high influx of water and low ion concentrations) certain mRNAs relax or uncoil and thereby become translatable. If this is true (if certain genes are under osmotic control), we might expect to find that many M. genitalium genes with high secondary structure potential are membrane-targeted genes tasked with managing the import and export of various things. And that's indeed what we find.

Further below, I present a table with the top 100 M. genitalium genes containing the most putative secondary structure based on the 11-mer probe outlined above. The table shows the gene name, gene product or function (where known), gene size in base pairs, and the number of complementing 11-mers in the gene. The genes tend to fall into only a few categories. About a third of the genes (34 out of 100) are DNA- or RNA-binding genes. A dozen genes are transporters or permeases; another 12 fall in the category of "other membrane-associated genes" (these are marked with asterisks); and ten encode lipoproteins. Sixteen are "hypothetical proteins" (which should probably be reannotated as proteins of unknown function, since we can be fairly sure the genes are expressed).

It's interesting that so many genes with (putative) secondary-structure potential are nucleic acid processing genes. Among genes in this category are ten genes that either aminoacylate or modify transfer RNAs. What's interesting about the latter group is that most involve tRNAs for non-polar amino acids (alanine, leucine, isoleucine, valine, phenylalanine, and methionine). The reason this is interesting is that non-polar amino acids are extensively used in membrane-associated proteins. Thus we have a situation, possibly, in which osmo-switches (osmotically sensitive mRNAs) control the expression of tRNA synthetases for amino acids used in membrane proteins.

It is quite possible that some of the (few) metabolic genes listed in the table are membrane-associated. This is likely true for phosphomannomutase, UDP-galactopyranose mutase, and glycerophosphoryl diester phosphodiesterase, for example. Also interesting is that 2,3-bisphosphoglycerate-independent phosphoglycerate mutase from Thermoplasma has been shown to be manganese-stimulated. Riboswitch modulation of such an enzyme would not be unexpected.

Altogether, 34 to 37 out of 100 genes listed in the table are membrane-associated, and another 7 are tRNA synthetases involving non-polar amino acids (heavily used in membrane proteins), tending to support the hypothesis that M. genitalium uses secondary structure of mRNA (and/or ssDNA) to modulate gene expression in an osmotically sensitive manner.

Table 1. Genes with high secondary structure potential in M. genitalium. Genes marked with asterisks are membrane-associated genes other than transporters or permeases. The final column shows the number of complementary 11-mer pairs found in the gene.
Gene | Product | Size (bp) | 11-mers
MG_468 | ABC transporter, permease protein | 5353 | 22
MG_064 | ABC transporter, permease protein, putative | 3997 | 17
MG_218 | HMW2 cytadherence accessory protein * | 5419 | 16
MG_386 | P200 protein | 4852 | 15
MG_414 | conserved hypothetical protein | 3112 | 13
MG_075 | 116 kDa surface antigen * | 3076 | 12
MG_422 | conserved hypothetical protein | 2509 | 12
MG_244 | UvrD/REP helicase | 2113 | 11
MG_191 | MgPa adhesin * | 4336 | 11
MG_292 | alanyl-tRNA synthetase | 2704 | 10
MG_018 | helicase SNF2 family, putative | 3097 | 10
MG_298 | chromosome segregation protein SMC | 2950 | 10
MG_345 | isoleucyl-tRNA synthetase | 2689 | 9
MG_338 | lipoprotein, putative | 3814 | 9
MG_307 | lipoprotein, putative | 3535 | 8
MG_031 | DNA polymerase III, alpha subunit, Gram-positive type | 4357 | 8
MG_080 | oligopeptide ABC transporter, ATP-binding protein | 2548 | 8
MG_321 | lipoprotein, putative | 2806 | 7
MG_340 | DNA-directed RNA polymerase, beta' subunit | 3880 | 7
MG_069 | PTS system, glucose-specific IIABC component * | 2728 | 7
MG_390 | ABC transporter, ATP-binding/permease protein | 1984 | 7
MG_525 | conserved hypothetical protein | 1996 | 7
MG_226 | amino acid-polyamine-organocation (APC) permease family protein | 1480 | 6
MG_378 | arginyl-tRNA synthetase | 1615 | 6
MG_192 | P110 protein | 3163 | 6
MG_312 | HMW1 cytadherence accessory protein * | 3421 | 6
MG_328 | conserved hypothetical protein | 2272 | 6
MG_341 | DNA-directed RNA polymerase, beta subunit | 4174 | 6
MG_291 | phosphonate ABC transporter, permease protein (P69), putative | 1633 | 5
MG_001 | DNA polymerase III, beta subunit | 1144 | 5
MG_411 | phosphate ABC transporter, permease protein PstA | 1966 | 5
MG_136 | lysyl-tRNA synthetase | 1474 | 5
MG_336 | aminotransferase, class V | 1228 | 5
MG_261 | DNA polymerase III, alpha subunit | 2626 | 5
MG_430 | 2,3-bisphosphoglycerate-independent phosphoglycerate mutase | 1525 | 5
MG_053 | phosphoglucomutase/phosphomannomutase, putative | 1654 | 5
MG_195 | phenylalanyl-tRNA synthetase, beta subunit | 2422 | 5
MG_260 | lipoprotein, putative | 2299 | 5
MG_277 | membrane protein, putative * | 2914 | 5
MG_366 | conserved hypothetical protein | 2005 | 5
MG_250 | DNA primase | 1825 | 4
MG_447 | membrane protein, putative * | 1645 | 4
MG_254 | DNA ligase, NAD-dependent | 1981 | 4
MG_003 | DNA gyrase, B subunit | 1954 | 4
MG_397 | conserved hypothetical protein | 1702 | 4
MG_096 | conserved hypothetical protein | 1954 | 4
MG_334 | valyl-tRNA synthetase | 2515 | 4
MG_141 | transcription termination factor NusA | 1597 | 4
MG_241 | conserved hypothetical protein | 1864 | 4
MG_278 | GTP pyrophosphokinase | 2164 | 4
MG_012 | alpha-L-glutamate ligases, RimK family, putative | 865 | 4
MG_123 | conserved hypothetical protein | 1417 | 4
MG_266 | leucyl-tRNA synthetase | 2380 | 4
MG_119 | ABC transporter, ATP-binding protein | 1696 | 4
MG_068 | lipoprotein, putative | 1426 | 3
MG_047 | S-adenosylmethionine synthetase | 1153 | 3
MG_456 | conserved hypothetical protein | 1006 | 3
MG_122 | DNA topoisomerase I | 2131 | 3
MG_421 | excinuclease ABC, A subunit | 2866 | 3
MG_223 | conserved hypothetical protein | 1237 | 3
MG_375 | threonyl-tRNA synthetase | 1696 | 3
MG_364 | expressed protein of unknown function | 676 | 3
MG_185 | lipoprotein, putative | 2107 | 3
MG_423 | conserved hypothetical protein | 1687 | 3
MG_067 | lipoprotein, putative | 1552 | 3
MG_008 | tRNA modification GTPase TrmE | 1330 | 3
MG_306 | membrane protein, putative * | 1183 | 3
MG_242 | expressed protein of unknown function | 1894 | 3
MG_040 | lipoprotein, putative | 1777 | 3
MG_281 | conserved hypothetical protein | 1672 | 3
MG_259 | modification methylase, HemK family | 1372 | 3
MG_204 | DNA topoisomerase IV, A subunit | 2347 | 3
MG_216 | pyruvate kinase | 1528 | 3
MG_229 | ribonucleoside-diphosphate reductase, beta chain | 1024 | 3
MG_187 | ABC transporter, ATP-binding protein | 1759 | 3
MG_094 | replicative DNA helicase | 1408 | 3
MG_184 | adenine-specific DNA modification methylase | 955 | 3
MG_303 | metal ion ABC transporter, ATP-binding protein, putative | 1075 | 3
MG_045 | spermidine/putrescine ABC transporter, spermidine/putrescine binding protein, putative | 1453 | 3
MG_051 | pyrimidine-nucleoside phosphorylase | 1267 | 3
MG_004 | DNA gyrase, A subunit | 2512 | 3
MG_385 | glycerophosphoryl diester phosphodiesterase family protein * | 712 | 3
MG_360 | ImpB/MucB/SamB family protein | 1237 | 3
MG_457 | ATP-dependent metalloprotease FtsH * | 2110 | 3
MG_065 | ABC transporter, ATP-binding protein | 1402 | 3
MG_419 | DNA polymerase III, subunit gamma and tau | 1795 | 3
MG_295 | tRNA (5-methylaminomethyl-2-thiouridylate)-methyltransferase | 1105 | 3
MG_072 | preprotein translocase, SecA subunit * | 2422 | 3
MG_464 | membrane protein, putative * | 1159 | 3
MG_089 | translation elongation factor G | 2068 | 2
MG_439 | lipoprotein, putative | 820 | 2
MG_309 | lipoprotein, putative | 3679 | 2
MG_032 | conserved hypothetical protein | 2002 | 2
MG_194 | phenylalanyl-tRNA synthetase, alpha subunit | 1027 | 2
MG_137 | UDP-galactopyranose mutase | 1216 | 2
MG_203 | DNA topoisomerase IV, B subunit | 1903 | 2
MG_021 | methionyl-tRNA synthetase | 1540 | 2
MG_029 | DJ-1/PfpI family protein | 562 | 2
MG_255 | conserved hypothetical protein | 1099 | 2
MG_314 | conserved hypothetical protein | 1333 | 2

Thursday, June 26, 2014

Books Everyone Says Everyone Should Read

I found the graphic below in David McCandless's book Information Is Beautiful and felt it was worth sharing; I couldn't stop looking at it.


Sad to say, I've read shockingly few of these titles (maybe twenty percent of them?). Some, I read at too early an age and remember dimly. Most of the ones I did read do, IMHO, deserve to be on this list. I take issue, however, with The Da Vinci Code (an execrable dung-pile between two covers) and have to wonder why The Chronicles of Narnia made the list but Gravity's Rainbow didn't. Certainly Catch-22 deserves its prominent position in the word-cloud, but does the turgid Dune (look to the right of Lolita) deserve to displace The Road or [pick your favorite dystopian epic]? Does Stranger in a Strange Land (a wooden, laughably kitsch fable about a man from Mars) even hold a candle to a book like Flowers for Algernon? Does Ayn Rand rate not one, but two placements (for The Fountainhead and Atlas Shrugged)? Really??

Of course, David McCandless did not choose the items on this list. No single person did. It was drawn from Oprah's Book Club List, Goodreads.com, and other public sources. (Read the fine print at the bottom.) Garbage in, garbage out. Still, as infographics go, and considering how many word-clouds all of us (by now) have seen, it's surprisingly compelling.

Wednesday, June 25, 2014

Why So Many Helicases?

Many DNA-processing genes have an unusual amount of internal complementarity: regions of DNA in which the DNA can fold back on itself to form stable structures. A good example is the dinG gene of Mycobacterium tuberculosis, which encodes an ATP-dependent helicase. Using the DINAMelt server's Quikfold app, I obtained the following structure prediction for the first 1,000 bases of the (1,971-bases-long) dinG gene of M. tuberculosis Erdman strain.

Structure prediction for one strand of the M. tuberculosis dinG gene (first 1,000 bases). Almost the entire sequence folds back on itself. The only part of the original sequence that doesn't self-anneal is the tiny straight line at the lower right (red arrow).

Remarkably, almost the entire sequence can form a stable self-annealing structure. Only a few bases (see red arrow, above) lack the ability to form secondary structure. Bear in mind, what we're looking at is a stable conformation involving one strand of DNA only. (Each of the two strands of dinG can form this structure, independently of one another.) The structure shown above has a Tm (melting temperature) of 66.4°C in 1 M saline, with an enthalpy (ΔH) for the minimum-free-energy structure of -2,418.60 kcal/mol and a 37°C Gibbs free energy (ΔG) of -209.92 kcal/mol, meaning that at 37°C, formation of the stable structure shown here (or one very much like it) is, energetically speaking, strongly favored.
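As a rough sanity check, the three numbers hang together under a simple two-state approximation in which ΔG is zero at the melting temperature, so that ΔS = ΔH / Tm. That approximation is my assumption for illustration; it is not a description of how the folding server computes its values.

```python
DELTA_H = -2418.60             # kcal/mol, enthalpy quoted above
TM_KELVIN = 66.4 + 273.15      # melting temperature quoted above
BODY_TEMP_KELVIN = 37.0 + 273.15


def gibbs_free_energy(temp_kelvin, delta_h=DELTA_H, tm_kelvin=TM_KELVIN):
    """Two-state estimate: dG(T) = dH - T * dS, with dS fixed by dG(Tm) = 0."""
    delta_s = delta_h / tm_kelvin          # kcal/(mol*K)
    return delta_h - temp_kelvin * delta_s


dg_37 = gibbs_free_energy(BODY_TEMP_KELVIN)
print(f"{dg_37:.1f} kcal/mol")   # roughly -209, close to the quoted -209.92
```

The estimate lands within about half a kcal/mol of the reported value, which is as much agreement as a back-of-the-envelope model like this can be expected to give.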

Structures of this kind are often considered to occur in RNA, but if they also occur in single-stranded DNA, it raises interesting questions. If a gene has an energetically stable strands-apart configuration, getting the strands of duplex B-form DNA to separate might not be so hard. But more to the point, getting the self-annealing gene to come back together again as duplex DNA will require significant energy input. In molecular genetics, we're accustomed to the idea of duplex DNA requiring help from an ATP-dependent helicase to "open up" (unwind) the double helix in preparation for replication or transcription. The above diagram suggests that the problem isn't "opening up" a gene; the greater problem may be bringing the strands together again after they've assumed a stable strands-apart secondary structure. There's a substantial energy barrier to be overcome before the above structure can be relaxed into randomly coiling DNA.

This suggests that certain genes, like dinG, may be modal in terms of strand-separation state. Once the gene's strands are apart, they want to stay apart. There's an energy barrier to bringing the strands together again.

It's ironic that a helicase gene (dinG) has so much single-strand secondary structure. The gene product is an ATP-powered helicase; the gene needs its own protein product in order to zip up again. But maybe that's the point? Maybe it's a non-accidental feature.

In general, bacteria tend to have a remarkable number of helicase genes. M. tuberculosis (for example) has 16 different helicases. Other species have even more. (See table.)

Organism | Helicase genes
Myxococcus xanthus strain DK 1622 | 42
Frankia sp. strain QA3 | 41
Streptomyces cf. griseus strain XylebKG-1 | 39
Clostridium botulinum Hall strain | 26
Psychromonas ingrahamii strain 37 | 25
Mesorhizobium sp. strain BNC1 | 21
Bacillus cereus strain F837/76 | 21
Escherichia coli B rel606 | 19
Anabaena cylindrica strain PCC 7122 | 19
Mycobacterium tuberculosis Erdman strain | 16
Caulobacter crescentus strain NA1000 | 15

One might ask why this is so; why would a bacterium need 15, 19, 25, or 42 different helicases? It's quite unusual for a bacterial genome to have significant redundancy of genes, because when there are two copies of a given gene, one copy usually eventually becomes disabled (pseudogenized) and lost through random mutations. The very few exceptions to this rule tend to involve highly transcribed, highly necessary genes (such as ribosomal-RNA genes). It would be extremely unlikely for M. tuberculosis to carry around 16 "flavors" of a gene if they weren't all absolutely necessary. Duplicates would almost certainly be lost over time, especially in M. tuberculosis, which lacks a mismatch repair system. (Bacteria that lack mismatch repair enzymes have been shown experimentally to lose DNA fifty times faster than other bacteria.) The most parsimonious view is that the 16 helicases of M. tuberculosis are, in fact, critically necessary and perform different jobs.

I would suggest that perhaps the reason bacteria have so many helicases is that these are actually the "nucleic acid chaperones" that manage secondary structure in the sizable minority of genes that exhibit pronounced self-annealing of separated strands. It could be that most helicases are tasked with separating individual strands of DNA (and/or RNA) from themselves. Different types of secondary structure require different types of helicase to unravel. This might be why bacteria need so many helicases.

References
Helpful articles on DNA secondary structure:
  • Dimitrov, R. A. & Zuker, M. (2004) Prediction of hybridization and melting for double-stranded nucleic acids. Biophys. J., 87, 215-226.
  • SantaLucia, Jr., J. (1998) A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl. Acad. Sci. USA, 95, 1460-1465.
  • Walter, A. E., Turner, D. H., Kim, J., Lyttle, M. H., Müller, P., Mathews, D. H. & Zuker, M. (1994) Coaxial stacking of helixes enhances binding of oligoribonucleotides and improves predictions of RNA folding. Proc. Natl. Acad. Sci. USA, 91, 9218-9222.