blogorrhea: June 2014

Sunday, June 29, 2014

A Large Genome with Lots of Structure

In contrast to higher life forms, bacteria usually have compact genomes, with few duplicate genes and very little non-coding DNA. But some bacteria, for reasons not entirely understood, accumulate relatively large genomes. A good example is Sorangium cellulosum, a soil-dweller of the Myxococcales group, whose 14-million-base-pair genome (comprising 10,400 protein-coding genes) dwarfs that of E. coli B (with 4.6 million base pairs and 4,205 genes) and makes the 476-gene genome of Mycoplasma genitalium look puny. Bear in mind that the fruit fly genome contains about 14,000 protein genes (although several times that number of proteins may be produced through alternative splicing of exons).

Exactly why S. cellulosum needs more genes than, for example, baker's yeast (with 12 million base pairs and around 6,700 genes) is anybody's guess. It does have many accessory genes for producing secondary metabolites with interesting antifungal, antibacterial, and other properties (including anti-tumor properties). As a result, many labs are busy mining the Sorangium genome for genes of possible commercial importance.

Sorangium cellulosum

I recently decided to poke around inside the genome of S. cellulosum myself, looking for evidence of latent secondary structure (internal complementarity regions) in its genes. I was stunned at what I found. When I had my scripts look for complementing intragenic 11-mers (pairs of complementary sequencces of length 11), I found over 36,000 such pairs in Sorangium's genes.

Next, I went a step further and checked each gene for internal complementing sequences of length 14.

Based on Sorangium's actual A, G, C, and T composition stats, and considering all the kinds of 14-mers that actually exist in the coding regions of the genome, I expected to find 991 matching (complementing) pairs of 14-mers in 10,400 genes. What I actually found were 2,942 matching pairs inside 1,928 genes.

To make this clearer, I plotted the expected number of complementary 14-mers per gene versus the actual number per gene, in a graph:

Expected vs. actual complementing intragenic 14-mers for Sorangium. Expectation statistics were calculated individually, for each gene, based on actual A,G,C,T composition stats and gene length.

The points are arrayed in horizontal lines because while expectations can be calculated to several decimal points, actual occurrences are discrete (whole numbers).

It's fairly evident that length-14 complementary pairs tend to occur at higher than the expected rate(s). In fact, that's true for 98% of occurrences. It might not be obvious (from the above plot) that 98% of points lie above the 1:1 slope line, but that's only because so many points overlap each other.

The bottom line? These results strongly suggest that substantial amounts of secondary structure exist in a significant fraction of Sorangium's genes. The secondary structure could be tied to thermal regulation of gene expression (via RNA thermometers), or some mRNAs could incorporate metallo-sensitive riboswitches; or maybe secondary structure in mRNA is important for certain translocon-targeted genes. There could be other explanations as well. (If you have one, leave a comment.)

Why is this important? For one thing, we need a better understanding of how an organism with 10,400 protein genes regulates and coordinates gene expression. Secondary structure of regulatory and coding-region RNA might well hold important clues. But also, if secondary structure is conserved in large numbers of genes (as I believe it is), it has to affect codon bias. "Complementing" codons would be preferred (at least for certain regions) over non-complementing codons, and this would affect codon choice. It's a factor that has not been considered, to date, in arguments over why codon bias exists.

Friday, June 27, 2014

A Tiny Genome with Lots of Structure

Mycoplasma genitalium is a super-tiny bacterial parasite of the human urinary system, causing about 15% of urinary infections in men. Its genome, encoding just 476 protein-coding genes, is often cited as the smallest genome of any organism that can be grown in pure culture. The genome is so small that scientists have for years used M. genitalium as a kind of litmus test for the essentiality of genes: If a particular kind of gene exists in M. genitalium (the reasoning goes), it must be truly essential for life.

Mycoplasma genitalium

Recently, I looked at the genome of M. genitalium from the point of view of latent nucleic-acid secondary structure. (Secondary structure refers to the ability of a single strand of DNA or RNA to base-pair with itself.) I probed each gene with a script designed to detect intragenic self-complementing regions of length eleven (so-called 11-mers), the idea being that complementary runs of bases shorter than that might occur at a high rate by chance. Given M. genitalium's base composition stats (adenine being most prevalent, at 36.24% of all bases in protein-coding message regions, thymine being next-most-abundant, at 32.21% of protein-coding bases), the most likely 11-mer, 'AAAAAAAAAAA', could be expected to occur by chance once every 70,672 bases. In actuality, that particular 11-mer doesn't occur in the M. genitalium genome, but if it did we could expect around 8 occurrences of it in a genome of 528,500 base pairs. Instead, what we actually find are 749 occurrences of complementary 11-mers in 283 genes.

The procedure used to check for 11-mers is as follows (in pseudocode):

for ( i = 0; i < genes.length; i++ )  // for each gene
    for ( k = 0; k < genes[i].length - 11; k++ ) // for all bases
        sequence = genes[i].substring( k, k + 11 ) // get 11 bases
        complement = getReverseComplement( sequence ) 
        if ( genes[i].match( complement ) ) // a match exists
        if ( matchNotWithinMask( mask ) ) // not inside a previous match
        if ( matchNotInsideSequence( sequence ) ) // not inside sequence 
             matches++    // count it as a hit
             updateMask( )  // update mask

Note that with even-length sequences it would be important to guard against self-matching sequences, since a sequence like AGCT is its own reverse-complement. With length-11 sequences this is not an issue. Nevertheless, it's possible for successive matches to overlap, and I wrote a check to guard against that. That's what matchNotWithinMask( mask ) and updateMask() are all about.

It should go without saying, but I'll say it anyway: 749 complementing 11-mers in 283 genes is a very substantial (and quite unexpected) amount of internal complementarity. The question is: What's the purpose of all that (putative) secondary structure?

One possibility (which I've talked about in a previous post) is that the secondary structure allows for thermostatic control of gene expression. RNA thermometers are a well established phenomenon and it could be that M. genitalium needs sensitive control over thermal expression of certain genes.

Another possibility is that many genes incorporate magnesium-, manganese-, or calcium-sensitive riboswitches. RNA is a potent chelator of metal ions, particularly doubly-charged ions like Mg⁺², which is smaller than sodium or potassium (with twice the charge) and thus has unusually high charge density. It's possible that under conditions of osmotic stress (with high influx of water and low ion concentrations) certain mRNAs relax or uncoil and thereby become translatable. If this is true (if certain genes are under osmotic control), we might expect to find that many M. genitalium genes with high secondary structure potential are membrane-targeted genes tasked with managing the import and export of various things. And that's indeed what we find.

Further below, I present a table with the top 100 M. genitalium genes containing the most putative secondary structure based on the 11-mer probe outlined above. The table shows the gene name, gene product or function (where known), gene size in base pairs, and the number of complementing 11-mers in the gene. The genes tend to fall into only a few categories. About a third of the genes (34 out of 100) are DNA- or RNA-binding genes. A dozen genes are transporters or permeases; another 12 fall in the category of "other membrane-associated genes" (these are marked with asterisks); and ten encode lipoproteins. Sixteen are "hypothetical proteins" (which should probably be reannotated as proteins of unknown function, since we can be fairly sure the genes are expressed).

It's interesting that so many genes with (putative) secondary-structure potential are nucleic acid processing genes. Among genes in this category are ten genes that either acylate or modify transfer RNAs. What's interesting about the latter group is that most involve tRNAs for non-polar amino acids (alanine, leucine, isoleucine, valine, phenylalanine, and methionine). The reason this is interesting is that non-polar amino acids are extensively used in membrane-associated proteins. Thus we have a situation, possibly, in which osmo-switches (osmotically sensitive mRNAs) control the expression of tRNA synthetases for amino acids used in membrane proteins.

It is quite possible that some of the (few) metabolic genes listed in the table are membrane-associated. This is likely true for phosphomannomutase, UDP-galactopyranose mutase, and glycerophosphoryl diester phosphodiesterase, for example. Also interesting is that 2,3-bisphosphoglycerate-independent phosphoglycerate mutase from Thermoplasma has been shown to be manganese-stimulated. Riboswitch modulation of such an enzyme would not be unexpected.

Altogether, 34 to 37 out of 100 genes listed in the table are membrane-associated, and another 7 are tRNA synthetases involving non-polar amino acids (heavily used in membrane proteins), tending to support the hypothesis that M. genitalium uses secondary structure of mRNA (and/or ssDNA) to modulate gene expression in osmotically sensitive manner.

Table 1. Genes with high secondary structure potential in M. genitalium. Genes marked with asterisks are membrane-associated genes other than transporters or permeases. The final column shows the number of complementary 11-mer pairs found in the gene.

Gene	Product	Size (bp)	11-mers
MG_468	ABC transporter, permease protein	5353	22
MG_064	ABC transporter, permease protein, putative	3997	17
MG_218	HMW2 cytadherence accessory protein *	5419	16
MG_386	P200 protein	4852	15
MG_414	conserved hypothetical protein	3112	13
MG_075	116 kDa surface antigen *	3076	12
MG_422	conserved hypothetical protein	2509	12
MG_244	UvrD/REP helicase	2113	11
MG_191	MgPa adhesin *	4336	11
MG_292	alanyl-tRNA synthetase	2704	10
MG_018	helicase SNF2 family, putative	3097	10
MG_298	chromosome segregation protein SMC	2950	10
MG_345	isoleucyl-tRNA synthetase	2689	9
MG_338	lipoprotein, putative	3814	9
MG_307	lipoprotein, putative	3535	8
MG_031	DNA polymerase III, alpha subunit, Gram-positive type	4357	8
MG_080	oligopeptide ABC transporter, ATP-binding protein	2548	8
MG_321	lipoprotein, putative	2806	7
MG_340	DNA-directed RNA polymerase, beta' subunit	3880	7
MG_069	PTS system, glucose-specific IIABC component *	2728	7
MG_390	ABC transporter, ATP-binding/permease protein	1984	7
MG_525	conserved hypothetical protein	1996	7
MG_226	amino acid-polyamine-organocation (APC) permease family protein	1480	6
MG_378	arginyl-tRNA synthetase	1615	6
MG_192	P110 protein	3163	6
MG_312	HMW1 cytadherence accessory protein *	3421	6
MG_328	conserved hypothetical protein	2272	6
MG_341	DNA-directed RNA polymerase, beta subunit	4174	6
MG_291	phosphonate ABC transporter, permease protein (P69), putative	1633	5
MG_001	DNA polymerase III, beta subunit	1144	5
MG_411	phosphate ABC transporter, permease protein PstA	1966	5
MG_136	lysyl-tRNA synthetase	1474	5
MG_336	aminotransferase, class V	1228	5
MG_261	DNA polymerase III, alpha subunit	2626	5
MG_430	2,3-bisphosphoglycerate-independent phosphoglycerate mutase	1525	5
MG_053	phosphoglucomutase/phosphomannomutase, putative	1654	5
MG_195	phenylalanyl-tRNA synthetase, beta subunit	2422	5
MG_260	lipoprotein, putative	2299	5
MG_277	membrane protein, putative *	2914	5
MG_366	conserved hypothetical protein	2005	5
MG_250	DNA primase	1825	4
MG_447	membrane protein, putative *	1645	4
MG_254	DNA ligase, NAD-dependent	1981	4
MG_003	DNA gyrase, B subunit	1954	4
MG_397	conserved hypothetical protein	1702	4
MG_096	conserved hypothetical protein	1954	4
MG_334	valyl-tRNA synthetase	2515	4
MG_141	transcription termination factor NusA	1597	4
MG_241	conserved hypothetical protein	1864	4
MG_278	GTP pyrophosphokinase	2164	4
MG_012	alpha-L-glutamate ligases, RimK family, putative	865	4
MG_123	conserved hypothetical protein	1417	4
MG_266	leucyl-tRNA synthetase	2380	4
MG_119	ABC transporter, ATP-binding protein	1696	4
MG_068	lipoprotein, putative	1426	3
MG_047	S-adenosylmethionine synthetase	1153	3
MG_456	conserved hypothetical protein	1006	3
MG_122	DNA topoisomerase I	2131	3
MG_421	excinuclease ABC, A subunit	2866	3
MG_223	conserved hypothetical protein	1237	3
MG_375	threonyl-tRNA synthetase	1696	3
MG_364	expressed protein of unknown function	676	3
MG_185	lipoprotein, putative	2107	3
MG_423	conserved hypothetical protein	1687	3
MG_067	lipoprotein, putative	1552	3
MG_008	tRNA modification GTPase TrmE	1330	3
MG_306	membrane protein, putative *	1183	3
MG_242	expressed protein of unknown function	1894	3
MG_040	lipoprotein, putative	1777	3
MG_281	conserved hypothetical protein	1672	3
MG_259	modification methylase, HemK family	1372	3
MG_204	DNA topoisomerase IV, A subunit	2347	3
MG_216	pyruvate kinase	1528	3
MG_229	ribonucleoside-diphosphate reductase, beta chain	1024	3
MG_187	ABC transporter, ATP-binding protein	1759	3
MG_094	replicative DNA helicase	1408	3
MG_184	adenine-specific DNA modification methylase	955	3
MG_303	metal ion ABC transporter, ATP-binding protein, putative	1075	3
MG_045	spermidine/putrescine ABC transporter, spermidine/putrescine binding protein, putative	1453	3
MG_051	pyrimidine-nucleoside phosphorylase	1267	3
MG_004	DNA gyrase, A subunit	2512	3
MG_385	glycerophosphoryl diester phosphodiesterase family protein *	712	3
MG_360	ImpB/MucB/SamB family protein	1237	3
MG_457	ATP-dependent metalloprotease FtsH *	2110	3
MG_065	ABC transporter, ATP-binding protein	1402	3
MG_419	DNA polymerase III, subunit gamma and tau	1795	3
MG_295	tRNA (5-methylaminomethyl-2-thiouridylate)-methyltransferase	1105	3
MG_072	preprotein translocase, SecA subunit *	2422	3
MG_464	membrane protein, putative *	1159	3
MG_089	translation elongation factor G	2068	2
MG_439	lipoprotein, putative	820	2
MG_309	lipoprotein, putative	3679	2
MG_032	conserved hypothetical protein	2002	2
MG_194	phenylalanyl-tRNA synthetase, alpha subunit	1027	2
MG_137	UDP-galactopyranose mutase	1216	2
MG_203	DNA topoisomerase IV, B subunit	1903	2
MG_021	methionyl-tRNA synthetase	1540	2
MG_029	DJ-1/PfpI family protein	562	2
MG_255	conserved hypothetical protein	1099	2
MG_314	conserved hypothetical protein	1333	2

Thursday, June 26, 2014

Books Everyone Says Everyone Should Read

I found the graphic below in David McCandless's book Information Is Beautiful and felt it was worth sharing; I couldn't stop looking at it. Click to enlarge.

Sad to say, I've read shockingly few of these titles (maybe twenty percent of them?). Some, I read at too early an age and remember dimly. Most of the ones I did read do, IMHO, deserve to be on this list. I take issue, however, with The Da Vinci Code (an execrable dung-pile between two covers) and have to wonder why The Chronicles of Narnia made the list but Gravity's Rainbow didn't. Certainly Catch-22 deserves its prominent position in the word-cloud, but does the turgid Dune (look to the right of Lolita) deserve to displace The Road or [pick your favorite dystopian epic]? Does Stranger in a Strange Land (a wooden, laughably kitsch fable about a man from Mars) even hold a candle to a book like Flowers for Algernon? Does Ayn Rand rate not one, but two placements (for The Fountainhead and Atlas Shrugged)? Really??

Of course, David McCandless did not choose the items on this list. No single person did. It was drawn from Oprah's Book Club List, Goodreads.com, and other public sources. (Read the fine print at the bottom.) Garbage in, garbage out. Still, as infographics go, and considering how many word-clouds all of us (by now) have seen, it's surprisingly compelling.

Wednesday, June 25, 2014

Why So Many Helicases?

Many DNA-processing genes have an unusual amount of internal complementarity: regions of DNA in which the DNA can fold back on itself to form stable structures. A good example is the dinG gene of Mycobacterium tuberculosis, which encodes an ATP-dependent helicase. Using the DINAMelt server's Quikfold app, I obtained the following structure prediction for the first 1,000 bases of the (1,971-bases-long) dinG gene of M. tuberculosis Erdman strain.

Structure prediction for one strand of the M. tuberculosis dinG gene (first 1,000 bases). Almost the entire sequence folds back on itself. The only part of the original sequence that doesn't self-anneal is the tiny straight line at the lower right (red arrow). Click to enlarge.

Remarkably, almost the entire sequence can form a stable self-annealing structure. Only a few bases (see red arrow, above) lack the ability to form secondary structure. Bear in mind, what we're looking at is a stable conformation involving one strand of DNA only. (Each of the two strands of dinG can form this structure, independently of one another.) The structure shown above has a Tm (melting temperature) of 66.4°C in 1M saline, with mean-free-energy enthalpy of minus-2418.60 kcal/mol and a 37°C Gibbs free energy (ΔG) of minus-209.92 kcal, meaning that at 37°C, formation of the stable structure shown here (or one very much like it) is, energetically speaking, strongly favored.

Structures of this kind are often considered to occur in RNA, but if they also occur in single-stranded DNA, it raises interesting questions. If a gene has an energetically stable strands-apart configuration, getting the strands of duplex B-form DNA to separate might not be so hard. But more to the point, getting the self-annealing gene to come back together again as duplex DNA will require significant energy input. In molecular genetics, we're accustomed to the idea of duplex DNA requiring help from an ATP-dependent helicase to "open up" (unwind) the double helix in preparation for replication or transcription. The above diagram suggests that the problem isn't "opening up" a gene; the greater problem may be bringing the strands together again after they've assumed a stable strands-apart secondary structure. There's a substantial energy barrier to be overcome before the above structure can be relaxed into randomly coiling DNA.

This suggests that certain genes, like dinG, may be modal in terms of strand-separation state. Once the gene's strands are apart, they want to stay apart. There's an energy barrier to bringing the strands together again.

It's ironic that a helicase gene (dinG) has so much single-strand secondary structure. The gene product is a DNA-powered helicase; the gene needs its own protein product in order to zip up again. But maybe that's the point? Maybe it's a non-accidental feature.

In general, bacteria tend to have a remarkable number of helicase genes. M. tuberculosis (for example) has 16 different helicases. Other species have even more. (See table.)

Organism	Helicase genes
Myxococcus xanthus strain DK 1622	42
Frankia sp. strain QA3	41
Streptomyces cf. griseus strain XylebKG-1	39
Clostridium botulinum Hall strain	26
Psychromonas ingrahamii strain 37	25
Mesorhizobium sp. strain BNC1	21
Bacillus cereus strain F837/76	21
Escherichia coli B rel606	19
Anabaena cylindrica strain PCC 7122	19
Mycobacterium tuberculosis Erdman strain	16
Caulobacter crescentus strain NA1000	15

One might ask why this is so; why would a bacterium need 15, 19, 25, or 42 different helicases? It's quite unusual for a bacterial genome to have significant redundancy of genes, because when there are two copies of a given gene, one copy usually eventually becomes disabled (pseudogenized) and lost through random mutations. The very few exceptions to this rule tend to involve highly transcribed, highly necessary genes (such as ribosomal-RNA genes). It would be extremely unlikely for M. tuberculosis to carry around 16 "flavors" of a gene if they weren't all absolutely necessary. Duplicates would almost certainly be lost over time, especially in M. tuberculosis, which lacks a mismatch repair system. (Bacteria that lack mismatch repair enzymes have been shown experimentally to lose DNA fifty times faster than other bacteria.) The most parsimonious view is that the 16 helicases of M. tuberculosis are, in fact, critically necessary and perform different jobs.

I would suggest that perhaps the reason bacteria have so many helicases is that these are actually the "nucleic acid chaperones" that manage secondary structure in the sizable minority of genes that exhibit pronounced self-annealing of separated strands. It could be that most helicases are tasked with separating individual strands of DNA (and/or RNA) from themselves. Different types of secondary structure require different types of helicase to unravel. This might be why bacteria need so many helicases.

References
Helpful articles on DNA secondary structure:

Dimitrov, R. A. & Zuker, M. (2004) Prediction of hybridization and melting for double-stranded nucleic acids. Biophys. J., 87, 215-226.
[Abstract] [Full Text] [PDF]
SantaLucia, Jr., J. (1998) A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl. Acad. Sci. USA, 95, 1460-1465.
[Abstract] [Full Text] [PDF]
Walter, A. E., Turner, D. H., Kim, J., Lyttle, M. H., Müller, P., Mathews, D. H. & Zuker, M. (1994) Coaxial stacking of helixes enhances binding of oligoribonucleotides and improves predictions of RNA folding. Proc. Natl. Acad. Sci. USA, 91, 9218-9222.
[Abstract] [Full Text] [PDF]