Friday, June 27, 2014

A Tiny Genome with Lots of Structure

Mycoplasma genitalium is a super-tiny bacterial parasite of the human urinary system, causing about 15% of urinary infections in men. Its genome, encoding just 476 protein-coding genes, is often cited as the smallest genome of any organism that can be grown in pure culture. The genome is so small that scientists have for years used M. genitalium as a kind of litmus test for the essentiality of genes: If a particular kind of gene exists in M. genitalium (the reasoning goes), it must be truly essential for life.

Recently, I looked at the genome of M. genitalium from the point of view of latent nucleic-acid secondary structure. (Secondary structure refers to the ability of a single strand of DNA or RNA to base-pair with itself.) I probed each gene with a script designed to detect intragenic self-complementing regions of length eleven (so-called 11-mers), the idea being that complementary runs of bases shorter than that might occur at a high rate by chance. Given M. genitalium's base composition (adenine is most prevalent, at 36.24% of all bases in protein-coding regions; thymine is next-most-abundant, at 32.21%), the most likely 11-mer, 'AAAAAAAAAAA', could be expected to occur by chance once every 70,672 bases. As it happens, that particular 11-mer doesn't occur in the M. genitalium genome at all, but at the chance rate we would expect only 7 or 8 occurrences of it in a genome of 528,500 base pairs. Instead, what we actually find are 749 occurrences of complementary 11-mers in 283 genes.
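
As a quick check of the arithmetic above, here is a minimal Python sketch. It assumes bases occur independently of one another at the quoted adenine frequency; the function name expected_spacing is made up for illustration. With the rounded figure of 36.24% it gives roughly one chance occurrence per 70,600 bases, and the small difference from the 70,672 quoted above presumably comes from rounding in the percentage.

```python
# Figures quoted in the post.
P_ADENINE = 0.3624        # adenine frequency in protein-coding regions
K = 11                    # probe length
GENOME_LENGTH = 528_500   # base pairs


def expected_spacing(base_freq, k):
    """Average number of bases between chance occurrences of a k-mer
    made of one repeated base (assumes bases are independent)."""
    return 1.0 / base_freq ** k


spacing = expected_spacing(P_ADENINE, K)
expected_hits = GENOME_LENGTH / spacing

print(f"one chance occurrence every {spacing:,.0f} bases")
print(f"expected occurrences in {GENOME_LENGTH:,} bp: {expected_hits:.1f}")
```

Either way, the expected count is in the single digits, against 749 observed complementary 11-mers.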

The procedure used to check for 11-mers is as follows (in pseudocode):

for ( i = 0; i < genes.length; i++ ) {  // for each gene
    for ( k = 0; k <= genes[i].length - 11; k++ ) {  // for each 11-base window
        sequence = genes[i].substring( k, k + 11 )  // get 11 bases
        complement = getReverseComplement( sequence )
        if ( genes[i].match( complement )  // a match exists
             && matchNotWithinMask( mask )  // not inside a previous match
             && matchNotInsideSequence( sequence ) ) {  // not overlapping the probe itself
            matches++  // count it as a hit
            updateMask( )  // mark the matched bases as used
        }
    }
}

Note that with even-length sequences it would be important to guard against self-matching sequences, since a sequence like AGCT is its own reverse-complement. With length-11 sequences this is not an issue, because the middle base of an odd-length sequence would have to pair with itself. Nevertheless, it's possible for successive matches to overlap, and I wrote a check to guard against that. That's what matchNotWithinMask( mask ) and updateMask() are all about.
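
For anyone who wants something runnable, here is a minimal Python sketch of the scan for a single gene. It is one interpretation of the pseudocode, not the original script: the names reverse_complement and count_complementary_kmers are hypothetical, and the mask is assumed to cover both the probe window and the matched site, so that each complementary pair is counted once rather than once from each side.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_complementary_kmers(gene, k=11):
    """Count non-overlapping pairs of k-mers within one gene that are
    reverse complements of each other (assumed masking scheme)."""
    gene = gene.upper()
    masked = [False] * len(gene)   # True where a base belongs to a counted pair
    hits = 0
    for start in range(len(gene) - k + 1):
        if any(masked[start:start + k]):
            continue               # probe overlaps an earlier hit
        target = reverse_complement(gene[start:start + k])
        pos = gene.find(target)
        while pos != -1:
            overlaps_probe = pos < start + k and start < pos + k
            if not overlaps_probe and not any(masked[pos:pos + k]):
                hits += 1
                # Mask both halves of the pair.
                for i in range(start, start + k):
                    masked[i] = True
                for i in range(pos, pos + k):
                    masked[i] = True
                break
            pos = gene.find(target, pos + 1)   # try the next candidate site
    return hits


# Toy example: an 11-base stem, a short loop, then the stem's reverse complement.
stem = "ACGTTGCAAGC"
demo_gene = stem + "TTTTT" + reverse_complement(stem)
print(count_complementary_kmers(demo_gene))   # one complementary pair
```

Masking the probe as well as the match is a design choice made here for simplicity. A looser mask would count more hits, so the totals depend on that decision.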

It should go without saying, but I'll say it anyway: 749 complementing 11-mers in 283 genes is a very substantial (and quite unexpected) amount of internal complementarity. The question is: What's the purpose of all that (putative) secondary structure?

One possibility (which I've talked about in a previous post) is that the secondary structure allows for thermostatic control of gene expression. RNA thermometers are a well-established phenomenon, and it could be that M. genitalium needs sensitive, temperature-dependent control over the expression of certain genes.

Another possibility is that many genes incorporate magnesium-, manganese-, or calcium-sensitive riboswitches. RNA is a potent chelator of metal ions, particularly doubly charged ions like Mg2+, which is smaller than sodium or potassium (with twice the charge) and thus has unusually high charge density. It's possible that under conditions of osmotic stress (with high influx of water and low ion concentrations) certain mRNAs relax or uncoil and thereby become translatable. If this is true (if certain genes are under osmotic control), we might expect to find that many M. genitalium genes with high secondary structure potential are membrane-targeted genes tasked with managing the import and export of various things. And that's indeed what we find.

Further below, I present a table with the top 100 M. genitalium genes containing the most putative secondary structure based on the 11-mer probe outlined above. The table shows the gene name, gene product or function (where known), gene size in base pairs, and the number of complementing 11-mers in the gene. The genes tend to fall into only a few categories. About a third of the genes (34 out of 100) are DNA- or RNA-binding genes. A dozen genes are transporters or permeases; another 12 fall in the category of "other membrane-associated genes" (these are marked with asterisks); and ten encode lipoproteins. Sixteen are "hypothetical proteins" (which should probably be reannotated as proteins of unknown function, since we can be fairly sure the genes are expressed).

It's interesting that so many genes with (putative) secondary-structure potential are nucleic acid processing genes. Among genes in this category are ten genes that either aminoacylate or modify transfer RNAs. Most of these involve tRNAs for non-polar amino acids (alanine, leucine, isoleucine, valine, phenylalanine, and methionine), which matters because non-polar amino acids are extensively used in membrane-associated proteins. Thus we have a situation, possibly, in which osmo-switches (osmotically sensitive mRNAs) control the expression of tRNA synthetases for amino acids used in membrane proteins.

It is quite possible that some of the (few) metabolic genes listed in the table are membrane-associated. This is likely true for phosphomannomutase, UDP-galactopyranose mutase, and glycerophosphoryl diester phosphodiesterase, for example. Also interesting is that 2,3-bisphosphoglycerate-independent phosphoglycerate mutase from Thermoplasma has been shown to be manganese-stimulated. Riboswitch modulation of such an enzyme would not be unexpected.

Altogether, 34 to 37 out of 100 genes listed in the table are membrane-associated, and another 7 are tRNA synthetases involving non-polar amino acids (heavily used in membrane proteins), tending to support the hypothesis that M. genitalium uses secondary structure of mRNA (and/or ssDNA) to modulate gene expression in an osmotically sensitive manner.

Table 1. Genes with high secondary structure potential in M. genitalium. Genes marked with asterisks are membrane-associated genes other than transporters or permeases. The final column shows the number of complementary 11-mer pairs found in the gene.
| Gene | Product | Size (bp) | 11-mers |
|---|---|---|---|
| MG_468 | ABC transporter, permease protein | 5353 | 22 |
| MG_064 | ABC transporter, permease protein, putative | 3997 | 17 |
| MG_218 | HMW2 cytadherence accessory protein * | 5419 | 16 |
| MG_386 | P200 protein | 4852 | 15 |
| MG_414 | conserved hypothetical protein | 3112 | 13 |
| MG_075 | 116 kDa surface antigen * | 3076 | 12 |
| MG_422 | conserved hypothetical protein | 2509 | 12 |
| MG_244 | UvrD/REP helicase | 2113 | 11 |
| MG_191 | MgPa adhesin * | 4336 | 11 |
| MG_292 | alanyl-tRNA synthetase | 2704 | 10 |
| MG_018 | helicase SNF2 family, putative | 3097 | 10 |
| MG_298 | chromosome segregation protein SMC | 2950 | 10 |
| MG_345 | isoleucyl-tRNA synthetase | 2689 | 9 |
| MG_338 | lipoprotein, putative | 3814 | 9 |
| MG_307 | lipoprotein, putative | 3535 | 8 |
| MG_031 | DNA polymerase III, alpha subunit, Gram-positive type | 4357 | 8 |
| MG_080 | oligopeptide ABC transporter, ATP-binding protein | 2548 | 8 |
| MG_321 | lipoprotein, putative | 2806 | 7 |
| MG_340 | DNA-directed RNA polymerase, beta' subunit | 3880 | 7 |
| MG_069 | PTS system, glucose-specific IIABC component * | 2728 | 7 |
| MG_390 | ABC transporter, ATP-binding/permease protein | 1984 | 7 |
| MG_525 | conserved hypothetical protein | 1996 | 7 |
| MG_226 | amino acid-polyamine-organocation (APC) permease family protein | 1480 | 6 |
| MG_378 | arginyl-tRNA synthetase | 1615 | 6 |
| MG_192 | P110 protein | 3163 | 6 |
| MG_312 | HMW1 cytadherence accessory protein * | 3421 | 6 |
| MG_328 | conserved hypothetical protein | 2272 | 6 |
| MG_341 | DNA-directed RNA polymerase, beta subunit | 4174 | 6 |
| MG_291 | phosphonate ABC transporter, permease protein (P69), putative | 1633 | 5 |
| MG_001 | DNA polymerase III, beta subunit | 1144 | 5 |
| MG_411 | phosphate ABC transporter, permease protein PstA | 1966 | 5 |
| MG_136 | lysyl-tRNA synthetase | 1474 | 5 |
| MG_336 | aminotransferase, class V | 1228 | 5 |
| MG_261 | DNA polymerase III, alpha subunit | 2626 | 5 |
| MG_430 | 2,3-bisphosphoglycerate-independent phosphoglycerate mutase | 1525 | 5 |
| MG_053 | phosphoglucomutase/phosphomannomutase, putative | 1654 | 5 |
| MG_195 | phenylalanyl-tRNA synthetase, beta subunit | 2422 | 5 |
| MG_260 | lipoprotein, putative | 2299 | 5 |
| MG_277 | membrane protein, putative * | 2914 | 5 |
| MG_366 | conserved hypothetical protein | 2005 | 5 |
| MG_250 | DNA primase | 1825 | 4 |
| MG_447 | membrane protein, putative * | 1645 | 4 |
| MG_254 | DNA ligase, NAD-dependent | 1981 | 4 |
| MG_003 | DNA gyrase, B subunit | 1954 | 4 |
| MG_397 | conserved hypothetical protein | 1702 | 4 |
| MG_096 | conserved hypothetical protein | 1954 | 4 |
| MG_334 | valyl-tRNA synthetase | 2515 | 4 |
| MG_141 | transcription termination factor NusA | 1597 | 4 |
| MG_241 | conserved hypothetical protein | 1864 | 4 |
| MG_278 | GTP pyrophosphokinase | 2164 | 4 |
| MG_012 | alpha-L-glutamate ligases, RimK family, putative | 865 | 4 |
| MG_123 | conserved hypothetical protein | 1417 | 4 |
| MG_266 | leucyl-tRNA synthetase | 2380 | 4 |
| MG_119 | ABC transporter, ATP-binding protein | 1696 | 4 |
| MG_068 | lipoprotein, putative | 1426 | 3 |
| MG_047 | S-adenosylmethionine synthetase | 1153 | 3 |
| MG_456 | conserved hypothetical protein | 1006 | 3 |
| MG_122 | DNA topoisomerase I | 2131 | 3 |
| MG_421 | excinuclease ABC, A subunit | 2866 | 3 |
| MG_223 | conserved hypothetical protein | 1237 | 3 |
| MG_375 | threonyl-tRNA synthetase | 1696 | 3 |
| MG_364 | expressed protein of unknown function | 676 | 3 |
| MG_185 | lipoprotein, putative | 2107 | 3 |
| MG_423 | conserved hypothetical protein | 1687 | 3 |
| MG_067 | lipoprotein, putative | 1552 | 3 |
| MG_008 | tRNA modification GTPase TrmE | 1330 | 3 |
| MG_306 | membrane protein, putative * | 1183 | 3 |
| MG_242 | expressed protein of unknown function | 1894 | 3 |
| MG_040 | lipoprotein, putative | 1777 | 3 |
| MG_281 | conserved hypothetical protein | 1672 | 3 |
| MG_259 | modification methylase, HemK family | 1372 | 3 |
| MG_204 | DNA topoisomerase IV, A subunit | 2347 | 3 |
| MG_216 | pyruvate kinase | 1528 | 3 |
| MG_229 | ribonucleoside-diphosphate reductase, beta chain | 1024 | 3 |
| MG_187 | ABC transporter, ATP-binding protein | 1759 | 3 |
| MG_094 | replicative DNA helicase | 1408 | 3 |
| MG_184 | adenine-specific DNA modification methylase | 955 | 3 |
| MG_303 | metal ion ABC transporter, ATP-binding protein, putative | 1075 | 3 |
| MG_045 | spermidine/putrescine ABC transporter, spermidine/putrescine binding protein, putative | 1453 | 3 |
| MG_051 | pyrimidine-nucleoside phosphorylase | 1267 | 3 |
| MG_004 | DNA gyrase, A subunit | 2512 | 3 |
| MG_385 | glycerophosphoryl diester phosphodiesterase family protein * | 712 | 3 |
| MG_360 | ImpB/MucB/SamB family protein | 1237 | 3 |
| MG_457 | ATP-dependent metalloprotease FtsH * | 2110 | 3 |
| MG_065 | ABC transporter, ATP-binding protein | 1402 | 3 |
| MG_419 | DNA polymerase III, subunit gamma and tau | 1795 | 3 |
| MG_295 | tRNA (5-methylaminomethyl-2-thiouridylate)-methyltransferase | 1105 | 3 |
| MG_072 | preprotein translocase, SecA subunit * | 2422 | 3 |
| MG_464 | membrane protein, putative * | 1159 | 3 |
| MG_089 | translation elongation factor G | 2068 | 2 |
| MG_439 | lipoprotein, putative | 820 | 2 |
| MG_309 | lipoprotein, putative | 3679 | 2 |
| MG_032 | conserved hypothetical protein | 2002 | 2 |
| MG_194 | phenylalanyl-tRNA synthetase, alpha subunit | 1027 | 2 |
| MG_137 | UDP-galactopyranose mutase | 1216 | 2 |
| MG_203 | DNA topoisomerase IV, B subunit | 1903 | 2 |
| MG_021 | methionyl-tRNA synthetase | 1540 | 2 |
| MG_029 | DJ-1/PfpI family protein | 562 | 2 |
| MG_255 | conserved hypothetical protein | 1099 | 2 |
| MG_314 | conserved hypothetical protein | 1333 | 2 |