Showing posts with label Clostridium botulinum. Show all posts
Showing posts with label Clostridium botulinum. Show all posts

Monday, May 26, 2014

Shannon Entropy and DNA

Claude Shannon made an important finding when he realized that the information contribution of a symbol could be estimated very simply as -f(x) log(f(x)), where f(x) is the frequency of occurrence of the symbol x. For example, a series of a coin tosses can be considered a binary information stream with symbol values H and T (heads and tails). If the frequency of H is 0.5 and f(T) is 0.5, the entropy E, in bits per toss, is -0.5 times log (base 2) 0.5 for heads, and a similar value for tails. The values add up (in this case) to 1.0. The intuitive meaning of 1.0 (the Shannon entropy) is that a single coin toss conveys 1.0 bit of information. Contrast this with the situation that prevails when using a "weighted" or unfair penny that lands heads-up 70% of the time. We know intuitively that tossing such a coin will produce less information because we can predict the outcome (heads), to a degree. Something that's predictable is uninformative. Shannon's equation gives -0.7 times log(0.7) = 0.3602 for heads and -0.3 * log (0.3) = 0.5211 for tails, for an entropy of 0.8813 bits per toss. In this case we can say that a toss is 11.87% redundant.

Claude Shannon
DNA is an information stream resembling a series of four-sided-coin tosses, where the "coin" can land with values of A, T, G, or C. In some organisms, the four bases occur with equal rates (25% each), in which case the DNA has a Shannon entropy of 2.0 bits per base (which makes sense, in that a base can encode one of 22 possible values). But what about organisms in which the bases occur with unequal frequencies? For example, we know that many organisms have DNA with G+C content quite a bit less than (or in some cases more than) 50%. The information content of the DNA will be less than 2 bits per base in such cases.

As an example, let's take Clostridium botulinum (the source of "Botox" serum), a soil bacterium with unusually low G+C content, at 28%. If we go through the organism's 3,404 protein-coding genes, we find actual base contents of:

A    0.40189
T    0.30603
G    0.18255
C    0.10840

These numbers are for a single strand (the coding strand or "message" strand) of DNA, which is why A and T aren't equal. For whole DNA, of course, A =T and G = C, but that's not the case here. We're just interested in the message strand.

If we put the above base frequencies into the Shannon equation, we come up with a value of 1.8467 for the information content (in bits) of one base of C. botulinum DNA. The DNA is about 7.67% redundant on a zero-order entropy basis. The DNA may be over 70% A and T, but it's a long way from being a two-base (one bit) information stream. Each base encodes an average of 1.8467 bits of information, which is a surprising amount (surprisingly close to 2.0) for such a skewed alphabet.

Tuesday, May 13, 2014

Anaerobic Catalases and Codon Pseudodegeneracy

Today I want to demolish two myths, one being that strict anaerobes (such as members of the Clostridia, a class of bacteria that includes the botulism organism as well as the C. diff bug that sickens half a million people a year in the U.S.) lack a catalase enzyme, and the other being that codons have very little degeneracy in base one. Don't worry, I'll parse all this for you so that even if you're not a bio-geek, you'll get it.

Bacteriology students are taught from Day One that strict anaerobes are killed by oxygen because they lack a catalase enzyme. Catalase is the nearly universal enzyme that breaks down peroxides into oxygen and water. The catechism about anaerobes being catalase-negative is true for some anaerobes, but by no means all. In fact, if you go to UniProt.org and do a search on "catalase clostridium" you'll get a long list of hits representing anaerobes that do, in fact, have a perfectly normal catalase enzyme. Some anaerobes have the classic katG (E. coli) version of catalase while others (like C. difficile) have a special manganese-containing catalase. Pathogenic anaerobes probably use the enzyme to combat the respiratory burst that accompanies engulfment by white blood cells. I've written about anaerobic  catalases before; for an overview of the enzymology and phylogenetic breakdown by enzyme type, start here.

Methanospirillum (a strict anaerobe found
in sewage) produces colonies with
characteristic striations spaced two
cell-lengths apart.
I thought it would be interesting to compare the katG genes of two anaerobes to see how they differ. I decided to focus on C. botulinum B (the nasty food-poisoning organism) and the archeon Methanospirillum hungatei strain JF-1, which is named after a professor I had in grad school, Bob Hungate, a true pioneer in the study of anaerobic bacteria.

C. botulinum and M. hungatei are worlds apart, taxonomically. The genome of the former has a G+C (guanine, cytosine) content of just 27%, while the latter has G+C of 45%. Comparing their catalases, I found 717 nucleotide differences between the two genes (which were 2193 bases long in one case and 2157 in the other case). On the surface, the genes seem (from a DNA sequence point of view) quite far apart phylogenetically. But after performing an alignment in Mega6, I was able to study the codons base by base and found that of 717 differences, 500 changes were synonoymous (resulting in no amino acid substitution) while only 217 changes were non-synonymous. Not unexpectedly, most of the synonymous changes were due to differences in the third codon base (where the genetic code is highly degenerate; see below).

Codon Usage in Clostridium botulinum B
Codon Usage: The Standard Code (transl_table=1)
CCA(P) 1.45%
CCG(P) 0.10%
CCT(P) 1.00%
CCC(P) 0.06%

CGA(R) 0.08%
CGG(R) 0.01%
CGT(R) 0.18%
CGC(R) 0.02%

CAA(Q) 2.01%
CAG(Q) 0.25%
CAT(H) 1.10%
CAC(H) 0.16%

CTA(L) 0.77%
CTG(L) 0.10%
CTT(L) 1.67%
CTC(L) 0.06%
GCA(A) 2.66%
GCG(A) 0.20%
GCT(A) 2.18%
GCC(A) 0.19%

GGA(G) 3.30%
GGG(G) 0.47%
GGT(G) 2.08%
GGC(G) 0.40%

GAA(E) 6.40%
GAG(E) 1.11%
GAT(D) 5.12%
GAC(D) 0.62%

GTA(V) 2.79%
GTG(V) 0.45%
GTT(V) 2.81%
GTC(V) 0.14%
ACA(T) 2.36%
ACG(T) 0.16%
ACT(T) 2.22%
ACC(T) 0.19%

AGA(R) 2.43%
AGG(R) 0.31%
AGT(S) 1.89%
AGC(S) 0.48%

AAA(K) 7.10%
AAG(K) 2.17%
AAT(N) 6.06%
AAC(N) 0.89%

ATA(I) 5.68%
ATG(M) 2.50%
ATT(I) 4.01%
ATC(I) 0.42%
TCA(S) 2.10%
TCG(S) 0.12%
TCT(S) 1.68%
TCC(S) 0.12%

TGA(*) 0.02%
TGG(W) 0.69%
TGT(C) 1.02%
TGC(C) 0.25%

TAA(*) 0.24%
TAG(*) 0.07%
TAT(Y) 3.61%
TAC(Y) 0.55%

TTA(L) 5.68%
TTG(L) 0.74%
TTT(F) 3.74%
TTC(F) 0.57%
Pink-highlighted codons are examples of where the genetic code is degenerate for base one. 
A change of CGA to AGA (for example) results in no amino acid change: the 
coded-for amino acid is arginine in either case.


"Degeneracy" means a change in a particular base results in no change of amino acid. An example is the codon AAA (see above), which is the code for lysine (K). Changing the last base to G (resulting in a codon of AAG) still produces lysine in the resulting protein. Codons that begin with 'AA' are two-fold degenerate, because two versions (AAA and AAG) produce the same amino acid. A codon that begins with 'CC' is said to be four-fold degenerate, because any of four different endings (CCA, CCT, CCG, CCC) produce the same amino acid (P, proline).

What's less obvious is that a change in the first (rather than the third) base of a codon can produce synonyms as well. For example, changing CGA to AGA results in the same amino acid: arginine (R). Likewise, CTA changing to TTA still results in leucine (L).

But there's more degeneracy to  the first codon base than just the classic cases outlined in pink in the above table. A change of CTA to ATA produces a switch from leucine to isoleucine, which is tantamount to no change at all, because leucine and isoleucine are isomers (both have the same chemical formula; they differ slightly in arrangement).


Likewise, a change from CTx to GTx produces a change from leucine to valine, which are workalikes. One could even argue that a change from CTT or CTC to TTT or TTC (leucine to phenylalanine) is functionally a degenerate change, because leucine and phenylalanine are nonpolar, hydrophobic amino acids with similar chemical properties.

So in fact, many changes to the first base of a codon can be considered pseudodegenerate, if you will, by virtue of producing non-synonymous changes that are tantamount to very little actual chemical change. We would expect to see a significant number of such changes in mutations affecting base one of codons. And we do.

In the two catalase genes, I looked at changes involving just the first base of a codon and found the following changes:

Synonymous:
AGA<-->CGA (R:R)
AGA<-->CGA (R:R)
AGA<-->CGA (R:R)
TTA<-->CTA (L:L)
TTG<-->CTG (L:L)
TTG<-->CTG (L:L)
TTG<-->CTG (L:L)
TTG<-->CTG (L:L)

Non-Synonymous:
AAA<-->CAA (K:Q)
AAA<-->GAA (K:E)
AAA<-->GAA (K:E)
AAG<-->CAG (K:Q)
AAG<-->CAG (K:Q)
AAT<-->CAT (N:H)
AAT<-->GAT (N:D)
AAT<-->GAT (N:D)
ATT<-->CTT (I:L) *
ATT<-->GTT (I:V) *
CAT<-->GAT (H:D)
CTG<-->ATG (L:M)
CTT<-->ATT (L:I) *
GCA<-->TCA (A:S)
TAT<-->CAT (Y:H)
TCA<-->GCA (S:A)
TCT<-->CCT (S:P)
TTT<-->CTT (F:L) *


Notice that 4 out of 18 non-synonymous changes (see asterisks) fall in the category of pseudodegeneracies. That's slightly higher than the 17% we would expect by chance. So instead of 8 out of 26 first-base changes being degenerate (in the catalase genes), more like 12 out of 26 are either synonymous or near-synonymous. Bottom line, roughly half of first-base mutations are effectively synonymous, at least in this example.

Sunday, April 06, 2014

A binary signal in the second codon base

In looking at base composition statistics of codons, an amazing fact jumps out.

If you look carefully at the following graph, you can see that the cloud of gold-colored data points (representing the compositional stats for the second base in codons of Clostridium botulinum) has a second, "breakaway" cloud underneath it. (See arrow.)

Codon base composition statistics for Clostridium botulinum. Notice the breakaway cloud of gold points under the main cloud (arrow). These points represent genes in which most codons have a pyrimidine in the middle base of each codon.

To review: I made this plot by going through each DNA sequence for each coding ("CDS") gene of C. botulinum, and for each gene, I went through all codons and calculated the average purine content (as well as the average G+C content) of the bases at a positions one, two, and three in the codons. Thus, every dot represents the stats for one gene's worth of data.

After looking at graphs of this sort, three key facts about codon bases leap out:
  • Most codons, for most genes, have a purine as the first base (notice how the red cloud of points is higher than the others, centering on y=0.7).
  • The third base (often called the "wobble" base; shown blue here) has the most extreme G+C value. (This is well known.)
  • The middle base falls in one of two positions (high or low) on the purine (y) axis. There's a primary cloud of data points and a secondary cloud in a distinct region below the main cloud. The secondary cloud of gold points is centered at about 0.3 on the y-axis, meaning these are genes in which the second codon base tends to be a pyrimidine
The question is: What does it mean when you look at a gene with 200 or 300 or 400 codons, and the majority of codons have a pyrimidine in the second base?

If you examine the standard codon translation table (below), you can see that codons with a pyrimidine in the second position (represented by the first two columns of the table) code primarily for nonpolar amino acids. When a pyrimidine is in the second base, the possible amino acids are phenylalanine, serine, leucine, proline, isoleucine, methionine, threonine, valine, and alanine. Of these, all but serine and threonine are nonpolar. Therefore, a pyrimidine in position two of a codon means there's at least a 75% chance that the amino acid will have a nonpolar, hydrophobic side group.


Virtually all proteins contain some nonpolar amino acids, but when a protein contains mostly nonpolar amino acids, that protein is destined to wind up in the cell membrane (which is largely made up of lipids). Thus, we can expect to find that genes in the "breakaway" cloud of gold points in the graph further above represent membrane-associated proteins.

To check this hypothesis, I wrote a script that harvested the gene names, and protein-product names, of all the "breakaway cloud" data points. After purging genes annotated (unhelpfully) as "hypothetical protein," I was left with 37 "known" genes. They're shown in the following table.

Gene Product
CLC_0571 arsenical pump family protein
CLC_1058 L-lactate permease
CLC_1550 carbohydrate ABC transporter permease
CLC_3115 xanthine/uracil permease family protein
CLC_0813 arsenical-resistance protein
CLC_1018 putative anion ABC transporter, permease protein
CLC_1687 xanthine/uracil permease family protein
CLC_3633 sporulation integral membrane protein YtvI
CLC_2382 phosphate ABC transporter, permease protein PstA
CLC_0189 ZIP transporter family protein
CLC_3351 sodium:dicarboxylate symporter family protein
CLC_0971 cobalt transport protein CbiM
CLC_1534 methionine ABC transporter permease
CLC_2798 xanthine/uracil permease family protein
CLC_1397 manganese/zinc/iron chelate ABC transporter permease
CLC_1836 stage III sporulation protein AD
CLC_0528 high-affinity branched-chain amino acid ABC transporter, permease protein
CLC_0430 electron transport complex, RnfABCDGE type, A subunit
CLC_2523 flagellar biosynthesis protein FliP
CLC_0401 amino acid permease family protein
CLC_0383 lrgB-like family protein
CLC_0457 chromate transporter protein
CLC_0291 sodium:dicarboxylate symporter family protein
CLC_0427 electron transport complex, RnfABCDGE type, D subunit
CLC_1281 putative transcriptional regulator
CLC_2008 ABC transporter, permease protein
CLC_0868 branched-chain amino acid transport system II carrier protein
CLC_1237 monovalent cation:proton antiporter-2 (CPA2) family protein
CLC_1137 methionine ABC transporter permease
CLC_0764 putative drug resistance ABC-2 type transporter, permease protein
CLC_1953 xanthine/uracil permease family protein
CLC_2444 auxin efflux carrier family protein
CLC_0897 putative ABC transporter, permease protein
CLC_1555 C4-dicarboxylate transporter/malic acid transport protein
CLC_0374 xanthine/uracil permease family protein
CLC_0470 undecaprenyl pyrophosphate phosphatase
CLC_2648 monovalent cation:proton antiporter-2 (CPA2) family protein

Notice that with the exception of CLC_1281, a "putative transcriptional regulator," every gene product represents a membrane-associated protein: transporters, carrier proteins, permeases, etc.

I ran the same experiment on genes from Streptomyces griseus (strain XylbKG-1) and came up with 222 genes having high pyrimidine content in base two. All 222 genes specify membrane-associated proteins. (The full list is in a table below.)

The bottom line: Base two of codons acts as a binary switch. If the base is a pyrimidine, the associated amino acid will most likely (75% chance) be nonpolar. If the base is a purine, the codon will either be a stop codon (3 out of 32 codons) or the amino acid will be polar (26 out of 29 codons).

Here's the list of 222 genes from S. griseus in which the middle codon base is predominantly a pyrimidine:

SACT1_0608 ABC-type transporter, integral membrane subunit
SACT1_3730 major facilitator superfamily MFS_1
SACT1_4066 cation efflux protein
SACT1_5911 ABC-2 type transporter
SACT1_6577 SNARE associated protein
SACT1_6966 ABC-type transporter, integral membrane subunit
SACT1_7160 major facilitator superfamily MFS_1
SACT1_3151 NADH-ubiquinone/plastoquinone oxidoreductase chain 6
SACT1_3682 drug resistance transporter, EmrB/QacA subfamily
SACT1_5431 Citrate transporter
SACT1_3199 proton-translocating NADH-quinone oxidoreductase, chain M
SACT1_7301 putative integral membrane protein
SACT1_3198 NAD(P)H-quinone oxidoreductase subunit 2
SACT1_2008 arsenical-resistance protein
SACT1_3149 proton-translocating NADH-quinone oxidoreductase, chain L
SACT1_3967 MATE efflux family protein
SACT1_3148 proton-translocating NADH-quinone oxidoreductase, chain M
SACT1_5571 major facilitator superfamily MFS_1
SACT1_0651 major facilitator superfamily MFS_1
SACT1_2669 ABC-type transporter, integral membrane subunit
SACT1_1805 NADH dehydrogenase (quinone)
SACT1_3147 NAD(P)H-quinone oxidoreductase subunit 2
SACT1_6961 major facilitator superfamily MFS_1
SACT1_0992 ABC-2 type transporter
SACT1_2619 major facilitator superfamily MFS_1
SACT1_0507 major facilitator superfamily MFS_1
SACT1_0649 ABC-type transporter, integral membrane subunit
SACT1_0800 glycosyl transferase family 4
SACT1_1659 ABC-type transporter, integral membrane subunit
SACT1_1803 multiple resistance and pH regulation protein F
SACT1_4190 putative ABC transporter permease protein
SACT1_5522 drug resistance transporter, EmrB/QacA subfamily
SACT1_5568 drug resistance transporter, EmrB/QacA subfamily
SACT1_7248 Lysine exporter protein (LYSE/YGGA)
SACT1_0266 ABC-2 type transporter
SACT1_0847 Na+/solute symporter
SACT1_4378 ABC-type transporter, integral membrane subunit
SACT1_6766 ABC-type transporter, integral membrane subunit
SACT1_2522 putative integral membrane protein
SACT1_4762 amino acid permease-associated region
SACT1_4901 major facilitator superfamily MFS_1
SACT1_2616 multiple antibiotic resistance (MarC)-related protein
SACT1_3961 major facilitator superfamily MFS_1
SACT1_6236 MIP family channel protein
SACT1_1319 protein of unknown function UPF0016
SACT1_2332 copper resistance D domain protein
SACT1_5327 ABC-type transporter, integral membrane subunit
SACT1_5759 ABC-2 type transporter
SACT1_1133 ABC-type transporter, integral membrane subunit
SACT1_1562 ABC-type transporter, integral membrane subunit
SACT1_5518 major facilitator superfamily MFS_1
SACT1_3430 ABC-type transporter, integral membrane subunit
SACT1_4517 major facilitator superfamily MFS_1
SACT1_4565 drug resistance transporter, EmrB/QacA subfamily
SACT1_4994 ABC-type transporter, integral membrane subunit
SACT1_7197 major facilitator superfamily MFS_1
SACT1_5949 major facilitator superfamily MFS_1
SACT1_6233 protein of unknown function DUF6 transmembrane
SACT1_0936 ABC-type transporter, integral membrane subunit
SACT1_4993 2-aminoethylphosphonate ABC transporter, permease protein
SACT1_6954 ABC-type transporter, integral membrane subunit
SACT1_1846 Lysine exporter protein (LYSE/YGGA)
SACT1_3429 ABC-type transporter, integral membrane subunit
SACT1_3957 membrane protein of unknown function
SACT1_4612 ABC-2 type transporter
SACT1_1998 polar amino acid ABC transporter, inner membrane subunit
SACT1_2093 ABC-type transporter, integral membrane subunit
SACT1_4420 ABC-2 type transporter
SACT1_6613 putative ABC transporter permease protein
SACT1_2564 small multidrug resistance protein
SACT1_3669 major facilitator superfamily MFS_1
SACT1_4186 major facilitator superfamily MFS_1
SACT1_4850 ABC-type transporter, integral membrane subunit
SACT1_0206 major facilitator superfamily MFS_1
SACT1_5418 Lysine exporter protein (LYSE/YGGA)
SACT1_0548 ABC-type transporter, integral membrane subunit
SACT1_3332 protein of unknown function DUF6 transmembrane
SACT1_3764 sodium/hydrogen exchanger
SACT1_6278 protein of unknown function DUF6 transmembrane
SACT1_4143 ABC-2 type transporter
SACT1_3232 ABC-type transporter, integral membrane subunit
SACT1_0256 ABC-type transporter, integral membrane subunit
SACT1_2898 major facilitator superfamily MFS_1
SACT1_6510 protein of unknown function DUF6 transmembrane
SACT1_0980 CrcB-like protein
SACT1_1650 ABC-type transporter, integral membrane subunit
SACT1_4658 acyltransferase 3
SACT1_0306 protein of unknown function DUF6 transmembrane
SACT1_2372 major facilitator superfamily MFS_1
SACT1_7238 ABC-type transporter, integral membrane subunit
SACT1_0202 major facilitator superfamily MFS_1
SACT1_0591 ABC-type transporter, integral membrane subunit
SACT1_5369 protein of unknown function DUF81
SACT1_6227 C4-dicarboxylate transporter/malic acid transport protein
SACT1_6755 ABC-type transporter, integral membrane subunit
SACT1_0978 Urea transporter
SACT1_2418 ATP synthase subunit a
SACT1_5604 major facilitator superfamily MFS_1
SACT1_4891 ABC-type transporter, integral membrane subunit
SACT1_6507 protein of unknown function DUF6 transmembrane
SACT1_6754 ABC-type transporter, integral membrane subunit
SACT1_0787 ABC-2 type transporter
SACT1_2848 sodium/hydrogen exchanger
SACT1_0636 ABC-type transporter, integral membrane subunit
SACT1_1891 putative integral membrane protein
SACT1_1552 xanthine permease
SACT1_2894 putative secreted protein
SACT1_4508 sodium:dicarboxylate symporter
SACT1_7091 drug resistance transporter, EmrB/QacA subfamily
SACT1_2652 major facilitator superfamily MFS_1
SACT1_1741 amino acid permease-associated region
SACT1_1838 ABC-type transporter, integral membrane subunit
SACT1_2796 gluconate transporter
SACT1_5220 sodium/hydrogen exchanger
SACT1_6991 ABC-type transporter, integral membrane subunit
SACT1_3273 protein of unknown function DUF894 DitE
SACT1_7089 ABC-type transporter, integral membrane subunit
SACT1_7280 major facilitator superfamily MFS_1
SACT1_3467 major facilitator superfamily MFS_1
SACT1_1304 ABC-type transporter, integral membrane subunit
SACT1_6032 protein of unknown function DUF81
SACT1_4312 major facilitator superfamily MFS_1
SACT1_0876 ABC-type transporter, integral membrane subunit
SACT1_6123 citrate/H+ symporter, CitMHS family
SACT1_4359 Cl- channel voltage-gated family protein
SACT1_7325 branched-chain amino acid transport
SACT1_1160 protein of unknown function DUF140
SACT1_2265 Arsenical pump membrane protein
SACT1_3512 ABC-2 type transporter
SACT1_1018 major facilitator superfamily MFS_1
SACT1_3415 Xanthine/uracil/vitamin C permease
SACT1_5214 BioY protein
SACT1_3656 small multidrug resistance protein
SACT1_3895 SpdD2 protein
SACT1_4929 ABC-type transporter, integral membrane subunit
SACT1_3029 major facilitator superfamily MFS_1
SACT1_6312 ABC-type transporter, integral membrane subunit
SACT1_0919 L-lactate transport
SACT1_4356 ABC-2 type transporter
SACT1_0532 ABC-type transporter, integral membrane subunit
SACT1_6693 secretion protein snm4
SACT1_0967 ABC-type transporter, integral membrane subunit
SACT1_6496 major facilitator superfamily MFS_1
SACT1_6983 major facilitator superfamily permease
SACT1_0917 NADH-ubiquinone/plastoquinone oxidoreductase chain 3
SACT1_6887 major facilitator superfamily MFS_1
SACT1_2835 major facilitator superfamily MFS_1
SACT1_5544 drug resistance transporter, Bcr/CflA subfamily
SACT1_5591 ABC-type transporter, integral membrane subunit
SACT1_1201 ABC-type transporter, integral membrane subunit
SACT1_2404 ABC-2 type transporter
SACT1_0870 protein of unknown function DUF803
SACT1_6933 ABC-type transporter, integral membrane subunit
SACT1_1776 ABC-type transporter, integral membrane subunit
SACT1_3213 major facilitator superfamily MFS_1
SACT1_4210 phosphate ABC transporter, inner membrane subunit PstC
SACT1_5398 protein of unknown function DUF81
SACT1_0914 NADH-ubiquinone/plastoquinone oxidoreductase chain 6
SACT1_3261 major facilitator superfamily MFS_1
SACT1_6932 ABC-type transporter, integral membrane subunit
SACT1_0669 major facilitator superfamily MFS_1
SACT1_4255 ABC-2 type transporter
SACT1_4541 major facilitator superfamily MFS_1
SACT1_4638 major facilitator superfamily MFS_1
SACT1_0913 NADH-ubiquinone oxidoreductase chain 4L
SACT1_4443 protein of unknown function DUF81
SACT1_5396 Xanthine/uracil/vitamin C permease
SACT1_0912 NADH dehydrogenase (quinone)
SACT1_2924 Lysine exporter protein (LYSE/YGGA)
SACT1_5922 putative integral membrane protein
SACT1_1243 Bile acid:sodium symporter
SACT1_6967 ABC-type transporter, integral membrane subunit
SACT1_0911 proton-translocating NADH-quinone oxidoreductase, chain M
SACT1_3931 putative ABC transporter permease protein
SACT1_2820 ABC-2 type transporter
SACT1_3298 putative integral membrane transport protein
SACT1_4871 major facilitator superfamily MFS_1
SACT1_5873 major facilitator superfamily MFS_1
SACT1_6636 putative integral membrane protein
SACT1_0905 AbgT transporter
SACT1_5532 2-dehydro-3-deoxyphosphogluconate aldolase/4-hydroxy-2-oxoglutarate aldolase
SACT1_0910 proton-translocating NADH-quinone oxidoreductase, chain N
SACT1_2115 ABC-type transporter, integral membrane subunit
SACT1_4635 protein of unknown function DUF6 transmembrane
SACT1_5777 major facilitator superfamily MFS_1
SACT1_3979 major facilitator superfamily MFS_1
SACT1_5536 major facilitator superfamily MFS_1
SACT1_6782 major facilitator superfamily MFS_1
SACT1_0616 virulence factor MVIN family protein
SACT1_4869 ABC-type transporter, integral membrane subunit
SACT1_1581 putative integral membrane protein
SACT1_4585 major facilitator superfamily MFS_1
SACT1_6536 small multidrug resistance protein
SACT1_4024 cell cycle protein
SACT1_5296 major facilitator superfamily MFS_1
SACT1_1865 major facilitator superfamily MFS_1
SACT1_4868 ABC-type transporter, integral membrane subunit
SACT1_0955 protein of unknown function DUF803
SACT1_4296 major facilitator superfamily MFS_1
SACT1_5104 major facilitator superfamily MFS_1
SACT1_0519 Bile acid:sodium symporter
SACT1_2394 putative ABC transporter permease protein
SACT1_0661 major facilitator superfamily MFS_1
SACT1_2062 major facilitator superfamily MFS_1
SACT1_4295 ABC-type transporter, integral membrane subunit
SACT1_6828 peptidase M48 Ste24p
SACT1_3446 major facilitator superfamily MFS_1
SACT1_6631 Lysine exporter protein (LYSE/YGGA)
SACT1_1048 ABC-type transporter, integral membrane subunit
SACT1_1528 protein of unknown function DUF6 transmembrane
SACT1_2016 branched-chain amino acid transport
SACT1_6154 gluconate transporter
SACT1_5051 major facilitator superfamily MFS_1
SACT1_5531 protein of unknown function DUF81
SACT1_6480 protein of unknown function DUF1290
SACT1_0373 ABC-type transporter, integral membrane subunit
SACT1_2392 putative ABC transporter permease protein
SACT1_2724 protein of unknown function DUF107
SACT1_7257 protein of unknown function UPF0118
SACT1_2772 putative integral membrane protein
SACT1_3201 NAD(P)H-quinone oxidoreductase subunit 4L
SACT1_7066 ABC-type transporter, integral membrane subunit

Some of these genes are labeled "protein of unknown function," but I think we can predict with high confidence, based on what we know about these proteins (namely, that they're hydrophobic) that the gene products in question involve membrane-associated functions.

Bioinformatics geeks, leave a comment below.

Thursday, August 15, 2013

Converting an SVG Graph to Histograms

The graphs you get from ZunZun.com (the free graphing service) are pretty neat, but one shortcoming of ZunZun is that it won't generate histograms. (Google Charts will do histograms, but unlike ZunZun, Google won't give you SVG output.) The answer? Convert a ZunZun graph to histograms yourself. It's only SVG, after all. It's XML; it's text. You just need to edit it.

Of course, nobody wants to hand-edit a zillion <use> elements (to convert data points to histogram rects). It makes more sense to do the job programmatically, with a little JavaScript.

In my case, I had a graph of dinucleotide frequencies for Clostridium botulinum coding regions. What that means is, I tallied the frequency of occurrence (in every protein-coding gene) of 5'-CpG-3', CpC, CpA, CpT, ApG, ApA, ApC, and all other dinucleotide combinations (16 in all). Since I already knew the frequency of G (by itself), A, C, and T, it was an easy matter to calculate the expected frequency of occurrence of each dinucleotide pair. (For example, A occurs with frequency 0.403, whereas G occurs with frequency 0.183. Therefore the expected frequency of occurrence of the sequence AG is 0.403 times 0.183, or 0.0738.) Bottom line, I had 16 expected frequencies and 16 actual frequencies, for 16 dinucleotide combos. I wanted side-by-side histograms of the frequencies.

First, I went to ZunZun and entered my raw data in the ZunZun form. Just so you know, this is what the raw data looked like:

0 0.16222793723642806
1 0.11352236777965981
2 0.07364933857345456
3 0.08166221769088752
4 0.123186555838253
5 0.12107590293804558
6 0.043711462078314355
7 0.03558766171971166
8 0.07364933857345456
9 0.07262685957145093
10 0.033435825941632816
11 0.03459042802303202
12 0.055925067612781175
13 0.042792101322514244
14 0.019844425842971265
15 0.02730405457750352
16 0.123186555838253
17 0.12232085101526233
18 0.055925067612781175
19 0.05502001002972254
20 0.09354077847378013
21 0.07321410524577443
22 0.03319196776961071
23 0.028600012050969865
24 0.043711462078314355
25 0.043328337600588136
26 0.019844425842971265
27 0.0062116692282947845
28 0.03319196776961071
29 0.04195172151930211
30 0.011777822917388797
31 0.015269662767317132


I made ZunZun graph the data, and it gave me back a graph that looked like this:



Which is fine except it's not a histogram plot. And it has goofy numbers on the x-axis.

I clicked the SVG link under the graph and saved an SVG copy to my local drive, then opened the file in Wordpad.

The first thing I did was locate my data points. That's easy: ZunZun plots points as a series of <use> elements. The elements are nested under a <g> element that looks like this:

<g clip-path="url(#p0c8061f7fd)">

I hand-edited this element to have an id attribute with value "DATA":

<g id="DATA" clip-path="url(#p0c8061f7fd)">

Next, I scrolled up to the very top of the file and found the first <defs> tag. Under it, I placed the following empty code block:

<script type="text/ecmascript"><![CDATA[
// code goes here

]]></script>

Then I went to work writing code (to go inside the above block) that would find the <use> elements, get their x,y values, and create <rect> elements of a height that would extend to the x-axis line.

The code I came up with looks like this:



// What is the SVG y-value of the x-axis?
// Attempt to discover by introspecting clipPath

function findGraphVerticalExtent( ) {
   var cp = document.getElementsByTagName('clipPath')[0];
   var rect = cp.childNodes[1];
   var top = rect.getAttribute('y') * 1;
   var bottom = rect.getAttribute('height') * 1;
   return top + bottom;
}


// This is for use with SVG graphs produced by ZunZun,
// in which data points are described in a series of
// <use> elements. We need to get the list of <use>
// nodes, convert it to a JS array, sort data points by
// x-value, and replace <use> with <rect> elements.

function changeToHistograms( ) {

   var GRAPH_VERTICAL_EXTENT = findGraphVerticalExtent( );

   // The 'g' element that encloses the 'use' elements
   // needs to have an id of "DATA" for this to work!
   // Manually edit the <g> node's id first!
   var data = document.getElementById( "DATA" );

   // NOTE: The following line gets a NodeList object,
   // which is NOT the same as a JavaScript array!
   var nodes = data.getElementsByTagName( "use" );

   // utility routine (an inner method)
   function nodeListToJavaScriptArray( nodes ) {

       var results = [];

       for (var i = 0; i < nodes.length; i++)
          results.push( nodes[i] );

       return results;
   }

   // utility routine (another inner method)
   function compareX( a,b ) {
       return a.getAttribute("x") * 1 - b.getAttribute("x") * 1;
   }

   var use = nodeListToJavaScriptArray( nodes );

   // We want the nodes in x-sorted order
   use.sort( compareX ); // presto, done

   // Main loop
   for (var i = 0; i < use.length; i++) {

       var rect =
           document.createElementNS("http://www.w3.org/2000/svg", "rect");
       var item = use[i];
       var x = item.getAttribute( "x" ) * 1;
       var y = item.getAttribute( "y" ) * 1;
       var rectWidth = 8;
       var rectHeight = GRAPH_VERTICAL_EXTENT - y;
       rect.setAttribute( "width", rectWidth.toString() );
       rect.setAttribute( "height", rectHeight.toString() );
       rect.setAttribute( "x" , x.toString() );
       rect.setAttribute( "y" , y.toString() );

       // We will alternate colors, pink/purple
       rect.setAttribute( "style" ,
           (i%2==0)? "fill:ce8877;stroke:none" : "fill:8877dd;stroke:none" );

       data.appendChild( rect ); // add a new rect
       item.remove(); // delete the old <use> element
   }

   return use;
}

As so often happens, I ended up writing more code than I thought it would take. The above code works fine for converting data points to histogram bars (as long as you remember to give that <g> element the id attribute of "DATA" as mentioned earlier). But you need to trigger the code somehow. Answer: insert onload="changeToHistograms( )" in the <svg> element at the very top of the file.

But I wasn't done, because I also wanted to apply data labels to the histogram bars (labels like "CG," "AG," "CC," etc.) and get rid of the goofy numbers on the x-axis.

This is the function I came up with to apply the labels:


   function applyLabels( sortedNodes ) {
 
    var labels = ["aa", "ag", "at", "ac", 
      "ga", "gg", "gt", "gc", "ta", "tg", 
      "tt", "tc", "ca", "cg", "ct", "cc"];

      var data = document.getElementById( "DATA" );
 var labelIndex = 0;

 for (var i = 0; i < sortedNodes.length; i+=2) {
     var text = 
              document.createElementNS("http://www.w3.org/2000/svg", "text");
     var node = sortedNodes[i];
          text.setAttribute( "x", String( node.getAttribute("x")*1 +2) );
          text.setAttribute( "y", String( node.getAttribute("y")*1 - 13 ) );
          text.setAttribute( "style", "font-size:9pt" );
          text.textContent = labels[ labelIndex++ ].toUpperCase();
          text.setAttribute( "id", "label_" + labelIndex );
          data.appendChild( text );
      }
   }


And here's a utility function that can strip numbers off the x-axis:

   // Optional. Call this to remove ZunZun graph labels.
   // pass [1,2,3,4,5,6,7,8,9] to remove x-axis labels
   function removeZunZunLabels( indexes ) {
 
 for (var i = 0;i < indexes.length;i++) 
    try {
   document.getElementById("text_"+indexes[i]).remove();
   }
  catch(e) { console.log("Index " + i + " not found; skipped.");
   }
   } 
  
BTW, if you're wondering why I multiply so many things by one, it's because the attribute values that comprise x and y values in SVG are String objects. If you add them, you're concatenating strings, which is not what you want. To convert a number in string form to an actual JavaScript number (so you can add numbers and not concatenate strings), you can either multiply by one or explicitly coerce a string to a number by doing Number( x ).

The final result of all this looks like:


Final graph after surgery. Expected (pink) and actual (purple) frequencies of occurrence of various dinucleotide sequences in C. botulinum coding-region DNA.

Which is approximately what I wanted to see. The labels could be positioned better, but you get the idea.

What does the graph show? Well first of all, you have to realize that the DNA of C. botulinum is extremely rich in adenine and thymine (A and T): Those two bases constitute 72% of the DNA. Therefore it's absolutely no surprise that the highest bars are those that contain A and/or T. What's perhaps interesting is that the most abundant base (A), which should form 'AA' sequences at a high rate, doesn't. (Compare the first bar on the left to the shorter purple bar beside it.) This is especially surprising when you consider that AAA, GAA, and AAT are by far the most-used codons in C. botulinum. In other words, 'AA' occurs a lot, in codons. But even so, it doesn't occur as much as one would expect.

It's also interesting to compare GC with CG. (Non-biologists, note that these two pairs are not equivalent, because DNA has a built-in reading direction. The notation GC, or equivalently, GpC, means there's a guanine sitting on the 5' side of cytosine. The notation CG means there's a guanine on the 3' side of cytosine. The 5' and 3' numbers refer to deoxyribose carbon positions.) The GC combo occurs more often than predicted by chance whereas the combination CG (or CpG, as it's also written) occurs much less frequently than predicted by chance. The reasons for this are fairly technical. Suffice it to say, it's a good prima facie indicator that C. botulinum DNA is heavily methylated. Which in fact it is.

Sunday, July 14, 2013

DNA Strand Asymmetry: More Surprises

The surprises just keep coming. When you start doing comparative genomics on the desktop (which is so easy with all the great tools at genomevolution.org and elsewhere), it's amazing how quickly you run into things that make you slap yourself on the side of the head and go "Whaaaa????"

If you know anything about DNA (or even if you don't), this one will set you back.

I've written before about Chargaff's second parity rule, which (peculiarly) states that A = T and G = C not just for double-stranded DNA (that's the first parity rule) but for bases in a single strand of DNA. The first parity rule is basic: It's what allows one strand of DNA to be complementary to another. The second parity rule is not so intuitive. Why should the amount of adenine have to equal the amount of thymine (or guanine equal cytosine) in a single strand of DNA? The conventional argument is that nature doesn't play favorites with purines and pyrimidines. There's no reason (in theory) why a single strand of DNA should have an excess of purines over pyrimidines or vice versa, all things being equal.

But it turns out, strand asymmetry vis-a-vis purines and pyrimidines is not only not uncommon, it's the rule. (Some call it Szybalski's rule, in fact.) You can prove it to yourself very easily. If you obtain a codon usage chart for a particular organism, then add the frequencies of occurrence of each base in each codon, you can get the relative abundances of the four bases (A, G, T, C) for the coding regions on which the codon chart was based. Let's take a simple example that requires no calculation: Clostridium botulinum. Just by eyeballing the chart below, you can quickly see that (for C. botulinum) codons using purines A and G are way-more-often used than codons containing pyrimidines T and C. (Note the green-highlighted codons.)


If you do the math, you'll find that in C. botulinum, G and A (combined) outnumber T and C by a factor of 1.41. That's a pretty extreme purine:pyrimidine ratio. (Remember that we're dealing with a single strand of DNA here. Codon frequencies are derived from the so-called "message strand" of DNA in coding regions.)

I've done this calculation for 1,373 different bacterial species (don't worry, it's all automated), and the bottom line is, the greater the DNA's A+T content (or, equivalently, the less its G+C content), the greater the purine imbalance. (See this post for a nice graph.)

If you inspect enough codon charts you'll quickly realize that Chargaff's second parity rule never holds true (except now and then by chance). It's a bogus rule, at least in coding regions (DNA that actually gets transcribed in vivo). It may have applicability to pseudogenes or "junk DNA" (but then again, I haven't checked; it may well not apply there either).

If Chargaff's second rule were true, we would expect to find that G = C (and A = T), because that's what the rule says. I went through the codon frequency data for 1,373 different bacterial species and then plotted the ratio of G to C (which Chargaff says should equal 1.0) for each species against the A+T content (which is a kind of phylogenetic signature) for each species. I was shocked by what I found:

Using base abundances derived from codon frequency data, I calculated G/C for 1,373 bacterial species and plotted it against total A+T content. (Each dot represents a genome for a particular organism.) Chargaff's second parity rule predicts a horizontal line at y=1.0. Clearly, that rule doesn't hold. 

I wasn't so much shocked by the fact that Chargaff's rule doesn't hold; I already knew that. What's shocking is that the ratio of G to C goes up as A+T increases, which means G/C is going up even as G+C is going down. (By definition, G+C goes down as A+T goes up.)

Chargaff says G/C should always equal 1.0. In reality, it never does except by chance. What we find is, the less G (or C) the DNA has, the greater the ratio of G to C. To put it differently: At the high-AT end of the phylogenetic scale, cytosine is decreasing faster (much faster) than guanine, as overall G+C content goes down.

When I first plotted this graph, I used a linear regression to get a line that minimizes the sum of squared absolute error. That line turned out to be given by 0.638 + [A+T]. Then I saw that the data looked exponential, not linear. So I refitted the data with a power curve (the red curve shown above) given by

G/C  = 1.0 + 0.587*[A+T] + 1.618*[A+T]2

which fit the data even better (minimum summed error 0.1119 instead of 0.1197). What struck me as strange is that the Golden Ratio (1.618) shows up in the power-curve formula (above), but also, the linear form of the regression has G/C equaliing 1.638 when [A+T] goes to 1.0. Which is almost the Golden Ratio.

In a previous post, I mentioned finding that the ratio A/T tends to approximate the Golden Ratio as A+T approaches 1.0. If this were to hold true, it could mean that A/T and G/C both approach the Golden Ratio as A+T approaches 1.0, which would be weird indeed.

For now, I'm not going to make the claim that the Golden Ratio figures into any of this, because it reeks too much of numerology and Intelligent Design (and I'm a fan of neither). I do think it's mildly interesting that A/T and G/C both approach a similar number as A+T approaches unity.

Comments, as usual, are welcome.