blogorrhea: Clostridium botulinum

Showing posts with label Clostridium botulinum. Show all posts

Monday, May 26, 2014

Shannon Entropy and DNA

Claude Shannon made an important finding when he realized that the information contribution of a symbol could be estimated very simply as -f(x) log(f(x)), where f(x) is the frequency of occurrence of the symbol x. For example, a series of a coin tosses can be considered a binary information stream with symbol values H and T (heads and tails). If the frequency of H is 0.5 and f(T) is 0.5, the entropy E, in bits per toss, is -0.5 times log (base 2) 0.5 for heads, and a similar value for tails. The values add up (in this case) to 1.0. The intuitive meaning of 1.0 (the Shannon entropy) is that a single coin toss conveys 1.0 bit of information. Contrast this with the situation that prevails when using a "weighted" or unfair penny that lands heads-up 70% of the time. We know intuitively that tossing such a coin will produce less information because we can predict the outcome (heads), to a degree. Something that's predictable is uninformative. Shannon's equation gives -0.7 times log(0.7) = 0.3602 for heads and -0.3 * log (0.3) = 0.5211 for tails, for an entropy of 0.8813 bits per toss. In this case we can say that a toss is 11.87% redundant.

Claude Shannon

DNA is an information stream resembling a series of four-sided-coin tosses, where the "coin" can land with values of A, T, G, or C. In some organisms, the four bases occur with equal rates (25% each), in which case the DNA has a Shannon entropy of 2.0 bits per base (which makes sense, in that a base can encode one of 2² possible values). But what about organisms in which the bases occur with unequal frequencies? For example, we know that many organisms have DNA with G+C content quite a bit less than (or in some cases more than) 50%. The information content of the DNA will be less than 2 bits per base in such cases.

As an example, let's take Clostridium botulinum (the source of "Botox" serum), a soil bacterium with unusually low G+C content, at 28%. If we go through the organism's 3,404 protein-coding genes, we find actual base contents of:

A    0.40189
T    0.30603
G    0.18255
C    0.10840

These numbers are for a single strand (the coding strand or "message" strand) of DNA, which is why A and T aren't equal. For whole DNA, of course, A =T and G = C, but that's not the case here. We're just interested in the message strand.

If we put the above base frequencies into the Shannon equation, we come up with a value of 1.8467 for the information content (in bits) of one base of C. botulinum DNA. The DNA is about 7.67% redundant on a zero-order entropy basis. The DNA may be over 70% A and T, but it's a long way from being a two-base (one bit) information stream. Each base encodes an average of 1.8467 bits of information, which is a surprising amount (surprisingly close to 2.0) for such a skewed alphabet.

Tuesday, May 13, 2014

Anaerobic Catalases and Codon Pseudodegeneracy

Today I want to demolish two myths, one being that strict anaerobes (such as members of the Clostridia, a class of bacteria that includes the botulism organism as well as the C. diff bug that sickens half a million people a year in the U.S.) lack a catalase enzyme, and the other being that codons have very little degeneracy in base one. Don't worry, I'll parse all this for you so that even if you're not a bio-geek, you'll get it.

Bacteriology students are taught from Day One that strict anaerobes are killed by oxygen because they lack a catalase enzyme. Catalase is the nearly universal enzyme that breaks down peroxides into oxygen and water. The catechism about anaerobes being catalase-negative is true for some anaerobes, but by no means all. In fact, if you go to UniProt.org and do a search on "catalase clostridium" you'll get a long list of hits representing anaerobes that do, in fact, have a perfectly normal catalase enzyme. Some anaerobes have the classic katG (E. coli) version of catalase while others (like C. difficile) have a special manganese-containing catalase. Pathogenic anaerobes probably use the enzyme to combat the respiratory burst that accompanies engulfment by white blood cells. I've written about anaerobic catalases before; for an overview of the enzymology and phylogenetic breakdown by enzyme type, start here.

Methanospirillum (a strict anaerobe found
in sewage) produces colonies with
characteristic striations spaced two
cell-lengths apart.

I thought it would be interesting to compare the katG genes of two anaerobes to see how they differ. I decided to focus on C. botulinum B (the nasty food-poisoning organism) and the archeon Methanospirillum hungatei strain JF-1, which is named after a professor I had in grad school, Bob Hungate, a true pioneer in the study of anaerobic bacteria.

C. botulinum and M. hungatei are worlds apart, taxonomically. The genome of the former has a G+C (guanine, cytosine) content of just 27%, while the latter has G+C of 45%. Comparing their catalases, I found 717 nucleotide differences between the two genes (which were 2193 bases long in one case and 2157 in the other case). On the surface, the genes seem (from a DNA sequence point of view) quite far apart phylogenetically. But after performing an alignment in Mega6, I was able to study the codons base by base and found that of 717 differences, 500 changes were synonoymous (resulting in no amino acid substitution) while only 217 changes were non-synonymous. Not unexpectedly, most of the synonymous changes were due to differences in the third codon base (where the genetic code is highly degenerate; see below).

Codon Usage in Clostridium botulinum B
Codon Usage: The Standard Code (transl_table=1)

CCA(P) 1.45% CCG(P) 0.10% CCT(P) 1.00% CCC(P) 0.06%	CGA(R) 0.08% CGG(R) 0.01% CGT(R) 0.18% CGC(R) 0.02%	CAA(Q) 2.01% CAG(Q) 0.25% CAT(H) 1.10% CAC(H) 0.16%	CTA(L) 0.77% CTG(L) 0.10% CTT(L) 1.67% CTC(L) 0.06%
GCA(A) 2.66% GCG(A) 0.20% GCT(A) 2.18% GCC(A) 0.19%	GGA(G) 3.30% GGG(G) 0.47% GGT(G) 2.08% GGC(G) 0.40%	GAA(E) 6.40% GAG(E) 1.11% GAT(D) 5.12% GAC(D) 0.62%	GTA(V) 2.79% GTG(V) 0.45% GTT(V) 2.81% GTC(V) 0.14%
ACA(T) 2.36% ACG(T) 0.16% ACT(T) 2.22% ACC(T) 0.19%	AGA(R) 2.43% AGG(R) 0.31% AGT(S) 1.89% AGC(S) 0.48%	AAA(K) 7.10% AAG(K) 2.17% AAT(N) 6.06% AAC(N) 0.89%	ATA(I) 5.68% ATG(M) 2.50% ATT(I) 4.01% ATC(I) 0.42%
TCA(S) 2.10% TCG(S) 0.12% TCT(S) 1.68% TCC(S) 0.12%	TGA(*) 0.02% TGG(W) 0.69% TGT(C) 1.02% TGC(C) 0.25%	TAA() 0.24% TAG() 0.07% TAT(Y) 3.61% TAC(Y) 0.55%	TTA(L) 5.68% TTG(L) 0.74% TTT(F) 3.74% TTC(F) 0.57%

Pink-highlighted codons are examples of where the genetic code is degenerate for base one.
A change of CGA to AGA (for example) results in no amino acid change: the
coded-for amino acid is arginine in either case.

"Degeneracy" means a change in a particular base results in no change of amino acid. An example is the codon AAA (see above), which is the code for lysine (K). Changing the last base to G (resulting in a codon of AAG) still produces lysine in the resulting protein. Codons that begin with 'AA' are two-fold degenerate, because two versions (AAA and AAG) produce the same amino acid. A codon that begins with 'CC' is said to be four-fold degenerate, because any of four different endings (CCA, CCT, CCG, CCC) produce the same amino acid (P, proline).

What's less obvious is that a change in the first (rather than the third) base of a codon can produce synonyms as well. For example, changing CGA to AGA results in the same amino acid: arginine (R). Likewise, CTA changing to TTA still results in leucine (L).

But there's more degeneracy to the first codon base than just the classic cases outlined in pink in the above table. A change of CTA to ATA produces a switch from leucine to isoleucine, which is tantamount to no change at all, because leucine and isoleucine are isomers (both have the same chemical formula; they differ slightly in arrangement).

Likewise, a change from CTx to GTx produces a change from leucine to valine, which are workalikes. One could even argue that a change from CTT or CTC to TTT or TTC (leucine to phenylalanine) is functionally a degenerate change, because leucine and phenylalanine are nonpolar, hydrophobic amino acids with similar chemical properties.

So in fact, many changes to the first base of a codon can be considered pseudodegenerate, if you will, by virtue of producing non-synonymous changes that are tantamount to very little actual chemical change. We would expect to see a significant number of such changes in mutations affecting base one of codons. And we do.

In the two catalase genes, I looked at changes involving just the first base of a codon and found the following changes:

Synonymous:
AGA<-->CGA (R:R)
AGA<-->CGA (R:R)
AGA<-->CGA (R:R)
TTA<-->CTA (L:L)
TTG<-->CTG (L:L)
TTG<-->CTG (L:L)
TTG<-->CTG (L:L)
TTG<-->CTG (L:L)

Non-Synonymous:
AAA<-->CAA (K:Q)
AAA<-->GAA (K:E)
AAA<-->GAA (K:E)
AAG<-->CAG (K:Q)
AAG<-->CAG (K:Q)
AAT<-->CAT (N:H)
AAT<-->GAT (N:D)
AAT<-->GAT (N:D)
ATT<-->CTT (I:L) *
ATT<-->GTT (I:V) *
CAT<-->GAT (H:D)
CTG<-->ATG (L:M)
CTT<-->ATT (L:I) *
GCA<-->TCA (A:S)
TAT<-->CAT (Y:H)
TCA<-->GCA (S:A)
TCT<-->CCT (S:P)
TTT<-->CTT (F:L) *

Notice that 4 out of 18 non-synonymous changes (see asterisks) fall in the category of pseudodegeneracies. That's slightly higher than the 17% we would expect by chance. So instead of 8 out of 26 first-base changes being degenerate (in the catalase genes), more like 12 out of 26 are either synonymous or near-synonymous. Bottom line, roughly half of first-base mutations are effectively synonymous, at least in this example.

Sunday, April 06, 2014

A binary signal in the second codon base

In looking at base composition statistics of codons, an amazing fact jumps out.

If you look carefully at the following graph, you can see that the cloud of gold-colored data points (representing the compositional stats for the second base in codons of Clostridium botulinum) has a second, "breakaway" cloud underneath it. (See arrow.)

Codon base composition statistics for Clostridium botulinum. Notice the breakaway cloud of gold points under the main cloud (arrow). These points represent genes in which most codons have a pyrimidine in the middle base of each codon.

To review: I made this plot by going through each DNA sequence for each coding ("CDS") gene of C. botulinum, and for each gene, I went through all codons and calculated the average purine content (as well as the average G+C content) of the bases at a positions one, two, and three in the codons. Thus, every dot represents the stats for one gene's worth of data.

After looking at graphs of this sort, three key facts about codon bases leap out:

Most codons, for most genes, have a purine as the first base (notice how the red cloud of points is higher than the others, centering on y=0.7).
The third base (often called the "wobble" base; shown blue here) has the most extreme G+C value. (This is well known.)
The middle base falls in one of two positions (high or low) on the purine (y) axis. There's a primary cloud of data points and a secondary cloud in a distinct region below the main cloud. The secondary cloud of gold points is centered at about 0.3 on the y-axis, meaning these are genes in which the second codon base tends to be a pyrimidine.

The question is: What does it mean when you look at a gene with 200 or 300 or 400 codons, and the majority of codons have a pyrimidine in the second base?

If you examine the standard codon translation table (below), you can see that codons with a pyrimidine in the second position (represented by the first two columns of the table) code primarily for nonpolar amino acids. When a pyrimidine is in the second base, the possible amino acids are phenylalanine, serine, leucine, proline, isoleucine, methionine, threonine, valine, and alanine. Of these, all but serine and threonine are nonpolar. Therefore, a pyrimidine in position two of a codon means there's at least a 75% chance that the amino acid will have a nonpolar, hydrophobic side group.

Virtually all proteins contain some nonpolar amino acids, but when a protein contains mostly nonpolar amino acids, that protein is destined to wind up in the cell membrane (which is largely made up of lipids). Thus, we can expect to find that genes in the "breakaway" cloud of gold points in the graph further above represent membrane-associated proteins.

To check this hypothesis, I wrote a script that harvested the gene names, and protein-product names, of all the "breakaway cloud" data points. After purging genes annotated (unhelpfully) as "hypothetical protein," I was left with 37 "known" genes. They're shown in the following table.

Gene	Product
CLC_0571	arsenical pump family protein
CLC_1058	L-lactate permease
CLC_1550	carbohydrate ABC transporter permease
CLC_3115	xanthine/uracil permease family protein
CLC_0813	arsenical-resistance protein
CLC_1018	putative anion ABC transporter, permease protein
CLC_1687	xanthine/uracil permease family protein
CLC_3633	sporulation integral membrane protein YtvI
CLC_2382	phosphate ABC transporter, permease protein PstA
CLC_0189	ZIP transporter family protein
CLC_3351	sodium:dicarboxylate symporter family protein
CLC_0971	cobalt transport protein CbiM
CLC_1534	methionine ABC transporter permease
CLC_2798	xanthine/uracil permease family protein
CLC_1397	manganese/zinc/iron chelate ABC transporter permease
CLC_1836	stage III sporulation protein AD
CLC_0528	high-affinity branched-chain amino acid ABC transporter, permease protein
CLC_0430	electron transport complex, RnfABCDGE type, A subunit
CLC_2523	flagellar biosynthesis protein FliP
CLC_0401	amino acid permease family protein
CLC_0383	lrgB-like family protein
CLC_0457	chromate transporter protein
CLC_0291	sodium:dicarboxylate symporter family protein
CLC_0427	electron transport complex, RnfABCDGE type, D subunit
CLC_1281	putative transcriptional regulator
CLC_2008	ABC transporter, permease protein
CLC_0868	branched-chain amino acid transport system II carrier protein
CLC_1237	monovalent cation:proton antiporter-2 (CPA2) family protein
CLC_1137	methionine ABC transporter permease
CLC_0764	putative drug resistance ABC-2 type transporter, permease protein
CLC_1953	xanthine/uracil permease family protein
CLC_2444	auxin efflux carrier family protein
CLC_0897	putative ABC transporter, permease protein
CLC_1555	C4-dicarboxylate transporter/malic acid transport protein
CLC_0374	xanthine/uracil permease family protein
CLC_0470	undecaprenyl pyrophosphate phosphatase
CLC_2648	monovalent cation:proton antiporter-2 (CPA2) family protein

Notice that with the exception of CLC_1281, a "putative transcriptional regulator," every gene product represents a membrane-associated protein: transporters, carrier proteins, permeases, etc.

I ran the same experiment on genes from Streptomyces griseus (strain XylbKG-1) and came up with 222 genes having high pyrimidine content in base two. All 222 genes specify membrane-associated proteins. (The full list is in a table below.)

The bottom line: Base two of codons acts as a binary switch. If the base is a pyrimidine, the associated amino acid will most likely (75% chance) be nonpolar. If the base is a purine, the codon will either be a stop codon (3 out of 32 codons) or the amino acid will be polar (26 out of 29 codons).

Here's the list of 222 genes from S. griseus in which the middle codon base is predominantly a pyrimidine:

SACT1_0608	ABC-type transporter, integral membrane subunit
SACT1_3730	major facilitator superfamily MFS_1
SACT1_4066	cation efflux protein
SACT1_5911	ABC-2 type transporter
SACT1_6577	SNARE associated protein
SACT1_6966	ABC-type transporter, integral membrane subunit
SACT1_7160	major facilitator superfamily MFS_1
SACT1_3151	NADH-ubiquinone/plastoquinone oxidoreductase chain 6
SACT1_3682	drug resistance transporter, EmrB/QacA subfamily
SACT1_5431	Citrate transporter
SACT1_3199	proton-translocating NADH-quinone oxidoreductase, chain M
SACT1_7301	putative integral membrane protein
SACT1_3198	NAD(P)H-quinone oxidoreductase subunit 2
SACT1_2008	arsenical-resistance protein
SACT1_3149	proton-translocating NADH-quinone oxidoreductase, chain L
SACT1_3967	MATE efflux family protein
SACT1_3148	proton-translocating NADH-quinone oxidoreductase, chain M
SACT1_5571	major facilitator superfamily MFS_1
SACT1_0651	major facilitator superfamily MFS_1
SACT1_2669	ABC-type transporter, integral membrane subunit
SACT1_1805	NADH dehydrogenase (quinone)
SACT1_3147	NAD(P)H-quinone oxidoreductase subunit 2
SACT1_6961	major facilitator superfamily MFS_1
SACT1_0992	ABC-2 type transporter
SACT1_2619	major facilitator superfamily MFS_1
SACT1_0507	major facilitator superfamily MFS_1
SACT1_0649	ABC-type transporter, integral membrane subunit
SACT1_0800	glycosyl transferase family 4
SACT1_1659	ABC-type transporter, integral membrane subunit
SACT1_1803	multiple resistance and pH regulation protein F
SACT1_4190	putative ABC transporter permease protein
SACT1_5522	drug resistance transporter, EmrB/QacA subfamily
SACT1_5568	drug resistance transporter, EmrB/QacA subfamily
SACT1_7248	Lysine exporter protein (LYSE/YGGA)
SACT1_0266	ABC-2 type transporter
SACT1_0847	Na+/solute symporter
SACT1_4378	ABC-type transporter, integral membrane subunit
SACT1_6766	ABC-type transporter, integral membrane subunit
SACT1_2522	putative integral membrane protein
SACT1_4762	amino acid permease-associated region
SACT1_4901	major facilitator superfamily MFS_1
SACT1_2616	multiple antibiotic resistance (MarC)-related protein
SACT1_3961	major facilitator superfamily MFS_1
SACT1_6236	MIP family channel protein
SACT1_1319	protein of unknown function UPF0016
SACT1_2332	copper resistance D domain protein
SACT1_5327	ABC-type transporter, integral membrane subunit
SACT1_5759	ABC-2 type transporter
SACT1_1133	ABC-type transporter, integral membrane subunit
SACT1_1562	ABC-type transporter, integral membrane subunit
SACT1_5518	major facilitator superfamily MFS_1
SACT1_3430	ABC-type transporter, integral membrane subunit
SACT1_4517	major facilitator superfamily MFS_1
SACT1_4565	drug resistance transporter, EmrB/QacA subfamily
SACT1_4994	ABC-type transporter, integral membrane subunit
SACT1_7197	major facilitator superfamily MFS_1
SACT1_5949	major facilitator superfamily MFS_1
SACT1_6233	protein of unknown function DUF6 transmembrane
SACT1_0936	ABC-type transporter, integral membrane subunit
SACT1_4993	2-aminoethylphosphonate ABC transporter, permease protein
SACT1_6954	ABC-type transporter, integral membrane subunit
SACT1_1846	Lysine exporter protein (LYSE/YGGA)
SACT1_3429	ABC-type transporter, integral membrane subunit
SACT1_3957	membrane protein of unknown function
SACT1_4612	ABC-2 type transporter
SACT1_1998	polar amino acid ABC transporter, inner membrane subunit
SACT1_2093	ABC-type transporter, integral membrane subunit
SACT1_4420	ABC-2 type transporter
SACT1_6613	putative ABC transporter permease protein
SACT1_2564	small multidrug resistance protein
SACT1_3669	major facilitator superfamily MFS_1
SACT1_4186	major facilitator superfamily MFS_1
SACT1_4850	ABC-type transporter, integral membrane subunit
SACT1_0206	major facilitator superfamily MFS_1
SACT1_5418	Lysine exporter protein (LYSE/YGGA)
SACT1_0548	ABC-type transporter, integral membrane subunit
SACT1_3332	protein of unknown function DUF6 transmembrane
SACT1_3764	sodium/hydrogen exchanger
SACT1_6278	protein of unknown function DUF6 transmembrane
SACT1_4143	ABC-2 type transporter
SACT1_3232	ABC-type transporter, integral membrane subunit
SACT1_0256	ABC-type transporter, integral membrane subunit
SACT1_2898	major facilitator superfamily MFS_1
SACT1_6510	protein of unknown function DUF6 transmembrane
SACT1_0980	CrcB-like protein
SACT1_1650	ABC-type transporter, integral membrane subunit
SACT1_4658	acyltransferase 3
SACT1_0306	protein of unknown function DUF6 transmembrane
SACT1_2372	major facilitator superfamily MFS_1
SACT1_7238	ABC-type transporter, integral membrane subunit
SACT1_0202	major facilitator superfamily MFS_1
SACT1_0591	ABC-type transporter, integral membrane subunit
SACT1_5369	protein of unknown function DUF81
SACT1_6227	C4-dicarboxylate transporter/malic acid transport protein
SACT1_6755	ABC-type transporter, integral membrane subunit
SACT1_0978	Urea transporter
SACT1_2418	ATP synthase subunit a
SACT1_5604	major facilitator superfamily MFS_1
SACT1_4891	ABC-type transporter, integral membrane subunit
SACT1_6507	protein of unknown function DUF6 transmembrane
SACT1_6754	ABC-type transporter, integral membrane subunit
SACT1_0787	ABC-2 type transporter
SACT1_2848	sodium/hydrogen exchanger
SACT1_0636	ABC-type transporter, integral membrane subunit
SACT1_1891	putative integral membrane protein
SACT1_1552	xanthine permease
SACT1_2894	putative secreted protein
SACT1_4508	sodium:dicarboxylate symporter
SACT1_7091	drug resistance transporter, EmrB/QacA subfamily
SACT1_2652	major facilitator superfamily MFS_1
SACT1_1741	amino acid permease-associated region
SACT1_1838	ABC-type transporter, integral membrane subunit
SACT1_2796	gluconate transporter
SACT1_5220	sodium/hydrogen exchanger
SACT1_6991	ABC-type transporter, integral membrane subunit
SACT1_3273	protein of unknown function DUF894 DitE
SACT1_7089	ABC-type transporter, integral membrane subunit
SACT1_7280	major facilitator superfamily MFS_1
SACT1_3467	major facilitator superfamily MFS_1
SACT1_1304	ABC-type transporter, integral membrane subunit
SACT1_6032	protein of unknown function DUF81
SACT1_4312	major facilitator superfamily MFS_1
SACT1_0876	ABC-type transporter, integral membrane subunit
SACT1_6123	citrate/H+ symporter, CitMHS family
SACT1_4359	Cl- channel voltage-gated family protein
SACT1_7325	branched-chain amino acid transport
SACT1_1160	protein of unknown function DUF140
SACT1_2265	Arsenical pump membrane protein
SACT1_3512	ABC-2 type transporter
SACT1_1018	major facilitator superfamily MFS_1
SACT1_3415	Xanthine/uracil/vitamin C permease
SACT1_5214	BioY protein
SACT1_3656	small multidrug resistance protein
SACT1_3895	SpdD2 protein
SACT1_4929	ABC-type transporter, integral membrane subunit
SACT1_3029	major facilitator superfamily MFS_1
SACT1_6312	ABC-type transporter, integral membrane subunit
SACT1_0919	L-lactate transport
SACT1_4356	ABC-2 type transporter
SACT1_0532	ABC-type transporter, integral membrane subunit
SACT1_6693	secretion protein snm4
SACT1_0967	ABC-type transporter, integral membrane subunit
SACT1_6496	major facilitator superfamily MFS_1
SACT1_6983	major facilitator superfamily permease
SACT1_0917	NADH-ubiquinone/plastoquinone oxidoreductase chain 3
SACT1_6887	major facilitator superfamily MFS_1
SACT1_2835	major facilitator superfamily MFS_1
SACT1_5544	drug resistance transporter, Bcr/CflA subfamily
SACT1_5591	ABC-type transporter, integral membrane subunit
SACT1_1201	ABC-type transporter, integral membrane subunit
SACT1_2404	ABC-2 type transporter
SACT1_0870	protein of unknown function DUF803
SACT1_6933	ABC-type transporter, integral membrane subunit
SACT1_1776	ABC-type transporter, integral membrane subunit
SACT1_3213	major facilitator superfamily MFS_1
SACT1_4210	phosphate ABC transporter, inner membrane subunit PstC
SACT1_5398	protein of unknown function DUF81
SACT1_0914	NADH-ubiquinone/plastoquinone oxidoreductase chain 6
SACT1_3261	major facilitator superfamily MFS_1
SACT1_6932	ABC-type transporter, integral membrane subunit
SACT1_0669	major facilitator superfamily MFS_1
SACT1_4255	ABC-2 type transporter
SACT1_4541	major facilitator superfamily MFS_1
SACT1_4638	major facilitator superfamily MFS_1
SACT1_0913	NADH-ubiquinone oxidoreductase chain 4L
SACT1_4443	protein of unknown function DUF81
SACT1_5396	Xanthine/uracil/vitamin C permease
SACT1_0912	NADH dehydrogenase (quinone)
SACT1_2924	Lysine exporter protein (LYSE/YGGA)
SACT1_5922	putative integral membrane protein
SACT1_1243	Bile acid:sodium symporter
SACT1_6967	ABC-type transporter, integral membrane subunit
SACT1_0911	proton-translocating NADH-quinone oxidoreductase, chain M
SACT1_3931	putative ABC transporter permease protein
SACT1_2820	ABC-2 type transporter
SACT1_3298	putative integral membrane transport protein
SACT1_4871	major facilitator superfamily MFS_1
SACT1_5873	major facilitator superfamily MFS_1
SACT1_6636	putative integral membrane protein
SACT1_0905	AbgT transporter
SACT1_5532	2-dehydro-3-deoxyphosphogluconate aldolase/4-hydroxy-2-oxoglutarate aldolase
SACT1_0910	proton-translocating NADH-quinone oxidoreductase, chain N
SACT1_2115	ABC-type transporter, integral membrane subunit
SACT1_4635	protein of unknown function DUF6 transmembrane
SACT1_5777	major facilitator superfamily MFS_1
SACT1_3979	major facilitator superfamily MFS_1
SACT1_5536	major facilitator superfamily MFS_1
SACT1_6782	major facilitator superfamily MFS_1
SACT1_0616	virulence factor MVIN family protein
SACT1_4869	ABC-type transporter, integral membrane subunit
SACT1_1581	putative integral membrane protein
SACT1_4585	major facilitator superfamily MFS_1
SACT1_6536	small multidrug resistance protein
SACT1_4024	cell cycle protein
SACT1_5296	major facilitator superfamily MFS_1
SACT1_1865	major facilitator superfamily MFS_1
SACT1_4868	ABC-type transporter, integral membrane subunit
SACT1_0955	protein of unknown function DUF803
SACT1_4296	major facilitator superfamily MFS_1
SACT1_5104	major facilitator superfamily MFS_1
SACT1_0519	Bile acid:sodium symporter
SACT1_2394	putative ABC transporter permease protein
SACT1_0661	major facilitator superfamily MFS_1
SACT1_2062	major facilitator superfamily MFS_1
SACT1_4295	ABC-type transporter, integral membrane subunit
SACT1_6828	peptidase M48 Ste24p
SACT1_3446	major facilitator superfamily MFS_1
SACT1_6631	Lysine exporter protein (LYSE/YGGA)
SACT1_1048	ABC-type transporter, integral membrane subunit
SACT1_1528	protein of unknown function DUF6 transmembrane
SACT1_2016	branched-chain amino acid transport
SACT1_6154	gluconate transporter
SACT1_5051	major facilitator superfamily MFS_1
SACT1_5531	protein of unknown function DUF81
SACT1_6480	protein of unknown function DUF1290
SACT1_0373	ABC-type transporter, integral membrane subunit
SACT1_2392	putative ABC transporter permease protein
SACT1_2724	protein of unknown function DUF107
SACT1_7257	protein of unknown function UPF0118
SACT1_2772	putative integral membrane protein
SACT1_3201	NAD(P)H-quinone oxidoreductase subunit 4L
SACT1_7066	ABC-type transporter, integral membrane subunit

Some of these genes are labeled "protein of unknown function," but I think we can predict with high confidence, based on what we know about these proteins (namely, that they're hydrophobic) that the gene products in question involve membrane-associated functions.

Bioinformatics geeks, leave a comment below.

Thursday, August 15, 2013

Converting an SVG Graph to Histograms

The graphs you get from ZunZun.com (the free graphing service) are pretty neat, but one shortcoming of ZunZun is that it won't generate histograms. (Google Charts will do histograms, but unlike ZunZun, Google won't give you SVG output.) The answer? Convert a ZunZun graph to histograms yourself. It's only SVG, after all. It's XML; it's text. You just need to edit it.

Of course, nobody wants to hand-edit a zillion <use> elements (to convert data points to histogram rects). It makes more sense to do the job programmatically, with a little JavaScript.

In my case, I had a graph of dinucleotide frequencies for Clostridium botulinum coding regions. What that means is, I tallied the frequency of occurrence (in every protein-coding gene) of 5'-CpG-3', CpC, CpA, CpT, ApG, ApA, ApC, and all other dinucleotide combinations (16 in all). Since I already knew the frequency of G (by itself), A, C, and T, it was an easy matter to calculate the expected frequency of occurrence of each dinucleotide pair. (For example, A occurs with frequency 0.403, whereas G occurs with frequency 0.183. Therefore the expected frequency of occurrence of the sequence AG is 0.403 times 0.183, or 0.0738.) Bottom line, I had 16 expected frequencies and 16 actual frequencies, for 16 dinucleotide combos. I wanted side-by-side histograms of the frequencies.

First, I went to ZunZun and entered my raw data in the ZunZun form. Just so you know, this is what the raw data looked like:

0 0.16222793723642806
1 0.11352236777965981
2 0.07364933857345456
3 0.08166221769088752
4 0.123186555838253
5 0.12107590293804558
6 0.043711462078314355
7 0.03558766171971166
8 0.07364933857345456
9 0.07262685957145093
10 0.033435825941632816
11 0.03459042802303202
12 0.055925067612781175
13 0.042792101322514244
14 0.019844425842971265
15 0.02730405457750352
16 0.123186555838253
17 0.12232085101526233
18 0.055925067612781175
19 0.05502001002972254
20 0.09354077847378013
21 0.07321410524577443
22 0.03319196776961071
23 0.028600012050969865
24 0.043711462078314355
25 0.043328337600588136
26 0.019844425842971265
27 0.0062116692282947845
28 0.03319196776961071
29 0.04195172151930211
30 0.011777822917388797
31 0.015269662767317132

I made ZunZun graph the data, and it gave me back a graph that looked like this:

Which is fine except it's not a histogram plot. And it has goofy numbers on the x-axis.

I clicked the SVG link under the graph and saved an SVG copy to my local drive, then opened the file in Wordpad.

The first thing I did was locate my data points. That's easy: ZunZun plots points as a series of <use> elements. The elements are nested under a <g> element that looks like this:

<g clip-path="url(#p0c8061f7fd)">

I hand-edited this element to have an id attribute with value "DATA":

<g id="DATA" clip-path="url(#p0c8061f7fd)">

Next, I scrolled up to the very top of the file and found the first <defs> tag. Under it, I placed the following empty code block:

<script type="text/ecmascript"><![CDATA[
// code goes here

]]></script>

Then I went to work writing code (to go inside the above block) that would find the <use> elements, get their x,y values, and create <rect> elements of a height that would extend to the x-axis line.

The code I came up with looks like this:

// What is the SVG y-value of the x-axis?

// Attempt to discover by introspecting clipPath

function findGraphVerticalExtent( ) { 

   var cp = document.getElementsByTagName('clipPath')[0];

   var rect = cp.childNodes[1];

   var top = rect.getAttribute('y') * 1;

   var bottom = rect.getAttribute('height') * 1;

   return top + bottom; 

}

// This is for use with SVG graphs produced by ZunZun,

// in which data points are described in a series of

// <use> elements. We need to get the list of <use>

// nodes, convert it to a JS array, sort data points by

// x-value, and replace <use> with <rect> elements.

function changeToHistograms( ) {

   var GRAPH_VERTICAL_EXTENT = findGraphVerticalExtent( );

   // The 'g' element that encloses the 'use' elements

   // needs to have an id of "DATA" for this to work!

   // Manually edit the <g> node's id first! 

   var data = document.getElementById( "DATA" );

   // NOTE: The following line gets a NodeList object,

   // which is NOT the same as a JavaScript array!

   var nodes = data.getElementsByTagName( "use" );

   // utility routine (an inner method)

   function nodeListToJavaScriptArray( nodes ) {

       var results = [];

       for (var i = 0; i < nodes.length; i++)

          results.push( nodes[i] );

       return results;

   }

   // utility routine (another inner method)

   function compareX( a,b ) { 

       return a.getAttribute("x") * 1 - b.getAttribute("x") * 1; 

   }

   var use = nodeListToJavaScriptArray( nodes );

   // We want the nodes in x-sorted order

   use.sort( compareX );  // presto, done

   // Main loop

   for (var i = 0; i < use.length; i++) {

       var rect = 

           document.createElementNS("http://www.w3.org/2000/svg", "rect");

       var item = use[i];

       var x = item.getAttribute( "x" ) * 1;

       var y = item.getAttribute( "y" ) * 1;

       var rectWidth = 8;

       var rectHeight = GRAPH_VERTICAL_EXTENT - y;

       rect.setAttribute( "width", rectWidth.toString() );

       rect.setAttribute( "height", rectHeight.toString() );   

       rect.setAttribute( "x" , x.toString() );

       rect.setAttribute( "y" , y.toString() );

       // We will alternate colors, pink/purple

       rect.setAttribute( "style" , 

           (i%2==0)? "fill:ce8877;stroke:none" : "fill:8877dd;stroke:none" );

       data.appendChild( rect ); // add a new rect

       item.remove();  // delete the old <use> element

   }

   return use;

}

As so often happens, I ended up writing more code than I thought it would take. The above code works fine for converting data points to histogram bars (as long as you remember to give that <g> element the id attribute of "DATA" as mentioned earlier). But you need to trigger the code somehow. Answer: insert onload="changeToHistograms( )" in the <svg> element at the very top of the file.

But I wasn't done, because I also wanted to apply data labels to the histogram bars (labels like "CG," "AG," "CC," etc.) and get rid of the goofy numbers on the x-axis.

This is the function I came up with to apply the labels:

   function applyLabels( sortedNodes ) {
 
    var labels = ["aa", "ag", "at", "ac", 
      "ga", "gg", "gt", "gc", "ta", "tg", 
      "tt", "tc", "ca", "cg", "ct", "cc"];

      var data = document.getElementById( "DATA" );
 var labelIndex = 0;

 for (var i = 0; i < sortedNodes.length; i+=2) {
     var text = 
              document.createElementNS("http://www.w3.org/2000/svg", "text");
     var node = sortedNodes[i];
          text.setAttribute( "x", String( node.getAttribute("x")*1 +2) );
          text.setAttribute( "y", String( node.getAttribute("y")*1 - 13 ) );
          text.setAttribute( "style", "font-size:9pt" );
          text.textContent = labels[ labelIndex++ ].toUpperCase();
          text.setAttribute( "id", "label_" + labelIndex );
          data.appendChild( text );
      }
   }

And here's a utility function that can strip numbers off the x-axis:

   // Optional. Call this to remove ZunZun graph labels.
   // pass [1,2,3,4,5,6,7,8,9] to remove x-axis labels
   function removeZunZunLabels( indexes ) {
 
 for (var i = 0;i < indexes.length;i++) 
    try {
   document.getElementById("text_"+indexes[i]).remove();
   }
  catch(e) { console.log("Index " + i + " not found; skipped.");
   }
   }

BTW, if you're wondering why I multiply so many things by one, it's because the attribute values that comprise x and y values in SVG are String objects. If you add them, you're concatenating strings, which is not what you want. To convert a number in string form to an actual JavaScript number (so you can add numbers and not concatenate strings), you can either multiply by one or explicitly coerce a string to a number by doing Number( x ).

The final result of all this looks like:

Final graph after surgery. Expected (pink) and actual (purple) frequencies of occurrence of various dinucleotide sequences in C. botulinum coding-region DNA.

Which is approximately what I wanted to see. The labels could be positioned better, but you get the idea.

What does the graph show? Well first of all, you have to realize that the DNA of C. botulinum is extremely rich in adenine and thymine (A and T): Those two bases constitute 72% of the DNA. Therefore it's absolutely no surprise that the highest bars are those that contain A and/or T. What's perhaps interesting is that the most abundant base (A), which should form 'AA' sequences at a high rate, doesn't. (Compare the first bar on the left to the shorter purple bar beside it.) This is especially surprising when you consider that AAA, GAA, and AAT are by far the most-used codons in C. botulinum. In other words, 'AA' occurs a lot, in codons. But even so, it doesn't occur as much as one would expect.

It's also interesting to compare GC with CG. (Non-biologists, note that these two pairs are not equivalent, because DNA has a built-in reading direction. The notation GC, or equivalently, GpC, means there's a guanine sitting on the 5' side of cytosine. The notation CG means there's a guanine on the 3' side of cytosine. The 5' and 3' numbers refer to deoxyribose carbon positions.) The GC combo occurs more often than predicted by chance whereas the combination CG (or CpG, as it's also written) occurs much less frequently than predicted by chance. The reasons for this are fairly technical. Suffice it to say, it's a good prima facie indicator that C. botulinum DNA is heavily methylated. Which in fact it is.

Sunday, July 14, 2013

DNA Strand Asymmetry: More Surprises

The surprises just keep coming. When you start doing comparative genomics on the desktop (which is so easy with all the great tools at genomevolution.org and elsewhere), it's amazing how quickly you run into things that make you slap yourself on the side of the head and go "Whaaaa????"

If you know anything about DNA (or even if you don't), this one will set you back.

I've written before about Chargaff's second parity rule, which (peculiarly) states that A = T and G = C not just for double-stranded DNA (that's the first parity rule) but for bases in a single strand of DNA. The first parity rule is basic: It's what allows one strand of DNA to be complementary to another. The second parity rule is not so intuitive. Why should the amount of adenine have to equal the amount of thymine (or guanine equal cytosine) in a single strand of DNA? The conventional argument is that nature doesn't play favorites with purines and pyrimidines. There's no reason (in theory) why a single strand of DNA should have an excess of purines over pyrimidines or vice versa, all things being equal.

But it turns out, strand asymmetry vis-a-vis purines and pyrimidines is not only not uncommon, it's the rule. (Some call it Szybalski's rule, in fact.) You can prove it to yourself very easily. If you obtain a codon usage chart for a particular organism, then add the frequencies of occurrence of each base in each codon, you can get the relative abundances of the four bases (A, G, T, C) for the coding regions on which the codon chart was based. Let's take a simple example that requires no calculation: Clostridium botulinum. Just by eyeballing the chart below, you can quickly see that (for C. botulinum) codons using purines A and G are way-more-often used than codons containing pyrimidines T and C. (Note the green-highlighted codons.)

If you do the math, you'll find that in C. botulinum, G and A (combined) outnumber T and C by a factor of 1.41. That's a pretty extreme purine:pyrimidine ratio. (Remember that we're dealing with a single strand of DNA here. Codon frequencies are derived from the so-called "message strand" of DNA in coding regions.)

I've done this calculation for 1,373 different bacterial species (don't worry, it's all automated), and the bottom line is, the greater the DNA's A+T content (or, equivalently, the less its G+C content), the greater the purine imbalance. (See this post for a nice graph.)

If you inspect enough codon charts you'll quickly realize that Chargaff's second parity rule never holds true (except now and then by chance). It's a bogus rule, at least in coding regions (DNA that actually gets transcribed in vivo). It may have applicability to pseudogenes or "junk DNA" (but then again, I haven't checked; it may well not apply there either).

If Chargaff's second rule were true, we would expect to find that G = C (and A = T), because that's what the rule says. I went through the codon frequency data for 1,373 different bacterial species and then plotted the ratio of G to C (which Chargaff says should equal 1.0) for each species against the A+T content (which is a kind of phylogenetic signature) for each species. I was shocked by what I found:

Using base abundances derived from codon frequency data, I calculated G/C for 1,373 bacterial species and plotted it against total A+T content. (Each dot represents a genome for a particular organism.) Chargaff's second parity rule predicts a horizontal line at y=1.0. Clearly, that rule doesn't hold.

I wasn't so much shocked by the fact that Chargaff's rule doesn't hold; I already knew that. What's shocking is that the ratio of G to C goes up as A+T increases, which means G/C is going up even as G+C is going down. (By definition, G+C goes down as A+T goes up.)

Chargaff says G/C should always equal 1.0. In reality, it never does except by chance. What we find is, the less G (or C) the DNA has, the greater the ratio of G to C. To put it differently: At the high-AT end of the phylogenetic scale, cytosine is decreasing faster (much faster) than guanine, as overall G+C content goes down.

When I first plotted this graph, I used a linear regression to get a line that minimizes the sum of squared absolute error. That line turned out to be given by 0.638 + [A+T]. Then I saw that the data looked exponential, not linear. So I refitted the data with a power curve (the red curve shown above) given by

G/C = 1.0 + 0.587*[A+T] + 1.618*[A+T]²

which fit the data even better (minimum summed error 0.1119 instead of 0.1197). What struck me as strange is that the Golden Ratio (1.618) shows up in the power-curve formula (above), but also, the linear form of the regression has G/C equaliing 1.638 when [A+T] goes to 1.0. Which is almost the Golden Ratio.

In a previous post, I mentioned finding that the ratio A/T tends to approximate the Golden Ratio as A+T approaches 1.0. If this were to hold true, it could mean that A/T and G/C both approach the Golden Ratio as A+T approaches 1.0, which would be weird indeed.

For now, I'm not going to make the claim that the Golden Ratio figures into any of this, because it reeks too much of numerology and Intelligent Design (and I'm a fan of neither). I do think it's mildly interesting that A/T and G/C both approach a similar number as A+T approaches unity.

Comments, as usual, are welcome.