Showing posts with label ZunZun. Show all posts
Showing posts with label ZunZun. Show all posts

Thursday, August 15, 2013

Converting an SVG Graph to Histograms

The graphs you get from ZunZun.com (the free graphing service) are pretty neat, but one shortcoming of ZunZun is that it won't generate histograms. (Google Charts will do histograms, but unlike ZunZun, Google won't give you SVG output.) The answer? Convert a ZunZun graph to histograms yourself. It's only SVG, after all. It's XML; it's text. You just need to edit it.

Of course, nobody wants to hand-edit a zillion <use> elements (to convert data points to histogram rects). It makes more sense to do the job programmatically, with a little JavaScript.

In my case, I had a graph of dinucleotide frequencies for Clostridium botulinum coding regions. What that means is, I tallied the frequency of occurrence (in every protein-coding gene) of 5'-CpG-3', CpC, CpA, CpT, ApG, ApA, ApC, and all other dinucleotide combinations (16 in all). Since I already knew the frequency of G (by itself), A, C, and T, it was an easy matter to calculate the expected frequency of occurrence of each dinucleotide pair. (For example, A occurs with frequency 0.403, whereas G occurs with frequency 0.183. Therefore the expected frequency of occurrence of the sequence AG is 0.403 times 0.183, or 0.0738.) Bottom line, I had 16 expected frequencies and 16 actual frequencies, for 16 dinucleotide combos. I wanted side-by-side histograms of the frequencies.

First, I went to ZunZun and entered my raw data in the ZunZun form. Just so you know, this is what the raw data looked like:

0 0.16222793723642806
1 0.11352236777965981
2 0.07364933857345456
3 0.08166221769088752
4 0.123186555838253
5 0.12107590293804558
6 0.043711462078314355
7 0.03558766171971166
8 0.07364933857345456
9 0.07262685957145093
10 0.033435825941632816
11 0.03459042802303202
12 0.055925067612781175
13 0.042792101322514244
14 0.019844425842971265
15 0.02730405457750352
16 0.123186555838253
17 0.12232085101526233
18 0.055925067612781175
19 0.05502001002972254
20 0.09354077847378013
21 0.07321410524577443
22 0.03319196776961071
23 0.028600012050969865
24 0.043711462078314355
25 0.043328337600588136
26 0.019844425842971265
27 0.0062116692282947845
28 0.03319196776961071
29 0.04195172151930211
30 0.011777822917388797
31 0.015269662767317132


I made ZunZun graph the data, and it gave me back a graph that looked like this:



Which is fine except it's not a histogram plot. And it has goofy numbers on the x-axis.

I clicked the SVG link under the graph and saved an SVG copy to my local drive, then opened the file in Wordpad.

The first thing I did was locate my data points. That's easy: ZunZun plots points as a series of <use> elements. The elements are nested under a <g> element that looks like this:

<g clip-path="url(#p0c8061f7fd)">

I hand-edited this element to have an id attribute with value "DATA":

<g id="DATA" clip-path="url(#p0c8061f7fd)">

Next, I scrolled up to the very top of the file and found the first <defs> tag. Under it, I placed the following empty code block:

<script type="text/ecmascript"><![CDATA[
// code goes here

]]></script>

Then I went to work writing code (to go inside the above block) that would find the <use> elements, get their x,y values, and create <rect> elements of a height that would extend to the x-axis line.

The code I came up with looks like this:



// What is the SVG y-value of the x-axis?
// Attempt to discover by introspecting clipPath

function findGraphVerticalExtent( ) {
   var cp = document.getElementsByTagName('clipPath')[0];
   var rect = cp.childNodes[1];
   var top = rect.getAttribute('y') * 1;
   var bottom = rect.getAttribute('height') * 1;
   return top + bottom;
}


// This is for use with SVG graphs produced by ZunZun,
// in which data points are described in a series of
// <use> elements. We need to get the list of <use>
// nodes, convert it to a JS array, sort data points by
// x-value, and replace <use> with <rect> elements.

function changeToHistograms( ) {

   var GRAPH_VERTICAL_EXTENT = findGraphVerticalExtent( );

   // The 'g' element that encloses the 'use' elements
   // needs to have an id of "DATA" for this to work!
   // Manually edit the <g> node's id first!
   var data = document.getElementById( "DATA" );

   // NOTE: The following line gets a NodeList object,
   // which is NOT the same as a JavaScript array!
   var nodes = data.getElementsByTagName( "use" );

   // utility routine (an inner method)
   function nodeListToJavaScriptArray( nodes ) {

       var results = [];

       for (var i = 0; i < nodes.length; i++)
          results.push( nodes[i] );

       return results;
   }

   // utility routine (another inner method)
   function compareX( a,b ) {
       return a.getAttribute("x") * 1 - b.getAttribute("x") * 1;
   }

   var use = nodeListToJavaScriptArray( nodes );

   // We want the nodes in x-sorted order
   use.sort( compareX ); // presto, done

   // Main loop
   for (var i = 0; i < use.length; i++) {

       var rect =
           document.createElementNS("http://www.w3.org/2000/svg", "rect");
       var item = use[i];
       var x = item.getAttribute( "x" ) * 1;
       var y = item.getAttribute( "y" ) * 1;
       var rectWidth = 8;
       var rectHeight = GRAPH_VERTICAL_EXTENT - y;
       rect.setAttribute( "width", rectWidth.toString() );
       rect.setAttribute( "height", rectHeight.toString() );
       rect.setAttribute( "x" , x.toString() );
       rect.setAttribute( "y" , y.toString() );

       // We will alternate colors, pink/purple
       rect.setAttribute( "style" ,
           (i%2==0)? "fill:ce8877;stroke:none" : "fill:8877dd;stroke:none" );

       data.appendChild( rect ); // add a new rect
       item.remove(); // delete the old <use> element
   }

   return use;
}

As so often happens, I ended up writing more code than I thought it would take. The above code works fine for converting data points to histogram bars (as long as you remember to give that <g> element the id attribute of "DATA" as mentioned earlier). But you need to trigger the code somehow. Answer: insert onload="changeToHistograms( )" in the <svg> element at the very top of the file.

But I wasn't done, because I also wanted to apply data labels to the histogram bars (labels like "CG," "AG," "CC," etc.) and get rid of the goofy numbers on the x-axis.

This is the function I came up with to apply the labels:


   function applyLabels( sortedNodes ) {
 
    var labels = ["aa", "ag", "at", "ac", 
      "ga", "gg", "gt", "gc", "ta", "tg", 
      "tt", "tc", "ca", "cg", "ct", "cc"];

      var data = document.getElementById( "DATA" );
 var labelIndex = 0;

 for (var i = 0; i < sortedNodes.length; i+=2) {
     var text = 
              document.createElementNS("http://www.w3.org/2000/svg", "text");
     var node = sortedNodes[i];
          text.setAttribute( "x", String( node.getAttribute("x")*1 +2) );
          text.setAttribute( "y", String( node.getAttribute("y")*1 - 13 ) );
          text.setAttribute( "style", "font-size:9pt" );
          text.textContent = labels[ labelIndex++ ].toUpperCase();
          text.setAttribute( "id", "label_" + labelIndex );
          data.appendChild( text );
      }
   }


And here's a utility function that can strip numbers off the x-axis:

   // Optional. Call this to remove ZunZun graph labels.
   // pass [1,2,3,4,5,6,7,8,9] to remove x-axis labels
   function removeZunZunLabels( indexes ) {
 
 for (var i = 0;i < indexes.length;i++) 
    try {
   document.getElementById("text_"+indexes[i]).remove();
   }
  catch(e) { console.log("Index " + i + " not found; skipped.");
   }
   } 
  
BTW, if you're wondering why I multiply so many things by one, it's because the attribute values that comprise x and y values in SVG are String objects. If you add them, you're concatenating strings, which is not what you want. To convert a number in string form to an actual JavaScript number (so you can add numbers and not concatenate strings), you can either multiply by one or explicitly coerce a string to a number by doing Number( x ).

The final result of all this looks like:


Final graph after surgery. Expected (pink) and actual (purple) frequencies of occurrence of various dinucleotide sequences in C. botulinum coding-region DNA.

Which is approximately what I wanted to see. The labels could be positioned better, but you get the idea.

What does the graph show? Well first of all, you have to realize that the DNA of C. botulinum is extremely rich in adenine and thymine (A and T): Those two bases constitute 72% of the DNA. Therefore it's absolutely no surprise that the highest bars are those that contain A and/or T. What's perhaps interesting is that the most abundant base (A), which should form 'AA' sequences at a high rate, doesn't. (Compare the first bar on the left to the shorter purple bar beside it.) This is especially surprising when you consider that AAA, GAA, and AAT are by far the most-used codons in C. botulinum. In other words, 'AA' occurs a lot, in codons. But even so, it doesn't occur as much as one would expect.

It's also interesting to compare GC with CG. (Non-biologists, note that these two pairs are not equivalent, because DNA has a built-in reading direction. The notation GC, or equivalently, GpC, means there's a guanine sitting on the 5' side of cytosine. The notation CG means there's a guanine on the 3' side of cytosine. The 5' and 3' numbers refer to deoxyribose carbon positions.) The GC combo occurs more often than predicted by chance whereas the combination CG (or CpG, as it's also written) occurs much less frequently than predicted by chance. The reasons for this are fairly technical. Suffice it to say, it's a good prima facie indicator that C. botulinum DNA is heavily methylated. Which in fact it is.

Saturday, August 10, 2013

How to Customize a ZunZun SVG Graph

[Update 7 Mar 2015: Sadly, it appears ZunZun.com has gone away. Which is too bad. It was the greatest graphic site ever.]

Being able to create high-quality graphs of data (for free) at ZunZun.com is great, but to me, what really makes ZunZun a great service is the fact that it will give you a graph in SVG format. SVG is super-powerful and super-flexible, and it's all just markup (XML), which means you don't need to own Illustrator or Photoshop in order to customize it. You just need Wordpad, Notepad, or any old text editor. And believe me, you can do some pretty fantastic things in SVG with nothing more than Wordpad as an editing tool.

In a previous post, I showed three graphs of amino acid content from the Hsp40 heat-shock proteins of three diverse groups of organisms. Because the graphs are SVG, I can combine them into a single graph very easily in Wordpad with no need to haul out Photoshop or Illustrator. That's what I did here (NOTE: the following inline SVG image will not be visible in IE; try a Web-standards-compliant browser):

Arthropods Prokaryotes Plants Obviously, before you combine two or more graphs, you should be certain the axis scales are identical in the graphs. (ZunZun will let you constrain x- and y-axis bounds.) Once you have two ZunZun graphs in SVG format, all you need to do is Copy the data points from one graph and Paste them into the other using Wordpad. The points will be in a big long list of <use> elements, each containing the mysterious notation xlink:href="#m4920679963". The latter is a back-reference (an XLink reference) to a previously defined graphic element having an id attribute with value m4920679963. You'll find the element in question in a <defs> block near the beginning of the file. Find that reference and do a global search-and-replace on m4920679963, replacing it with something sensible like "CIRCLES."  That's what the default data points are: little circles.

Whenever you paste new data points (from another SVG graph) into a new graph, and you want the new points to be visibly different from the preexisting points (for example, maybe you want one set of points to be little red dots and the other points to show as tiny black triangles), you need to go into the <defs> element at the top of the file and create a graphic primitive for the new data-point shape you want and then give it a distinct id value that you can use in XLink references later on. Let me show you how that's done.

The above graph uses green triangles for points derived from plant data (original, I know), red dots for insect data, and peach-colored squares for the bacterial data points. Here are the primitives I came up with for the data points:



<!-- TRIANGLES -->

<path d="
M0 0 
L1 0 
L.5 -.87
z " id="TRIANGLES" transform="scale(8.4)"
 style="fill:#00cc38;stroke:#000000;stroke-linecap:butt;stroke-width:0.1;"/>

<!-- SQUARES -->

<path d="
M0 0 
L1 0 
L1 1 
L0 1
z " id="SQUARES" transform="scale(4)"
 style="fill:#CFaF22;stroke:#000000;stroke-linecap:butt;stroke-width:0.21;"/>

<!-- CIRCLES -->

     <path d="
M0 1.5
C0.397805 1.5 0.77937 1.34195 1.06066 1.06066
C1.34195 0.77937 1.5 0.397805 1.5 0
C1.5 -0.397805 1.34195 -0.77937 1.06066 -1.06066
C0.77937 -1.34195 0.397805 -1.5 0 -1.5
C-0.397805 -1.5 -0.77937 -1.34195 -1.06066 -1.06066
C-1.34195 -0.77937 -1.5 -0.397805 -1.5 0
C-1.5 0.397805 -1.34195 0.77937 -1.06066 1.06066
C-0.77937 1.34195 -0.397805 1.5 0 1.5
z
" id="CIRCLES" transform="scale(1.7)"
 style="fill:#ee3344;stroke:#000000;stroke-linecap:butt;stroke-width:0.25;"/>

The syntax is way easier than it looks. Example: To make a triangle, I use a <path>  element containing the seemingly peculiar drawing commands M0 0 L1 0 L.5 -.87 z. Commands M and L mean moveto and lineto, while z means close off the shape by stroking a line from the present position to the start position (and also fill the shape) Thus, M0 0 means move to the origin (0, 0). That's the lower left corner of the triangle. L1 0 means stroke a line to x=1 and y=0, the right corner of the triangle. L.5 -.87 of course means draw a line to x,y = (0.5, -0.87). The apex of an equilateral unit triangle has an x-value of 0.5, obviously. The y-value of the apex is the sine of 60 degrees, namely 0.87, only in this case we have to make it a negative number, -0.87. Why? Because in SVG, the y-axis starts at zero and gets increasingly positive as you go down the image, not up. This is probably the single biggest source of confusion in SVG: To move a point higher. decrease its y-value. (Don't worry, you'll get used to it.)

For some weird reason, ZunZun's software spits out a huge mass of curveto commands in order to draw a circle, instead of using SVG's built-in <circle> primitive. Go figure.

Now that you know how to draw polygons in SVG, you might want to try drawing a few. Try creating an X or a cross with lineto commands (but don't use z at the end, unless you want the shape filled in addition to stroked).

Suppose you want to add a legend to your graph, as I've done in the upper right corner of the above graph. Here's the markup for it:



<!-- LEGEND -->
<g transform="translate(292 -2)">
 <use style="fill:#fefeff;stroke:#000000;stroke-linecap:butt;stroke-width:0.5;" 
x="145" xlink:href="#CIRCLES" 
y="58"/>
 <text x="153" y="60" 
        font-family="Courier" 
        font-size="11">
    Arthropods
  </text>


<use style="fill:#a0a0a0;stroke:#000000;stroke-linecap:butt;stroke-width:0.5;" 
x="144" 
y="68"
xlink:href="#SQUARES" />
 <text x="153" y="73" 
        font-family="Courier" 
        font-size="11">
    Prokaryotes
  </text>


<use transform="scale(.7)" style="fill:#fefeff;stroke:#000000;stroke-linecap:butt;stroke-width:0.25;" 
x="205"  
y="121" 
xlink:href="#TRIANGLES" />
 <text x="153" y="86" 
        font-family="Courier" 
        font-size="11">
    Plants
</text>
</g>

Notice carefully that the whole thing is wrapped in a <g> element, which is a convenience element for marking off arbitrary blocks of content, equivalent to <div> in HTML. That is to say, by itself <g> draws nothing. Why use it, then? Well, look at what I did: I placed a transform attribute inside it, in order to move (translate) everything contained in the <g> block, as a unit. Being able to position multiple items at once this way is tremendously convenient. (Are you starting to feel the power of SVG yet?)

As you can see, I use the XLink mechanism to back-reference the data point primitives (triangles, squares, circles). I also use text elements to place plain old text labels.

I hope I've been able to give you some idea of the power and flexibility of SVG with the examples shown here. SVG is an extremely sophisticated and capable medium, particularly if you start using JavaScript to modify DOM elements dynamically. It can do quite a bit more than I've shown here. In fact there's really no limit to what you can do with SVG. The limit is your imagination. If you don't believe it, check out tomorrow's blog.

Friday, August 09, 2013

More ZunZun Graphs and Biohacking

A few days ago I gave a detailed front-to-back tutorial on how to do a bit of bio-hacking. I showed how to quickly get the amino acid sequences for 25 versions of the Hsp40 (DnaJ) heat shock protein, as produced by 25 different organisms, then create a graph of arginine content versus lysine content for the 25 organisms (all bacterial and Archaea). The resulting graph looked something like this:

Lysine and arginine (mole-fraction) in the DnaJ protein of 50 microorganisms.

In this particular graph there are 50 points, because I went back to UniProt and added 25 more organisms to the mix, to see if the trend line would hold true. (The correlation actually got stronger: r=0.82.) The graph clearly shows that as lysine concentration goes up, arginine concentration goes down. Does this graph prove that lysine can take the place of arginine in DnaJ? That's not exactly what the graph says, although it's a worthwhile hypothesis. To check it out, you'd want to look at some AA sequence alignments to see if arginine and lysine are, in fact, replacing each other in the exact same spots in the protein, across organisms. Certainly it would be reasonable for lysine to replace arginine. Both are polar, positively charged amino acids.

I should point out that the organisms in this graph vary greatly in genomic G+C content. The codon AAA represents lysine, and experience has taught me (maybe you've noticed this too, if you're a biogeek) that lysine usage is a pretty reliable proxy for low G+C content. If you check the codon usage tables for organisms with low-GC genomes, you'll see that they use lysine more than any other amino acid. In Clostridium botulinum, AAA accounts for 8% of codon usage. In Buchnera it's 9%. Low G+C means high lysine usage.

Likewise, organisms with low G+C quite often have a disproportionately low frequency of 5'-CpG-3' dinucleotides in their DNA (much lower than would occur by chance). The technical explanation for this is interesting, but I'll leave it for another day. Suffice it to say, organisms with low CpG tend, by definition, not to use very many codons that begin with CG, all of which code for (guess which amino acid?) arginine.

To see if the arg-lys inverse relationship holds for higher organisms, I gathered up DnaJ sequences for 25 plants. The results:

Same idea as above, but this time using data for 50 plants (instead of bacteria).

Same negative relationship. However, note one thing: The scale of the graph's axes do not match the scale of the axes in the previous graph (further above). In this graph, we're seeing a very narrow range of frequencies for both Arg and Lys. Fortunately, ZunZun's interface makes it easy for us to re-plot the plant data using the same axis scaling as we had for our bacterial data. By constraining the x- and y-axis limits, we can re-visualize this data in an apples-to-apples context:

The plant data, re-plotted with the same axis scaling as used in the bacterial plot further above.

If you compare this graph carefully to the first graph, at the top of the page, you'll see that the points lie on pretty much the same line.

Just for fun, I went back to the UnitProt site and searched for DnaJ in insects (Arthropoda, which technically also subsumes crustacea). Then I plotted the bug data using the same x- and y-axis scaling as for the bacteria:


Same graph, this time for 62 insect DnaJ sequences.
The insect points tend to cluster higher in the graph, and much further to the left than for plants, indicating that the arthropods seem to like Arg and don't much care to use Lys. The moral, I suppose, is that if your diet is lacking in arginine, you should eat fewer plants and more insects.