Friday, May 09, 2014

A JavaScript Tip for Biohackers

Today I want to give a short bio-hacking code lesson, and I want to keep it as simple as possible in case you're just starting to use JavaScript and want to see what kinds of things it can do. This lesson could prove handy to you even if you're not a gene hacker, because it shows how easy it is to manipulate text in the browser.

For this example, I am going to assume that you have a text file open in the browser. In particular, I'm going to assume the text file consists of a listing of two gene sequences in FASTA format, which is actually very simple: a FASTA record consists of a header line that starts with the > (greater-than) angle bracket, followed by one or more lines of AGTC base sequence data (if it's a nucleotide sequence) or amino acid letters (KPRMIV etc.) if it's a protein sequence. For example:

>M. leprae MBLr_2224...    <-- this is the header
ATGGCGGTGCTGGATGTC...      <-- this is the data

Your file might have several genes in FASTA format, one after the other. The question is: How can you read this data with JavaScript?

Actually, it's extremely easy to read text data. Open any text file with your browser, then open a JavaScript window: With Firefox, use Shift-F4 to bring up the Scratchpad, or with Chrome use Control-Shift-J. To read the text into a JavaScript variable called text, just type the following line into the script editor:

text = document.body.textContent;

Execute this code with Control-L in Firefox or by simply pressing Enter in Chrome. (If Firefox fills your Scratchpad with text, bracketed by /* and */, that's good; it means the code worked. Delete it and proceed.)

Suppose you have several FASTA records in a row and you want to have an array of gene data. The easy thing to do is "split" the records at each header, discarding the header:

r = />[^\n]+\n/g;
genes = text.split( r );

NOTE: To enter more than one line of code in the Chrome console, you have to hold the Shift key down before hitting Enter. Otherwise, Enter executes the code.

The first line defines a regular expression (r) for the pattern: "greater-than symbol followed by one or more non-newline characters, followed by a newline." (The caret symbol ^ at the start of the square brackets negates whatever else is inside them.) When this code executes, genes will be an array of data, but because the text begins with a header, split() puts an empty string in front of the first match. The first item (item zero) in the array will therefore be empty, so get rid of it with:

genes.shift();

Now the genes array will contain gene data. The data for gene No. 1 will be in genes[0], the data for gene No. 2 will be in genes[1], etc.
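
If you'd like to see the split step in isolation, here is a minimal sketch. It assumes a made-up two-record FASTA string (the headers and sequences are invented for illustration) standing in for document.body.textContent, and the names sampleText, headerPattern and sampleGenes are mine, not anything built in:

```javascript
// Hypothetical sample text standing in for document.body.textContent.
const sampleText =
  ">gene one (made-up header)\n" +
  "ATGGCGGTG\n" +
  ">gene two (made-up header)\n" +
  "ATGGCAGTC\n";

// Same pattern as r: ">" followed by non-newline characters, then a newline.
const headerPattern = />[^\n]+\n/g;

const sampleGenes = sampleText.split(headerPattern);
sampleGenes.shift(); // drop the empty string that precedes the first header

console.log(sampleGenes.length); // 2
console.log(JSON.stringify(sampleGenes[0])); // "ATGGCGGTG\n"
```

Notice that each entry still carries its line break(s). That matters if you plan to index into the sequence character by character.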

Incidentally, if you want an array of headers, just do:

headers = text.match( r );

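No need to do headers.shift(). The match() operation returns only the matched headers, with no empty leading item. (Each header still includes its leading > and its trailing newline.)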
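
Here is a small sketch of the header step, again using an invented two-record string. The names fastaText, rawHeaders and cleanHeaders are hypothetical, and the cleanup (dropping the > and the newline) is my own addition rather than something the steps above require:

```javascript
// Hypothetical sample text; headers and sequences are made up.
const fastaText =
  ">gene one (made-up header)\n" +
  "ATGGCGGTG\n" +
  ">gene two (made-up header)\n" +
  "ATGGCAGTC\n";

// match() returns null when nothing matches, so fall back to an empty array.
const rawHeaders = fastaText.match(/>[^\n]+\n/g) || [];

// Optional cleanup: remove the leading ">" and the trailing newline.
const cleanHeaders = rawHeaders.map(function (h) {
  return h.slice(1).trim();
});

console.log(rawHeaders.length); // 2
console.log(cleanHeaders[0]); // gene one (made-up header)
```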

If your genes are aligned, you can compare them, base by base, in a loop. (If your genes are not aligned, create an alignment using an online ClustalW alignment tool or using the popular MEGA6 program.) The following loop construct compares the first 300 bases in two genes, and tallies the differences according to whether the difference occurred in codon base one, base two, or base three:

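snp = [0, 0, 0]; // tallies for codon base 1, 2, 3
// strip line breaks so that index i always points at a base
gene1 = genes[0].replace(/\s/g, "");
gene2 = genes[1].replace(/\s/g, "");
for (var i = 0; i < 300; i++)
   snp[ i % 3 ] += gene1[i] != gene2[i];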

You can display the results in the console simply by adding (on its own line) snp; or console.log(snp). Or if you want to see it in a dialog, execute alert(snp).

The final line of code deserves explanation. Results are placed in the snp (single nucleotide polymorphism) array according to whether the "hit" occurred in base 1, base 2, or base 3 of a codon. The i % 3 construct (i modulo 3) means divide i by 3 and throw away the "answer" but keep the remainder. (So for example, 5 % 3 equals 2, 6 % 3 equals zero, 7 % 3 equals 1, etc.) As i increments, i % 3 simply takes on values of 0, 1, 2, 0, 1, 2, etc.
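
If you want to convince yourself of that 0, 1, 2 pattern, a tiny sketch like this one will do it (the remainders array is just a throwaway name for illustration):

```javascript
// Collect i % 3 for the first seven values of i.
const remainders = [];
for (let i = 0; i < 7; i++) {
  remainders.push(i % 3);
}

console.log(remainders.join(", ")); // 0, 1, 2, 0, 1, 2, 0
```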

On the right side of the += operator we have gene1[i] != gene2[i]. This compares the two genes at an offset of i; if the bases are not equal (!=), the comparison yields true, otherwise false. The result is added numerically to the appropriate slot in snp. Note that JavaScript treats true as having a numeric value of one and false as having a numeric value of zero. Because the value already in the snp slot is a number (not a string), += performs numeric addition and converts the boolean to 1 or 0 before adding it.
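
For readers who prefer to see the tally wrapped up as a reusable piece, here is one possible sketch. The function name tallyByCodonPosition and its limit parameter are hypothetical, the two short sequences are invented, and I've made the boolean-to-number conversion explicit with Number() purely for readability:

```javascript
// Hypothetical helper: count mismatches between two aligned sequences,
// grouped by codon position (index 0 = base 1, 1 = base 2, 2 = base 3).
function tallyByCodonPosition(seqA, seqB, limit) {
  const counts = [0, 0, 0];
  // Never read past the end of the shorter sequence.
  const n = Math.min(limit, seqA.length, seqB.length);
  for (let i = 0; i < n; i++) {
    counts[i % 3] += Number(seqA[i] !== seqB[i]); // true -> 1, false -> 0
  }
  return counts;
}

// Made-up aligned sequences that differ at offsets 5 and 8.
const tally = tallyByCodonPosition("ATGGCGGTG", "ATGGCAGTC", 300);
console.log(tally.join(", ")); // 0, 0, 2
```

Both differences fall at offsets where i % 3 is 2, so they land in the third slot, which corresponds to codon base three.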

I hope this short tutorial wasn't too painful. If you're new to JavaScript, my advice is: Experiment in the console (Firefox's Scratchpad or Chrome's JS console) a lot, and get familiar with the string functions split() and match(), because they can be incredibly useful, not to mention speedy. Either of those two functions can take a string argument or a regex. With match(), remember to put a 'g' after the regex if you want the operation to occur globally throughout the string. (split() always splits at every match, with or without the 'g'.) For example:

"abcdefabcdef".match( /abc/ )

will only match the first occurrence of abc, whereas

"abcdefabcdef".match( /abc/g )

will match both occurrences of abc and give you an array of matches.
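
To see the difference side by side, here is a short sketch you can paste into the console (the variable names firstOnly, allMatches and pieces are just for illustration):

```javascript
// Without the g flag: a single match object describing the first hit.
const firstOnly = "abcdefabcdef".match(/abc/);
console.log(firstOnly[0], firstOnly.index); // abc 0

// With the g flag: a plain array containing every match.
const allMatches = "abcdefabcdef".match(/abc/g);
console.log(allMatches.length); // 2

// split() breaks at every match even without the g flag.
const pieces = "a,b,c".split(/,/);
console.log(pieces.length); // 3
```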