Wednesday, February 11, 2009

Data URLs to TinyURLs, and vice versa

Last night, I made an exciting discovery.

I discovered that you can convert data URLs (RFC 2397) to TinyURLs, which means you can poke a small GIF or PNG image, or anything else that can be made into a data URL, into the TinyURL database for later recovery. That means you can poke text, XML, HTML, or anything else that has a discrete MIME type, into TinyURL (and do it without violating their Terms of Service; read on).

If you're not familiar with how TinyURLs work: The folks at TinyURL.com have a database. When you send them a long URL (via the HTML form on their web site), they store that long URL in their database and hand you back a short URL. Later, you can point your browser at the tiny URL. The TinyURL folks take your incoming request, look at it, fetch the corresponding long URL from the database, and redirect your browser to the long-URL address.

Think about what this means, though. In essence, you're getting database storage for free, courtesy of TinyURL.com. Of course, you can't just poke anything you want into their database: According to the Terms of Service, the TinyURL service can only be used for URLs.

But according to IETF RFC 2397, "data:" is a legitimate scheme and data URLs are bona fide URLs. And the HTML 4 spec (Section 13.1.1) specifically mentions data URLs. I take this to mean data URLs are in fact URLs, and can therefore be stored at TinyURL.com without violating the TinyURL Terms of Service.

This leads to an interesting use case or two. Traditionally, people have talked about data URLs in the context of encoding small GIF images and such. Data URLs never caught on, because IE7-and-earlier provided poor support for them, and even today IE8 (which does support some data URLs) imposes security constraints that make it hard for IE users to deal with all possible varieties of data URL. But IE is the exception: all other modern browsers have built-in support for data URLs.

It's important to understand that you aren't limited to using data URLs to express just tiny images. Anything that can be URL-encoded (and that has a well-known MIME type) can be expressed as a data URL. Here is a JavaScript function for converting HTML markup to a data URL:

function toDataURL( html ) { // convert markup to a data URL

    var preamble = "data:text/html;charset=utf-8,";
    // use encodeURIComponent(), not the deprecated escape(), so that
    // non-ASCII characters survive as proper UTF-8 percent-escapes
    var escapedString = encodeURIComponent( html );
    return preamble + escapedString;
}
Try this simple experiment. Run the above code in the Firebug console (if you use the Firebug extension for Firefox), passing it an argument of

"<html>" + document.documentElement.innerHTML + "</html>"

which will give you the data URL for the currently visible page. Of course, if you try to navigate to the resulting data URL, it may not render correctly if the page contains references to external resources (scripts, CSS, etc.) using relative URLs, because now the "host" has changed and the relative URLs won't work. Even so, you should at least be able to see all the page's text content, with any inlined styles rendered correctly.

Still not getting it? Try going to the following URL (open it in a new window):

http://tinyurl.com/c7ug9a

(Note to Internet Explorer users: Don't expect this to work in your browser.)

You should see the web page for IETF's RFC 2119. However, note carefully, you're not visiting the IETF site. (Look in your browser's address bar. It's a data URL.) The entire page is stored at TinyURL.com and is being delivered out of their database.

Obviously I don't advocate storing other people's web content at TinyURL.com; this was just a quick example to illustrate the technique.

One thing that's quite interesting (to me) is that unlike other URL-rewriting services, the TinyURL folks don't seem to mind if your URL is very long. I haven't discovered the upper limit yet. What you'll find, I think, is that the practical upper limit is set by your browser: I seem to recall that Mozilla has a hard limit of 8K on data-URL length (someone please correct me). It's browser-implementation dependent.

Here are some possible use cases for storing data URLs at TinyURL:
  • Encode a longer-than-140-character comment that you want to point Twitter followers to. Storing it at TinyURL means you don't have to host the comment on your own site.

  • You could create simple blog pages that only come from TinyURL's database. Domainless hosting.

  • You could encode arbitrary XML fragments as data-URLs and store them in TinyURL, then retrieve them as needed via Greasemonkey AJAX calls. (This would be a cool way to store SVG images.)

  • You could passivate JavaScript objects as JSON, convert JSON objects to data-URLs, and store them in the TinyURL database for later use.

I'm sure there are many other possibilities. (Maybe you can post some in a comment on this blog?)

Someone will say: "But isn't TinyPaste or ShortText.com designed for exactly this sort of thing? Why use TinyURL?" The answer is that with TinyURL, you get back the actual resource, not a web page containing a bunch of ads and CSS and other cruft wrapping your content. With data URLs, the URL is the content.

Please retweet this if you find the idea interesting, and let me know what you decide to build with it. (Of course, after this blog, TinyURL folks may decide to modify their Terms of Service. But let's hope not.)

Thursday, February 05, 2009

OpenCalais-OpenOffice mashup

OpenCalais is one of the most innovative and potentially disruptive online services to hit the Web in recent memory. To understand its importance, you have to be a bit of a geek, preferably a text-analytics or computational-linguistics geek, maybe an information-access or "search" geek, or a reasonably technical content-technology freak who understands the potential uses of metadata. It's not easy to sum up OpenCalais in a few words. Suffice it to say, though, if you haven't heard of OpenCalais before, you should visit http://www.opencalais.com. It's an interesting undertaking, to be sure.

One of the services OpenCalais exposes is automatic extraction of entity metadata from text. If you call the OpenCalais service using the proper arguments, you can essentially pass it any kind of text content you want (an article, a blog, a Wikipedia entry, an Obama speech, whatever) and the service will hand you back an itemized list of the entities it detected in the text. "Entities" means things like names of persons, cities, states or provinces, countries, prices, e-mail addresses, industry terms -- almost anything that would qualify as a "proper noun" or a term with special significance (not just keywords).

The OpenCalais service brings back more than a list of terms. It also reports the number of occurrences of the terms and a relevancy score for each term. The latter is a measure of the relative semantic importance of the term in question to the text in question. This score can help in determining cut-offs for automatic tagging, ordering of metadata in tag clouds, and other purposes. It's a way to get at the "aboutness" of a document.

OpenCalais does many, many things beyond entity extraction. But you should already be able to imagine the many downstream disruptions that could occur, for example, in enterprise search if Text-Analytics-as-a-Service (or heaven forbid, Machine-Learning-as-a-Service) were to catch on bigtime.

The OpenCalais API is still growing and evolving (it's at an early stage), but it's already amazingly powerful, yet easy to use. Writing a semantic AJAX app is a piece of cake.

My first experiment with OpenCalais involved OpenOffice. I use OpenOffice intensively (as a direct replacement for the Microsoft Office line of shovelware), and although OpenOffice (like Office) has more than its fair share of annoyances, it also has some features that are just plain crazy-useful, such as support for macros written in any of four languages (Python, Basic, BeanShell, and JavaScript). The JavaScript binding is particularly useful, since it's implemented in Java and allows you to tap the power of the JRE. But I'm getting ahead of myself.

What I decided to try to do is create an OpenOffice macro that would let me push a button and have instant entity-extraction. Here's the use case: I've just finished writing a long business document using OpenOffice, and now I want to develop entity metadata for the document so that it's easier to feed into my company's Lucene-based search system and shows up properly categorized on the company intranet. To make it happen, I want (as a user) to be able to highlight (select) any portion of the document's text, or all of it, then click a button and have OpenOffice make a silent AJAX call to the OpenCalais service. Two seconds later, the metadata I want appears, as if by magic, at the bottom of the last page of the document, as XML. (Ideally, after reviewing the XML, I would be able to click an "Accept" button and have the XML vanish into the guts of the .odf file.)

I wrote a 160-line script that does this. The source code is posted at http://sites.google.com/site/snippetry/Home/opencalais-macro-for-openoffice. Please note that the code won't work for you until you get your own OpenCalais license key and plug it into the script at line No. 145. For space reasons, I'm not going to explain how to install an OpenOffice macro (or create a toolbar button for it after it's installed). That's all standard OpenOffice stuff.

The key to understanding the OpenCalais macro is that all we're doing is performing an HTTP POST programmatically using Java called from JavaScript. Remember that the JavaScript engine in OpenOffice is actually the same Rhino-based engine that's part of the JRE. This means you can instantiate a Java object using syntax like:

var url = new java.net.URL( "http://www.whatever.url" );

Opening a connection and POSTing data to a remote site over the wire is straightforward, using standard Java conventions. The only tricky part is crafting the parameters expected by OpenCalais. It's all well-documented on the OpenCalais site, fortunately. Lines 105-120 of the source code show how to query OpenCalais for entity data. You have to send a slightly ungainly chunk of XML in your POST. No big deal.

For testing purposes, I ran my script against text that I cut and pasted into OpenOffice from an Associated Press news story about Bernard Madoff's customer list (the investment advisor who showed his clients how to make a small fortune out of a large one). OpenCalais generated the following metadata in roughly three seconds:

<!-- Use of the Calais Web Service is governed by the Terms of Service located at http://www.opencalais.com. By using this service or the results of the service you agree to these terms of service. -->
<!--City: NEW YORK, Danbury, Brookline, Oceanside, Pembroke Pines, West Linn, Company: Associated Press, CNN, Country: Switzerland, Kenya, Cayman Islands, Currency: USD, Event: Person Communication and Meetings, Facility: Wall Street, World Trade Center, Hall of Fame, IndustryTerm: Internet support group, MedicalCondition: brain injury, Movie: World Trade Center, NaturalFeature: San Francisco Bay, Long Island, Organization: U.S. Bankruptcy Court, Person: Bernard Madoff, Alan English, Patricia Brown, Bob Finkin, Bonnie Sidoff, Evelyn Rosen, Teri Ryan, Lynn Lazarus Serper, Sharon Cohen, Sandy Koufax, Jordan Robertson, Neill Robertson, Samantha Bomkamp, ADAM GELLER, John Malkovich, Bernie Madoff, Allen G. Breed, Larry King, Rita, Mike, Nancy Fineman, Larry Silverstein, ProvinceOrState: Florida, Oregon, New York, Massachusetts, Connecticut, Technology: ADAM, -->
<OpenCalaisSimple>
  <Description>
    <calaisRequestID>a1d28b3b-4ef7-4aa6-b293-8df46ea5e988</calaisRequestID>
    <id>http://id.opencalais.com/KG8hyw2LGKjgRnJnRN86FQ</id>
    <about>http://d.opencalais.com/dochash-1/6010f15f-bb32-3e59-9b55-c8fef29d38ed</about>
  </Description>
  <CalaisSimpleOutputFormat>
    <Person count="38" relevance="0.771">Bernard Madoff</Person>
    <Person count="20" relevance="0.606">Alan English</Person>
    <Person count="16" relevance="0.370">Patricia Brown</Person>
    <Currency count="11" relevance="0.686">USD</Currency>
    <Person count="10" relevance="0.524">Bob Finkin</Person>
    <Person count="9" relevance="0.586">Bonnie Sidoff</Person>
    <Person count="8" relevance="0.574">Evelyn Rosen</Person>
    <Person count="6" relevance="0.378">Teri Ryan</Person>
    <Person count="5" relevance="0.373">Lynn Lazarus Serper</Person>
    <ProvinceOrState count="4" relevance="0.578" normalized="Florida,United States">Florida</ProvinceOrState>
    <City count="3" relevance="0.428" normalized="New York,New York,United States">NEW YORK</City>
    <Facility count="2" relevance="0.165">Wall Street</Facility>
    <NaturalFeature count="2" relevance="0.398">San Francisco Bay</NaturalFeature>
    <ProvinceOrState count="2" relevance="0.305" normalized="Oregon,United States">Oregon</ProvinceOrState>
    <Event count="2">Person Communication and Meetings</Event>
    <City count="1" relevance="0.051" normalized="Danbury,Connecticut,United States">Danbury</City>
    <City count="1" relevance="0.078" normalized="Brookline,Massachusetts,United States">Brookline</City>
    <City count="1" relevance="0.058" normalized="Oceanside,New York,United States">Oceanside</City>
    <City count="1" relevance="0.134" normalized="Pembroke Pines,Florida,United States">Pembroke Pines</City>
    <City count="1" relevance="0.283" normalized="West Linn,Oregon,United States">West Linn</City>
    <Company count="1" relevance="0.031" normalized="Associated Press">Associated Press</Company>
    <Company count="1" relevance="0.286" normalized="Time Warner Inc.">CNN</Company>
    <Country count="1" relevance="0.249" normalized="Switzerland">Switzerland</Country>
    <Country count="1" relevance="0.249" normalized="Kenya">Kenya</Country>
    <Country count="1" relevance="0.249" normalized="Cayman Islands">Cayman Islands</Country>
    <Facility count="1" relevance="0.286">World Trade Center</Facility>
    <Facility count="1" relevance="0.286">Hall of Fame</Facility>
    <IndustryTerm count="1" relevance="0.104">Internet support group</IndustryTerm>
    <MedicalCondition count="1" relevance="0.078">brain injury</MedicalCondition>
    <Movie count="1" relevance="0.286">World Trade Center</Movie>
    <NaturalFeature count="1" relevance="0.141">Long Island</NaturalFeature>
    <Organization count="1" relevance="0.289">U.S. Bankruptcy Court</Organization>
    <Person count="1" relevance="0.031">Sharon Cohen</Person>
    <Person count="1" relevance="0.286">Sandy Koufax</Person>
    <Person count="1" relevance="0.031">Jordan Robertson</Person>
    <Person count="1" relevance="0.031">Neill Robertson</Person>
    <Person count="1" relevance="0.031">Samantha Bomkamp</Person>
    <Person count="1" relevance="0.297">ADAM GELLER</Person>
    <Person count="1" relevance="0.286">John Malkovich</Person>
    <Person count="1" relevance="0.279">Bernie Madoff</Person>
    <Person count="1" relevance="0.031">Allen G. Breed</Person>
    <Person count="1" relevance="0.286">Larry King</Person>
    <Person count="1" relevance="0.141">Rita</Person>
    <Person count="1" relevance="0.260">Mike</Person>
    <Person count="1" relevance="0.083">Nancy Fineman</Person>
    <Person count="1" relevance="0.286">Larry Silverstein</Person>
    <ProvinceOrState count="1" relevance="0.058" normalized="New York,United States">New York</ProvinceOrState>
    <ProvinceOrState count="1" relevance="0.078" normalized="Massachusetts,United States">Massachusetts</ProvinceOrState>
    <ProvinceOrState count="1" relevance="0.051" normalized="Connecticut,United States">Connecticut</ProvinceOrState>
    <Technology count="1" relevance="0.297">ADAM</Technology>
    <Topics>
      <Topic Score="0.403" Taxonomy="Calais">Business_Finance</Topic>
    </Topics>
  </CalaisSimpleOutputFormat>
</OpenCalaisSimple>

Think about it: With the push of a button, a person creating text in a word processor can generate semantically rich metadata in real time without leaving the word-processor environment, using no IT resources. And at no cost. (OpenCalais is free to all.) With very little work, the entire process could be made to happen 100% transparently. The author doesn't even have to know anything's happening over the wire. At check-in time, his or her document is already a good semantic citizen, by magic.

This is just one tiny example of what can be done with the technology. Maybe you can come up with others? If so, please keep me in the loop. I'm interested in knowing what you're doing with OpenCalais.

Retweet this.

OpenCalais: Metadata-as-a-Service


By now, you've probably heard of OpenCalais, the free (as in free) online metadata-extraction service. If you haven't heard of it (my God, man, where have you been?), you should drop what you're doing right now (yes, now; I'll wait) and go immediately to the OpenCalais web site and drink from the fire hydrant. This is a game-changer in the making. You owe it to yourself to get on this bus early.

Basically, what we're dealing with here is Metadata-as-a-Service: OpenCalais is a text analytics web service (created by Thomson Reuters) with SOAP and REST APIs. It's "open" not in the sense of source code, but of open access. Anyone can use the service for any reason (commercial or personal). Calais point man Thomas Tague explains the motivations behind OpenCalais in a Rob McNealy podcast interview, remarkable for (among other things) Tague's surprising explanation of why a megalith like Reuters would offer such a powerful, valuable online service for free.

I've been experimenting with the API (more on that tomorrow) and I have to say, I'm impressed. You can query the OpenCalais service, sending it text in any of several formats, and receive metadata back in your choice of RDF, "text/simple", or Microformats (great if you want big lists of rel-tags). The "text/simple" format is well suited to entity extraction, and I'll post source code showing how to do that tomorrow.

Response data comes back in your choice of XML or JSON. So yeah, it means what you think it means: Text analytics is now the province of AJAX. And it's free. Mash away.

Like I say, I've been experimenting with the APIs, and I've come up with a fairly impressive (if I may say so) little demo, written in JavaScript, that will erase any residue of doubt in anyone's mind as to how powerful the Metadata-as-a-Service metaphor is. Return here tomorrow for the demo, with source code.

Meanwhile, Twitter me at @kasthomas if you have questions. And please retweet this.

Wednesday, February 04, 2009

A Semantic Web Crash Course

I finally found a wonderfully lucid, easy-to-follow "executive summary" (not exactly short, though) explaining How to publish Linked Data on the Web: in other words, how to make your site Semantic-Web-ready.

If you've struggled with trying to visualize how the various pieces of the Semantic Web fit together (all the RDF-based standards, for example), and you still feel as though you aren't quite grasping the big picture, go read How to publish Linked Data on the Web. It'll bring you up to speed fast. The authors deserve special mention (join me in a polite round of applause, if you will):
Chris Bizer (Web-based Systems Group, Freie Universität Berlin, Germany)
Richard Cyganiak (Web-based Systems Group, Freie Universität Berlin, Germany)
Tom Heath (Knowledge Media Institute, The Open University, Milton Keynes, UK)
I hope vendors in the content-management space will get to work producing tools aimed at helping people implement "highly semantic sites" (tm). Search 2.0-and-up will rely heavily on linked data, and the advantages of a linked-data-driven Web in terms of enabling thousands of Web APIs to be conflated down to scores or hundreds will become apparent quickly once the ball starts rolling.

Tuesday, February 03, 2009

Why super( ) sucks

One complaint I heard someone make recently, in the context of JavaScript not having a true inheritance model, is that there is no super() in JavaScript. Somebody, in a forum somewhere, actually whined and moaned about not being able to call super(). I believe the whiner was a Java programmer.

There shouldn't be a super() in Java, either, though. That's the real issue.

I'm flabbergasted that anyone thinks super() is a meaningful thing to have to write, in any language. What could be more obscure and arcane than super()? It's totally cryptic. It's shorthand for "go invoke a method of my parent that I happen to have intimate knowledge of. Never mind the side effects, I'm clairvoyant enough to understand all that, even if my parent's concrete implementation changed without my knowing it."

I thought secret knowledge and hidden dependencies were supposed to be evil.

Monday, February 02, 2009

Twitter traffic still soaring


(Click on the graph for a larger version. Or go to quantcast for more.)

Thank goodness there's something in this economy that isn't slowing down.

Sunday, February 01, 2009

Inheritance as Antipattern

Allen Holub tells of once attending a Java user group meeting where James Gosling was the featured speaker. According to Holub, during the Q&A session, someone asked Gosling: "If you could do Java over again, what would you change?" Gosling replied: "I'd leave out classes."

Holub recalls: "After the laughter died down, he explained that the real problem wasn't classes per se, but rather implementation inheritance: the extends relationship."

I bring this story up because it seems a lot of people still think inheritance (supposedly the cornerstone of OOP) is good. Those same people want to impose the inheritance model on JavaScript. Which to me would be a terrible thing to do. I wouldn't go so far as to say inheritance is evil, even though many experts have indeed said exactly that. But it is certainly the most misused feature of Java. It ruins most otherwise-good APIs, I've found. (Google's Joshua Bloch has observed the same thing.) In the real world, inheritance tends to be an antipattern.

Inheritance violates encapsulation, undercutting the most basic of OOP principles.

Quite simply: Inheritance requires children to understand their parents (which I can tell you from personal experience is a dangerous assumption).

Subclassing leads to bloat (something Java needs more of...), because children inherit the methods of their entire ancestry chain. Which leads to things like JMenu having 433 methods.

It also locks new classes into preexisting concrete implementations, which introduces brittleness. A change in an ancestral method can break children unexpectedly. This is a well known drawback of inheritance.

Here is a verbatim quote from the Java API documentation for the Properties class:

Because Properties inherits from Hashtable, the put and putAll methods can be applied to a Properties object. Their use is strongly discouraged as they allow the caller to insert entries whose keys or values are not Strings. The setProperty method should be used instead. If the store or save method is called on a “compromised” Properties object that contains a non-String key or value, the call will fail.

This sort of thing has an odor about it. It reeks of poor design.

There's plenty more to be said on this subject, but it's been said elsewhere and I won't regurgitate needlessly. And again, I have to stress, I don't consider inheritance evil so much as misused. More on that some other time.

The thing that bothers me is that so many Java programmers who haven't taken the time to grok Brendan Eich's motivations for making JavaScript the way it is (drill into some of the links at this page to get a tiny taste of what I'm talking about) think JavaScript's compositionality-based prototype model is a flaw, or at the very least an egregious oversight. Hardly. The language was designed that way for a reason.

Gosling, Eich, Bloch, Holub, all know what they're talking about. Inheritance is overrated.

Saturday, January 31, 2009

Script for bypassing Google's "site may harm your computer" page

There was an outbreak of the bogus "visiting this web site may harm your computer" warning-page redirection on Google this morning. Apparently there have been occurrences of this phenomenon before (judging from blogs going back to 2007). You run a search on Google, and all of a sudden every hit has a warning link under it that says "visiting this web site may harm your computer", and if you try to go to the page in question, you get directed to a Google warning page that urges you not to go to the actual page you want.

On Twitter, people began labelling the problem #GOOGLEMAYHARM, which of course is phonetically similar to GOOGLE MAYHEM.

Naturally, I went to work on a Greasemonkey script to fix the situation. And naturally, in the time it took me to write the script, Google fixed the silly redirection thing.

In any event, if you are seeing the "harmful site" warning, here's a Greasemonkey script that should allow you to bypass the Google redirection page:

// ==UserScript==
// @name GoogleHitFixer
// @namespace fixer
// @include http://www.google.com/*
// ==/UserScript==

// Routes around the bogus warning page that says
// "visiting this web site may harm your computer"

// Public domain. Author: Kas Thomas

( function main( ) {

    var signature = "interstitial?url";
    var address = location.toString( );

    // not on the interstitial warning page? do nothing
    if ( address.indexOf( signature ) == -1 )
        return;

    // the real destination rides in the "url" query parameter
    var newUrl = address.split( "?url=" )[1];

    location.href = newUrl;

} )( );

Friday, January 30, 2009

"Crux" app wins JCR Cup

Day Software announced the winner of the JCR Cup 08 competition today. College sophomore Russell Toris won top prize (taking home a MacBook Pro) with a little web app called "Crux" (a shameless play on CRX, which is Day's commercial Java Content Repository).

I managed to learn a tiny bit more about Crux. And from what I've seen, it is indeed a clever use of JSR-170 technology.

What it lets you do is copy and paste arbitrary selections from any web page that's open in your browser, and save them straight to a JSR-170 repository (in this case, Day CRX, which is built atop Apache Jackrabbit). When you want to retrieve the selection(s) again, you can browse the repository and open them again in your browser.

Why is this useful? Here's the use case. Suppose you've got a dozen tabs open in Firefox (because you're researching a term paper) and you want to save references to the various content items you've been looking at. The conventional thing to do is bookmark all the open pages. But the problem with bookmarks is that they don't actually encapsulate any content from the pages you were on: They just encapsulate URLs and page titles (which are often meaningless).

With Crux, you highlight and Copy content selections from pages, then push those items into the repository with the click of a button. (Of course, you have to have a repository server running somewhere, reachable via HTTP.) When you want the clipped items again, you visit one URL (the node in the repository where the items are stored), and there are all your snippets, viewable in a single summary page. And they render nicely since Crux saves actual selection-source markup, not just raw text. Any embedded links, images, etc., in the clipped content are still there. Also, each entry in Crux contains a trackback link to the original source page, in case you really do need to go back to the page in question.

If you think about it, saving content clippings is actually a very compelling alternative to bookmarking. A bookmark is just an address. What you care about is the content, not the address. I have hundreds of bookmarks already. I can't keep them straight. They just keep piling up, and I can't remember what most of them are for. (Even the ones I use a lot, I sometimes have trouble finding again.) Crux provides a useful alternative.

How do you find something in the repository after you've pushed hundreds of content items into it with Crux? You use whatever repository search tools you'd normally use. Only this time, you can actually run full-text searches on the content items you stored, rather than searching page names in your Bookmarks collection.

Functionality similar to Crux is available via Clipmarks. Also, Microsoft tries to do some of this with its Onfolio and OneNote products (which are, IMHO, painfully klutzy). Crux looks and feels very light and simple. It definitely hits a sweet spot.

Whether Crux's source code will ever see the light of day, I don't know. (Entrants in the JCR Cup competition were not required to make source code public.) Reportedly, the code is all JavaScript and requires Greasemonkey.

In any event, congratulations Russell Toris! And kudos to Day for sponsoring the competition. It's nice to see JCR being used for something practical, lightweight, and simple. Well done.

Google Measurement Labs?

Google has introduced yet another service, called Google Measurement Labs, designed to test your connection speed and provide various types of information about your last-mile chokepoints.

I have read Google's own announcement about this as well as several blogs that try to explain it, and honestly, I still can't fathom the true motivation(s) behind it or why the heck anyone outside of academia (or perhaps the NSA) would even care. Obviously, Google has an interest in last-mile problems (the Internet is its lifeblood), but offering this set of diagnostics to the general public gives the impression that Google is very proudly answering a question nobody asked.

I don't get it.

Wednesday, January 28, 2009

The energy cost of SSL

I just finished reading a paper called The Energy Cost of SSL in Deeply Embedded Systems, by Sun Microsystems researchers Vipul Gupta and Michael Wurm. Fascinating stuff.

It turns out that secure communication over SSL shortens battery life by approximately 15% in very small (mote-like) wireless devices that use SSL. The size of such devices (commonly used as sensors in manufacturing, but soon to be all around us, if you believe the sci-fi hype) makes them extraordinarily sensitive to anything that draws electrical current, including computation. In a mote, it's not uncommon for 5% of the available energy from a pair of alkaline batteries to be consumed by SSL handshakes, 10% by polling, 25% by SSL data transfer, and the remaining 60% by the device itself. Those ratios will be different for non-secure (non-SSL) data transfer. If you do the apples-to-apples energy balance, the SSL mote pays an energy penalty of 15%, overall, for security.

The authors of the paper don't discuss things like efficient versus inefficient implementations (in assembly language) of handshake algorithms (such as Elliptic Curve), but obviously a poor implementation could significantly affect performance. An unfriendly chip architecture could affect things too. The authors do mention that the particular chip they used (TI MSP430) "offers a rotate instruction which speeds up SHA1 and MD5 by almost 40%."

Motes aren't ubiquitous yet, but hopefully by the time they are, they'll be powered by something other than batteries (e.g., ambient light), so that we don't have to worry about SSL causing even more zinc and manganese to enter the environment when worn-out mote batteries find their way into landfills. Imagine that: SSL as an environmental threat . . .

Tuesday, January 27, 2009

Microsoft aims to patent CSS extensions

I came across an interesting patent application from Microsoft (published 15 January 2009) called Extended Cascading Style Sheets in which Microsoft extols the virtues of something called CSSX (which I suppose means CSS Extensions). From the Abstract:
A CSSX (Extended Cascading Style Sheets) file including non-CSS (Cascading Style Sheet) extensions is used to define and reference variables and inheritance sets. A CSSX file compiler determines a value of the defined variable, modifies the CSSX file by replacing all references to the defined variable with the value, and generates the CSS file from the modified CSSX file. The inheritance set is defined in the CSSX file and includes a reference to a previously defined CSS rule set. The CSSX file compiler defines a new CSS rule set as a function of the determined attributes included in the previously defined CSS rule set of the defined inheritance set and generates the CSS file including the newly defined CSS rule set.
From what I can tell, Microsoft is proposing adding #defines (and other precompiler-looking stuff) to Cascading Style Sheets so that a last-minute "compile pass" on the server will generate CSS of the correct flavor for a given page request (correct as to localization, reading direction, accessibility, etc.) -- all done dynamically, just in time. The intent is clearly to eliminate the need for webmasters and others to create and manage multiple hard-coded flavors of the same stylesheet. In fact, CSSX aims to make CSS more compositional all the way around. (The patent talks about introducing new inheritance notions into CSS, for example.)
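To make the idea concrete, here's a toy sketch (in JavaScript) of the "variables" half of such a compile pass. The @define/$name syntax below is entirely made up -- the patent application doesn't publish a concrete grammar -- but the collect-definitions, replace-references, emit-plain-CSS flow is what the abstract describes:

```javascript
// Toy CSSX-style compile pass: gather variable definitions,
// strip them from the source, substitute references, emit plain CSS.
function compileCssx(src) {
  const vars = {};
  // Collect and remove "@define name value;" declarations (invented syntax)
  src = src.replace(/@define\s+(\S+)\s+([^;]+);\s*/g, function (_, name, value) {
    vars[name] = value.trim();
    return "";
  });
  // Replace $name references with their defined values
  return src.replace(/\$(\w+)/g, function (m, name) {
    return name in vars ? vars[name] : m;
  });
}

const css = compileCssx(
  "@define brand #0a5;\n" +
  "h1 { color: $brand; }\n"
);
console.log(css); // "h1 { color: #0a5; }"
```

Run server-side per request, the same pass could pick different variable values for localization or reading direction -- which is the "just in time" part of the claim.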

Of course, there are drawbacks to consider. CSSX is not as easy to read or maintain as CSS (but I suppose if your development tools are good enough, this won't matter so much). CSSX is more verbose than CSS. It's doubtless harder to QA-test. But the main drawback, I think, is that it tends to mix presentation logic with non-presentation logic. That's a dangerous place to go.

Unfortunately, Microsoft wants to patent CSSX when it should actually be working with a standards body on it. Does the world really need another proprietary "standard" from Redmond, at this point? What's the point in extending a standard, then trying to patent it?

That part seems really, really stupid to me.

Sunday, January 25, 2009

Most Google employee options are under water

I didn't listen to the recent Google conference call, but according to someone who did, 85% of Google employee stock options are now under water. That's got to put a damper on "company spirit" for everyday workers. I've been in this situation myself (i.e., working for a company where everybody has options, but the stock is hopelessly far below the strike price). It is a dreadful feeling, especially if the company you work for has a great history, a great culture, and brilliantly engineered products that should be doing much better in the market than they are.

Everyone knows, of course, that there is no guarantee a company's stock price will go up over time, and most employees are mature about this realization. But it still hurts. Under-water options hurt.

Knowing full well that this kind of thing saps employee enthusiasm and causes the wrong kind of water-cooler conversation, Google last week announced a new option-repricing plan for employees. The features of the plan:
  • It is a one-for-one, voluntary exchange.
  • The offer period begins on January 29, 2009 and ends at 6:00 a.m. Pacific Time on March 3, 2009, unless Google is required or opts to extend the offer period.
  • Employees will be able to exchange their under-water options for new options with a strike price equal to the closing price of Google stock on March 2, 2009.
  • The new options will have a new vesting schedule that adds 12 months to the original vesting schedule.
As it turns out, a company I worked for did this same thing. It offered employees the chance to roll over their existing options into new ones based on a new (current) strike price. But the vesting date moved out. If you were almost-vested (perhaps already vested) in your worthless options, you lost your vesting.

The problem with resetting the clock, of course, is that if the stock keeps sinking, you're still screwed. Also, if you have to be an employee in order to see your options continue to vest, who's to say you'll still be working for the company in a year?

Options have expiration dates. The company I worked for set a shorter expiration date for the new options (in this plan) than the original options had. So the time window for you to see a gain was narrowed. I don't know if that's the case with the new Google plan.

Bottom line, options (as an employee incentive) are tricky. In good times, they do work as an incentive. In bad times, they work as a disincentive (from what I've witnessed). Repricing plans don't always work out. (In the case of the company I worked for, it did not work to the employees' benefit.) In fact, repricing plans generally tend to favor the company, in one way or another. I believe that's the case here. Otherwise, I don't think Google would offer the plan at all.

Friday, January 23, 2009

JSON beautifier

The other day, I wanted to take a look at my Firefox bookmarks file. I could have exported my bookmarks to an HTML file using the Organize Bookmarks dialog, but instead I wanted to just use the existing bookmarks file (the private copy Firefox already uses). It turns out Firefox keeps archives of your bookmarks in

C:\Documents and Settings\[USER]\Application Data\Mozilla\Firefox\Profiles\bookmarkbackups

(on Windows)

and they are formatted as JSON! Trouble is, the JSON text has no newlines or tabs or other spacing, so if you open the bookmarks file in Notepad, you'll see One Big Huge Line of unformatted text.

Unformatted JSON is ugly. But fortunately, there's an answer.

Over at http://archive.dojotoolkit.org/nightly/dojotoolkit/dojox/gfx/demos/beautify.html there's an online form that will beautify (pretty-print to your screen) any raw JSON that you paste into the form. It does an exceptionally nice job. Give it a try if you have a need to reformat JSON source.
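For the record, if you'd rather not paste your bookmarks into a web form, modern JavaScript engines (and Crockford's json2.js) can do the same reformatting in a couple of lines, assuming the input actually parses as JSON:

```javascript
// Round-trip raw JSON through parse/stringify to reindent it.
// The third stringify argument controls the indent width.
function beautifyJson(raw, indent) {
  return JSON.stringify(JSON.parse(raw), null, indent || 2);
}

const ugly = '{"title":"bookmarks","children":[{"uri":"http://example.com"}]}';
console.log(beautifyJson(ugly));
```

The bookmark file's structure survives intact; only the whitespace changes.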

Tuesday, January 20, 2009

What politicians and company blogs have in common

Forrester Research, in a report called Time To Rethink Your Corporate Blogging Ideas, has confirmed what some of us have long suspected, which is that company blogs are viewed with distrust by the overwhelming majority of people who read them.

Josh Bernoff's research found that out of 18 different possible sources of information (ranging from personal e-mails to newspapers and TV to wikis and online classifieds), corporate blogs rank at the very bottom of the trust scale (18th place), with only 16% of people who read them saying that they trust them.

By comparison, 15% of Americans say they trust politicians (ref).

I was able to download a free copy of the $279 Forrester report at http://www.forrester.com/imagesV2/uplmisc/Josh_blogging.pdf. Hopefully the link will still work when you go there.

Monday, January 19, 2009

Data retrieval resource list

I came across this web page that contains a large and interesting list of online resources pertaining to information retrieval and search technologies. It includes books, courses, research articles, SEO tips, and much more.

Scroll down to get to the good stuff.

Saturday, January 17, 2009

Dr. Dobb's is (un)dead

This is an incredibly sad day for me.

Dr. Dobb's Journal, one of the great programming resources of the late DOS/early Windows era, has finally died, a victim (ironically) of the Internet's triumph over pulp-and-ink.

The venerable programmer's magazine hasn't exactly gone away entirely: It will (somewhat sadly) continue as "Dr. Dobb's Report — A Special Software Development Monthly Section in InformationWeek Magazine."

But that, too, has the smell of death about it.

To say that I owe a lot to DDJ is an understatement. DDJ was a critical part of my programming education. Allen Holub's early DDJ articles on the newfangled C language taught me a huge amount about programming and profoundly influenced my development as a coder. (Eventually, in 1991, I even wrote an article myself for DDJ.)

It's a sad thing, this disappearance of the printed word, this seemingly unstoppable deprecation of protons and neutrons. Magazines, newspapers, books, music CDs -- all on the endangered species list. Is all of human culture destined to be disseminated by coax cable and microwave radiation?

If you'll excuse me, I have to be alone right now.

Friday, January 16, 2009

The carbon cost of a Google search

Thursday, January 15, 2009

Adware author tells all

There's a truly fascinating interview with Ruby/Lisp/Scheme/C programmer (and onetime adware creator) Matt Knox over at http://philosecurity.org. Anybody who has always wondered how adware works, and why it's so infuriatingly difficult to get rid of, needs to read that interview.

It so happens, I recently spent several hours ridding my son's machine of a particularly nasty adware furball. I was able to eradicate most of it, but there were some peculiar registry entries I couldn't get rid of no matter what I tried. Immutable registry entries.

Now I know why such entries can exist.

Matt Knox explains how, in his days working for Direct Revenue (the firm Eliot Spitzer sued a couple of years ago for -- ahem -- propagating Trojans), he created unwritable registry keys by exploiting a little-known difference between the Win32 API and the NT API. "Windows, ever since XP, is fundamentally built on top of the NT kernel," Matt Knox explains. "NT is fundamentally a Unicode system, so all the strings internally are 16-bit Unicode. The Win32 API is fundamentally ASCII. There are strings that you can express in 16-bit counted Unicode that you can’t express in ASCII." (Um, yeah: NT stores strings as counted Unicode, so they can legally contain embedded null characters. Win32's ASCII-oriented view treats strings as null-terminated, so a name with an embedded null looks truncated -- or simply can't be addressed -- by any process that asks for it the C way.)

Matt continues: "That meant that we could, for instance, write a Registry key that had a null in the middle of it. Since the user interface is based on the Win32 API, people would be able to see the key, but they wouldn’t be able to interact with it because when they asked for the key by name, they would be asking for the null-terminated one."
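You can see the counted-versus-terminated mismatch without going anywhere near the registry. JavaScript strings, like NT's, are counted UTF-16 and can embed U+0000 freely (the key name below is made up for illustration):

```javascript
// A counted string with an embedded null character.
const keyName = "Run\u0000Hidden"; // hypothetical registry key name

// A counted-string API (like NT's native one) sees the whole name:
console.log(keyName.length); // 10

// A null-terminated (C-style) consumer stops at the first NUL,
// so it sees -- and asks for -- only the truncated prefix:
const cStyleView = keyName.slice(0, keyName.indexOf("\u0000"));
console.log(cStyleView); // "Run"
```

Ask the system for "Run" and you get nothing useful; the real key is "Run\0Hidden", and only a counted-string API can name it.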

This is just one example (cited by Knox) of the countless Microsoft design weirdnesses that have led to the tragic security mess that is Windows. This sort of thing is why the Spybot database now contains almost half a million entries, and also why Norton security updates (and Windows updates) will soon be eating 99 percent of available CPU cycles on machines connected to the Internet. And if you read between the lines of Matt Knox's interview, you'll understand that the mischief is really only just beginning.

Take my advice. Read the interview. It's an eye-opener.

Wednesday, January 14, 2009

Making server outages scalable

I mentioned not long ago the server outages that ensued when Windows 7 beta downloads became available. This is not supposed to happen in the world of cloud computing, of course. You're not supposed to be able to bring down "the cloud."

But there've been a number of high-profile cloud failures. Just in the past 90 days:
Question: Where are you supposed to stand when the sky is falling?

Tuesday, January 13, 2009

Monday, January 12, 2009

The stampede away from Vista accelerates

An editor from a well-known "information technology" publication recently asked me to name some tech trends that I thought would be important this year. I told him the stampede to Windows 7 would break Richter scales around the globe and possibly affect the earth's rotation.

Looks like the madness has already begun.

Friday, January 09, 2009

Most Google products make no money

There's a poignant table at Google Blogoscoped that gives a detailed breakdown of 87 Google "products" and services, with an explanation of how they work and what they cost.

The interesting part is that only about 20 of the 87 have an associated revenue model. True, you only need one good one (one really profitable product). But still, why so much entropy?


Thursday, January 08, 2009

No DAM middle ground

Yesterday, eWeek interviewed me for a story, "Midmarket Digital Asset Management Firms Disappearing." The thrust of it is that there are damn few DAM solutions for small to midsize businesses, specifically businesses that need a solution that's scalable and plays well with typical SMB IT infrastructure, all for under $100K. The situation is somewhat odd, given that there are tons of CMS vendors in the $10K to $50K range (plus open-source offerings). In the DAM world there aren't even any serious open-source contenders.

I think there's an opportunity here (arguably) for someone to build a powerful but affordable DAM application that will run atop (or alongside?) an open-source CMS.

Failing that, I'd be happy if someone would just build an Adobe Bridge-to-Alfresco connector.

Monday, January 05, 2009

When Certs Collide



A couple months ago, I mentioned that some Russians had cracked WiFi WPA2 security using a GeForce 8800 graphics processor. I also speculated on what a determined person might be able to do with the fearsome power of multiple Sony PS3 machines networked together.

Now we know. You can hack MD5 security. It's been done: Researchers Jake Appelbaum, Arjen Lenstra, David Molnar, Dag Arne Osvik, Alex Sotirov, Marc Stevens, and Benne de Weger successfully used 200 PlayStation 3s (see photo, above) to craft a rogue Certification Authority certificate, based on finding hash collisions in MD5-space. The 40-slide deck describing the work is available here.

According to the researchers, 200 PlayStation 3s are roughly equivalent to 8,000 desktop PCs, and buying the equivalent processing power to forge an MD5-signed cert would cost about $20K of Amazon cloud time.

Crafting a rogue CA cert means (essentially) that the crackers were able to confer Certification Authority status on themselves. What's hilarious is that the bogus cert contains no revocation URL and thus can't (easily) be revoked! For demo purposes, the hackers back-dated their cert to August 2004. A malicious hacker could create a cert that never expires.

After you read the slide deck, you won't know whether to laugh or cry. I did both.

Thursday, December 11, 2008

LinkedIn Flex group hits membership limit

From Adobe Technical Evangelist Ben Forta comes word that the LinkedIn Flex Developers Group is now full to capacity and can accept no more members.

Apparently, LinkedIn groups can have a maximum of 3,000 members, and that's how many the Flex-dev group now has. (I wonder who the genius was who hard-coded that limit?) Forta says he is sitting on 50-some-odd requests to join and can't approve them. He pinged LinkedIn Customer Service to ask if there was a way to raise the limit; for now, the answer is no. "If the limit gets raised," Forta says, "I'll let you know (and will approve those in the queue)."

Tuesday, December 09, 2008

SlingPostServlet demystified

One of the neatest things about Apache Sling (the JCR-based application framework) is its easy-to-use REST API, which allows you to do CRUD operations against a Java Content Repository using ordinary HTML web forms (or should I say, ordinary HTTP POST and GET), among many other interesting capabilities. The magic happens by way of a class called SlingPostServlet. Understanding that class is key, if you want to leverage the power of Sling without writing actual Java code.

Turns out there's an exceptionally thorough (and readable) discussion of the many capabilities of the SlingPostServlet at the Sling incubation area of Apache.org. You can think of it as the fully exploded version of Lars Trieloff's Cheat Sheet for Sling (an excellent resource). It's the next best thing to reading the source code.
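To give a flavor of how little code this takes: POSTing ordinary form-encoded parameters to a node's path is enough to create or update that node. This JavaScript sketch just builds the request body -- the property names follow the Sling conventions described in that documentation, while the node path and host in the comments are placeholders for your own instance:

```javascript
// Build an application/x-www-form-urlencoded body for SlingPostServlet.
// Each key becomes a property on the target node.
function slingPostBody(props) {
  return Object.entries(props)
    .map(function (pair) {
      return encodeURIComponent(pair[0]) + "=" + encodeURIComponent(pair[1]);
    })
    .join("&");
}

const body = slingPostBody({
  "title": "Hello Sling",
  "jcr:primaryType": "nt:unstructured",
});
console.log(body);
// POST this body to, e.g., http://localhost:8080/content/myNode and
// Sling creates (or updates) the node at /content/myNode -- no Java required.
```

An ordinary HTML form with those field names pointed at the same URL does exactly the same thing, which is the whole charm of the REST API.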

Monday, December 08, 2008

MS Office apps as services

According to Information Week, Tibco and OpenSpan have "teamed up to make parts of Microsoft's Office applications available as services for inclusion in an enterprise service-oriented architecture."

OpenSpan has (in the words of a rather breathless reporter) "demonstrated that it's possible to generate mashups of Microsoft Office applications without changing the underlying application code."

On a superficial level, this is the kind of thing that people do routinely with OpenOffice running in server mode. (Various content management systems use OO to do document transformations on the server. The OO developer documentation shows how to set this up.)

I gather the OpenSpan stuff has tooling for making it easy to create Office-service mashups. It's a good idea and I wish such tooling existed for OpenOffice. It'll be interesting to see if Tibco and OpenSpan score any OEM deals with CMS vendors.

Friday, December 05, 2008

How OSGi changed one person's life

Peter Kriens has written a really nice article for ACM Queue called How OSGi Changed My Life. Go here to read it online. It's a high-level overview for people who are still trying to grok the whole OSGi phenomenon.

OSGi is a game-changing technology, IMHO, because it brings familiar SOA precepts to ordinary POJO programming. (How's that for acronym abuse?) POJOs end up having fewer unwanted intimacies, and if you run them inside Spring inside OSGi, the POJOs don't have to know so much about the runtime framework, either. Compositionality is greatly facilitated in OSGi; the level of abstraction is high; the benefits are numerous and far-reaching. I see OSGi as revitalizing Java programming for enterprise.

Good tooling for OSGi is still scarce. (Doing "Hello World!" is much harder than it should be.) I suspect that will change very soon, though. Meanwhile, OSGi is quite pervasive already (it's in quite a few products, though seldom advertised), and I look for 2009 to be the year when OSGi finally goes double-platinum.

Tuesday, December 02, 2008

Exception-throwing antipatterns

Tim McCune has written an interesting article called Exception-Handling Antipatterns, at http://today.java.net (a fine place to find articles of this kind, BTW). The comments at the end of the article are every bit as stimulating as the article itself.

McCune lists a number of patterns that (I find) are very widely used (nearly universal, in fact) in Java programming, such as log-and-rethrow, catch-and-ignore, and catch-and-return-null; all considered evil by McCune. My comment is: If those are antipatterns, the mere fact that such idioms are so ubiquitous in real-world Java code says more about the language than it does about programmers.
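For the record, McCune's examples are Java, but the antipatterns port to any language with exceptions. Here's catch-and-return-null (and one way out of it) sketched in JavaScript rather than Java:

```javascript
// Antipattern: catch-and-return-null. The caller must now null-check,
// and the original error -- stack trace and all -- is silently discarded.
function readConfigBad(parse, raw) {
  try {
    return parse(raw);
  } catch (e) {
    return null;
  }
}

// Better: let the exception propagate, or rethrow it with added context
// so the failure surfaces where someone can actually act on it.
function readConfigBetter(parse, raw) {
  try {
    return parse(raw);
  } catch (e) {
    throw new Error("config parse failed: " + e.message);
  }
}
```

Call `readConfigBad(JSON.parse, "not json")` and you get `null` with no hint of why; the second version fails loudly, with the cause attached.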

I've always had a love-hate relationship with the exception mechanism. On the whole, I think it is overused and overrated, at least in the Java world, where people seem to get a little nutty about inventing (and subclassing) custom exceptions and ways to handle them, when they should probably spend that energy writing better code to begin with.

Monday, December 01, 2008

Get paid for being job-interviewed

At last, an online service that deters head hunters from pestering me.

The unusual promise made by NotchUp is that potential employers who want to contact you directly (avoiding the expensive services of a professional placement agency) will actually pay you to agree to an interview. All you have to do is sign up with NotchUp, and wait for the phone to ring. (And wait. And wait.)

How much will you get paid? The NotchUp folks have put a fee calculator on their site. It shows that an IT professional with 10 years of experience can expect to receive $380 per interview.

NotchUp membership is free to interviewees, but you have to go through an application process and be accepted. Which already sounds fishy to me.