Saturday, February 28, 2009

Python, Twitter, and AI

Lately, I've become a bit of an AI fan -- and by that I don't mean Artificial Intelligence, but Amy Iris, whose blog I added to the Blogroll today.

If you spend a lot of time on Twitter, and especially if you read/write/eat/dream Python, you'll probably find Amy Iris's blog interesting. It's a treasure-trove of thoughtful social-web musings accompanied by a lot of really great Python sample code and real-world data from AI's experiments in Twitter-hacking. Amy is extraordinarily generous with ideas, code, and insights, and the blog is refreshingly personal. I've benefitted enormously from it. And I don't even speak a word of Python.

[Disclosure: I don't know Amy Iris, and I have no idea what her affiliations or background might be, or whether "Amy Iris" might in fact be a pseudonym, or an anagram ("is my AIR"). All I know is what I read in the blogs and tweets, and so far, it's all good. Very good.]

Friday, February 27, 2009

Why is enterprise software so bad?

Matt Asay blogged the other day (as did Michael Nygard) on why enterprise software is so shockingly bad. I couldn't help but smile sadly, nod my head, and wipe a virtual tear from my cheek.

Obviously not everything in the world of enterprise software is poorly done (there are some exciting things going on out there, after all), but the fact is, you and I know exactly what people mean when they say that enterprise software sucks.

Enterprise software is typically big, slow, and fugly, for starters. The GUI (if there is one) is often a usability disaster. Sometimes there are strange functionality holes, or things don't work the way you'd expect. And of course, a lot of the time the software is just plain unstable and ends up detonating.

I've worked in R&D for two software companies, one large and one small. Both companies made Java enterprise software, and rest assured, we shipped our share of chunkblowers.

In one case, we had created a sizable collection of connectors (priced around $15K each) designed to let our customers integrate with popular products from SAP, JD Edwards, Lawson, PeopleSoft, Siebel, IBM, and others. I took classroom training on some of the "remote systems." And I was mortified by how inferior the user-facing pieces of some very expensive middleware products are, compared to the desktop software I use every day.

When I first arrived in the enterprise software world (a wide-eyed noob), it was a shock to the senses. I noticed a number of things right off the bat:

1. Data (and in particular, data integrity) matters more than anything else. You can throw exceptions all day long. Just don't lose any data.

2. Getting the job done is top priority. How you get it done doesn't matter as much as getting it done. If mission-critical software somehow manages to "do the job" (no matter how poorly), life goes on. If not, the world comes to an end.

3. Most user interfaces are designed by developers. (Which is kind of like letting welders do plastic surgery.)

4. Usability testing (if it happens at all) happens near the end of the development cycle, after it's too late.

5. Customers, alas, don't know any more about good UI design than you do. They can only tell you what doesn't work.

6. An easy installation experience is not perceived as having anything to do with solving business problems (it's orthogonal), hence it's not a priority, hence software installation and setup tends to be a brutally punishing experience.

7. "Interoperability" and "standards friendly" means putting everything in XML (using your own inscrutable custom schema that no one else uses), then describing all of your one-off proprietary interfaces in WSDL so you can tell your customers you have a Web Services API.

8. If your customers aren't finding the software easy to use, it's because they didn't get enough training.

9. If the software is too slow, it's because you need a bigger machine.

10. Frequently heard phrases: "There's a fixpack coming." "Did you look in the logs?" "You're the first person to report this."

In a macro sense, enterprise software ends up being disappointing for two main reasons, I think. First, the process surrounding enterprise-software procurement and deployment is typically somewhat thick, involving large numbers of stakeholders and a fair amount of bureaucracy. The more bureaucracy there is, and the more people who get involved, the greater the likelihood of a failed exercise in groupthink. A lot of really poor decisions get made by well-meaning people working together in large committees. Bottom line: a flawed procurement process leads to situations where "all the checkboxes are checked" yet no one is happy.

The second thing is, making a good software product is hard. It requires extra effort. And that means extra cost. Manufacturers don't like extra costs. So there's a substantial built-in incentive to turn out software that's "good enough" (and no better).

McDonald's didn't get to be the most successful company in the fast-food business (the IBM of food, if you will) by producing great food. It's important to be clear on that. They produce food that's good enough, and they do it at a price point that hits a sweet spot. (And they have a delivery mechanism that allows customers to consume the product the way they want to, such as in the car on the way to someplace else.) Secret sauce has little to do with any of it.

I do think things are getting better. Enterprise software doesn't suck as much as it did ten years ago (or heaven forbid, twenty years ago). The pace of change has certainly quickened. Iterations are faster. Competitive pressures are higher. And customer expectations are rising.

It's still pretty bad out there, though. Which is either a curse or an opportunity, depending on where you sit and how you want to look at it.

Thursday, February 26, 2009

Stupid console tricks: Getting the names of your Twitter non-followers

If you follow a lot of people on Twitter, but some of them don't follow you back, and you'd like to know who, exactly, those insensitive, churlish dolts are that don't find your tweets amusing, I have a couple of practical tips for you (with a quick JavaScript lesson thrown in for good measure). Ready?

First tip: Go to http://friendorfollow.com and enter your Twitter name in the box. When you click the magic button, you'll see a new page appear, with the little thumbnail headshots of all your non-followers lined up in a grid. A veritable wall of infamy.

How do you harvest the usernames of these folks? Unfortunately, friendorfollow.com doesn't seem to have an Export button (unless I'm missing something). There doesn't seem to be any easy way to capture those names.

Not to worry, though. You know JavaScript.

If you're running Firefox, flip open the Firebug console (install Firebug first, if you haven't done so already; how can you live without it?). Then copy and paste the following code into the console:

// convert DOM to string
markup = ( new XMLSerializer ).serializeToString( document.body );

root = new XML( markup );    // convert string to XML

users = root..A.IMG.@alt;    // magic E4X expression

for ( var i = 0; i < users.length( ); i++ )
    console.log( users[i].toString( ) );


When you run these lines of code, all the usernames of the folks whose thumbnails are shown on the page will be written to the "output" side of the Firebug console.

Let's step through the code. The first line of code creates an XMLSerializer object (Mozilla API) and uses it to serialize the DOM starting with the 'body' node. We need to use an XMLSerializer here rather than just fetch the markup from innerHTML, because we don't want to have gnarly ill-formed HTML in the next step, lest we puke and die.

With our pristine, "tidy" (if you will) markup, we create a new E4X XML object out of it via the XML constructor and assign the result to the cleverly named variable "root."

In the highly magical third line of code, we use E4X notation to suck out all the descendant A elements (however many levels deep) under root that also have an immediate child of IMG with an "alt" attribute. The information we want (the username) is in the "alt" attribute of the IMG.

Note: 'A' and 'IMG' are capitalized because the "tidying" process that occurs in step No. 1 results in capitalization of all HTML element names. This is an important bit of canonicalization, since XML is case-sensitive.

The for-loop simply peels through our 'users' list and writes each name to the Firebug console using Firebug's console.log method. Note that the length of 'users' has to be obtained through a method call to length() rather than via direct property lookup, because 'users' is not an ordinary JavaScript array. It's an E4X node list. You have to use the length() method.
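If E4X ever gives you trouble, the same harvest can be done (less elegantly) with ordinary DOM calls. Here's a rough sketch of my own -- not the code above -- assuming, as before, that each thumbnail is an IMG inside an A with the username in its alt attribute:

// Rough non-E4X equivalent (a sketch, not the E4X version above):
// walk the anchors with plain DOM calls and log each thumbnail's alt text.
var anchors = document.getElementsByTagName( "a" );
for ( var i = 0; i < anchors.length; i++ ) {
    var img = anchors[i].getElementsByTagName( "img" )[0];
    if ( img && img.getAttribute( "alt" ) )
        console.log( img.getAttribute( "alt" ) );
}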

With not much extra work, you could wrap all the user data up in HTML tags and build a new page on the fly, and make Firefox render it in the current window, if you wanted to. Or you can just print out the thumbnail images on a piece of paper, stick it to the wall, and find some darts.

The choice is yours.

Tuesday, February 24, 2009

Adobe asks laid-off programmers to try Flex

Adobe, it seems, has a stimulus package of its own.

I've noticed that more than one Adobe Systems "technology evangelist" has been trying to entice the jobless on Twitter to try Flex Builder for free.

The tweets say: "Recently laid off? Looking to expand skills? DM me with your email address for a free license of Flex Builder."

The message has been circulated by @tpryan, @ryanstewart, @sjespers, and @ddura, all Adobe Flash Platform Evangelists. Variations of it have been circulated by other Adobe evangelistas. To wit:

@duanechaos took this approach:
Anyone want a free Flex Builder License? Applies to out of work developers only as part of an Adobe Stimulus package. Contact me offline.
@ashorten tweeted this:
if you know someone who's unemployed who'd like to get skilled on Flex then tell them to email shorten at adobe.com and I'll help
@mcorlan, an Adobe evangelist in Romania, posted:
If you have been laid of and are looking to learn Flex, please email or DM me.
Disclosure: I retweeted one of the original messages myself the first time I saw it, thinking it was just one nice Adobe guy's act of kindness. Then it turned out to be two nice Adobe guys. Then three nice Adobe guys. Then four. Then five. Then six.

I don't know. I guess it's good that Adobe lets evangelists comp people on Flex Builder; and I guess it's good that Flexvangelists reach out this way to the recently-laid-off; but to exploit Twitter as an avenue for finding out-of-work programmers who might want to convert to the Flex religion? I don't know. I don't know.

Something about it creeps me out.

The RIA wars are over (and nobody won)

It seems Adobe CFO Mark Garrett, in a Tourette's-like outburst, said recently at an industry event that Microsoft Silverlight adoption has "really fizzled out in the last 6 to 9 months."

Microsoft's Tim Sneath (director of the Windows and Silverlight technical evangelism team) answered Garrett's charge in a defensive blog post, where he listed a number of prominent (marquee) customers using Silverlight in production and made the obligatory claim of 100 million downloads. ("For starters, Silverlight 2 shipped four months ago," Sneath said, "and in just the first month of its availability, we saw over 100 million successful installations just on consumer machines.")

That makes it official: All three major contenders in the RIA-development space (Microsoft's Silverlight, Adobe AIR, and Sun's JavaFX) have now claimed 100 million downloads in X number of days.

Adobe has claimed it. Microsoft has claimed it (see above), and a couple weeks ago Jonathan Schwartz of Sun made the claim for JavaFX.

I don't know why the number 100 million is so magical. It is certainly an interesting number. It's more than the populations of Spain, Syria, and Canada combined. It's one in every 67 people on earth. Think of all those starving babies in Africa who've downloaded JavaFX. Remarkable, isn't it?

What does it say when the top three RIA-framework contenders begin flaming each other in public debates over patently ridiculous "adoption rate" numbers? (I say ridiculous because the so-called "downloads" are actually based on stealth installs. Adobe, for example, Trojans AIR into Acrobat Reader and other product installers. Sun and Microsoft are guilty of the same tactics.)

What it tells me is that there's no winner in this so-called race. Which, in turn, tells me there's been no substantial market uptake.

Any time a product becomes a success in the market, it (by definition) penetrates and dominates the market in question, marginalizing the competition. In beverages, there's Coke (big winner), then Pepsi (also-ran), then the long tail. In almost every market, it's that way: There's a dominator, an also-ran, and a long tail.

We have not reached that point yet in the three-way RIA race. We may never reach it, because it's not obvious (to me, at least) that the software industry has embraced RIA, as currently conceived, in any major way, and it might not, ever. (Try this simple test. Ask ten people: "What's the killer RIA app of all time?" See if you get any agreement -- or anything but blank stares.)

It may actually be that the RIA Wars are over, and nobody won.

In fact, if credibility is any indication, everybody lost.

Monday, February 23, 2009

All thick clients should be Mozilla-based

There's a famous quote by Mark Twain: "When I was a boy of 14, my father was so ignorant I could hardly stand to have the old man around. But when I got to be 21, I was astonished at how much the old man had learned in seven years."

That pretty much sums up my attitudes regarding Mozilla-the-dev-platform. (Not Mozilla the browser infrastructure. Mozilla the runtime framework.) I've been programming for about 21 years, and when I first encountered Mozilla-the-dev-framework 7 years ago (via the 2002 book, Creating Applications with Mozilla) I was horribly unimpressed. I thought, who'd want to put up with programming that way? It all seemed so overfactored, so byzantine. XUL, XBL ... WTH??

But a funny thing happened. Like a lot of people who learned to program in the 1980s, I went through a learning curve on things like separating presentation from application logic, the usefulness of metadata, the importance of security sandboxing, the ability to program in multiple languages (including scripting languages) in a machine- and OS-independent manner, and the advantages of a "packaged" app (artifacts and logic in separate pieces rolled up inside a deployable bundle), to say nothing of a plug-in architecture with well-defined mechanisms for installing and uninstalling executables.

When Eclipse first appeared, Eclipse sounded (to me, at the time) like everything Mozilla-the-application-dev-platform should have been. I totally drank the Eclipse Kool-Aid. And I was totally wrong. Eclipse was and is a terrible platform for creating non-IDE thick client apps: heavy, slow, awkwardly designed APIs, high learning curve, tied to a language, tied to a particular presentation technology, no HTML rendering engine, etc. etc. No amount of lip gloss will cover its flaws.

I haven't quite come full circle yet on Mozilla (I'm 90% there), but after taking another look at it (I stumbled upon this slide show by Brian King, which got me interested again), I am starting to think Mozilla is the most attractive option for thick-client development -- by far. It won't serve every need and shouldn't be worn as a straitjacket if more suitable attire is available, but for the 80% use-case I think it deserves more respect than it gets.

Long story short: If I were doing RIA development and needed a way to create relatively secure web-savvy desktop apps without having to learn a bunch of new legacy technologies with names like AIR, JavaFX, or Silverlight, the Mozilla framework is where I'd start. It's mature, it's built on standard web technologies, and it's ready to go.

Oh, and it also has no hype machine propping it up. Maybe that's what's holding it back?

Saturday, February 21, 2009

How CAPTCHAs can be beaten

I have always wondered how easy or hard it might be to write a program that can crack CAPTCHAs (those messed-up-text thingies you have to decipher in order to leave a comment on certain blogs or get a new account on Gmail). Two researchers give a great account of how they cracked Yahoo's Gimpy CAPTCHA system at http://www.cs.sfu.ca/~mori/research/gimpy/. Their algorithm works over 90 percent of the time. I don't know how old their work is or whether Yahoo has changed its CAPTCHA system in the interim. But it is interesting work nonetheless.

It is interesting to reflect on the fact that even a CAPTCHA-cracking algorithm with a dismal success rate (say 10 percent) is still plenty good enough for spammers who need to be able to create bogus e-mail accounts by the thousands via crackbots.

Judging from the amount of spam that gets by my spam filters every day, I'm pretty sure professional spammers are creating better and better CAPTCHA-defeating algorithms every day.

Friday, February 20, 2009

Continuous deployment vs. old-school QA

Timothy Fitz of IMVU has written an excellent piece on Fail Fast methodology (or "continuous deployment," in this case), explaining the benefits of putting changes into production immediately and continuously, which (in IMVU's case) does not mean nightly builds. It means several times an hour.

The main intuition here (I'll greatly oversimplify for the sake of clarity) is that you have a much greater chance of isolating the line of code that caused your build to break if you publish the build every time you change a line of code.

That sounds at once obvious and terrifying, of course, but it makes sense. And it works for IMVU, which takes in a million dollars a month serving avatars and virtual goods to several hundred thousand active users and another ten million or so occasional users.

Of course, if you have very few users, serving builds super-frequently doesn't guarantee you'll find out about bugs quickly. And if you change lots of code between 30-minute publishing cycles (or whatever interval it turns out to be), you could end up with a real troubleshooting mess, although even in that case, you'd know immediately which build to roll back to in order to get customers back to well-behaved software.

Continuous deployment doesn't guarantee good design, of course, and it's not a QA panacea. It won't keep you from introducing code or design patterns that fail on scale-out, for example. But it's still an interesting concept. More so when you consider it's not just theory: A very successful high-traffic site is built on this methodology.

Fitz's original post, incidentally (as well as his followup post), drew a ton of responses. Many of the comments on the original post were negative, explaining why Fail Fast was dangerous or wouldn't work in all situations, etc. (totally ignoring the fact that it works very well for IMVU). Comments on his followup post were much less cry-baby, much better reasoned.

Fitz as much as says, straight-out, that unit testing is overrated (which I totally agree with). Automated testing in general gets short shrift from Fitz. He notes wryly: "No automated tests are as brutal, random, malicious, ignorant or aggressive as the sum of all your users will be." Software breaks in service precisely because you can't predict in advance what will break it. It's like static analysis. The fact that code compiles without warnings doesn't mean it won't fail in service.

Fitz didn't mention a plus side to continuous deployment that I think is extremely important, which is that it puts enormous pressure on programmers to get it right the first time. It's utterly unforgiving of sloth. Can you imagine knowing that every time you do a check-in, your code goes live 15 minutes later? I think that would "incent" me to write some pretty damn solid code!

In any case, it makes for interesting food-for-thought. Kudos to Fitz. Go IMVU. You guys rock.

Thursday, February 19, 2009

A good regex resource

It's not particularly easy to find regular-expression examples on the Web, but Michael Stutz's Hone your regexp pattern-building skills (at the IBM developerworks site) contains tons of practical examples of regular expressions you can use for form validation and other purposes. While the article is aimed mainly at UNIX professionals (and therefore contains a lot of mentions of egrep and other shell-command cruft), the actual regex syntax used in the examples is POSIX-based and thus the examples should work fine in JavaScript.
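Just to illustrate the portability point (this example is mine, not one of Stutz's): a typical POSIX-style validation pattern drops straight into a JavaScript RegExp literal.

// U.S. ZIP or ZIP+4 validation -- my own example, not from the article
var zipPattern = /^[0-9]{5}(-[0-9]{4})?$/;
console.log( zipPattern.test( "12345" ) );      // true
console.log( zipPattern.test( "12345-6789" ) ); // true
console.log( zipPattern.test( "1234" ) );       // false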

If you need to bone up on regular expressions, and you like the "learn by example" approach, check out Stutz's article. It'll get you up to speed quickly.

Tuesday, February 17, 2009

20 ways to speed up your site

I found a great article at IBM's developerWorks site called Speed up your Web pages, by Marco Kotrotsos. It contains 20 very clueful tips for making a site seem snappier. They all have to do with making pages load faster. I thought I'd seen and heard most of the standard tips, tools, and techniques for this, but Marco manages to come up with a few surprises. I recommend the article.

Monday, February 16, 2009

Optimizing the performance of enterprise software



The folks at Day Software have put together a few common-sense pointers for attacking performance-optimization problems involving content management systems. Some of the advice applies pretty generally to all manner of enterprise software, so I thought I'd post David's very brief slideshow on this subject here. (He elaborates on these ideas in a blog entry here.)

I would add only a few side-comments. My main comment is that performance optimization should (in my view) be approached as a kind of debugging exercise. You want to narrow down the source of the main bottleneck. There is, in fact, only one main bottleneck. After you find (and fix) that bottleneck, you will proceed to find the new "main bottleneck," then fix it. Lather, rinse, repeat.

It's very easy to get side-tracked and waste time "optimizing" something that's completely unimportant. Careful testing will tell you what's important. What you think is important may not be.

Just as with any debugging exercise, you never start by fiddling with a million variables at once: You're not looking to find a million sources of trouble, you're looking to find the main source of trouble.

It's rare, in my experience, that performance is ever gated by several factors of more-or-less equal magnitude. Every time I have gone in search of more performance, I've found that there was always one bottleneck that greatly surpassed all others in importance. When I finally found and eliminated that main bottleneck, there'd be a new one that outstripped all others. Eliminating the first choke point might give, say, a three-fold speedup. Eliminating the second one might give another two-fold increase in performance. Each subsequent "bottleneck-removal" adds to the multiplier effect. It's not unusual that removing three bottlenecks in a row gives an overall ten-fold increase in performance (3 x 2 x 1.7, say, is roughly 10). (Anyone who has ported code to assembly language knows what I am talking about.)

Something else I'd add is that best practices always beat better hardware. (Using a clever algorithm in place of a stupid one constitutes a best practice for purposes of this discussion.) Throwing more hardware at a problem is seldom worthwhile (and sometimes leads to more performance problems, actually).

So, don't go looking for performance problems in a million places. Generally the answer is in one place.

Saturday, February 14, 2009

IDE-as-a-service: Bespin is only the beginning


Introducing Bespin from Dion Almaer on Vimeo. [ you should see a video here, but if you don't, I apologize ]

By now, news of Mozilla Labs' Bespin online code editor is all over the Web. I don't think anyone seriously doubts that it's an idea whose time has come. In fact, the timing was rather serendipitous for me, because just a couple of days before the Bespin announcement hit, I had spent a morning talking with a friend (the CTO of a well-known software company) about a range of development issues, and one of the topics we spent quite a bit of time talking about was remote development.

You can do remote development and debugging right now, of course (with things like Eclipse over RMI-IIOP, if you're crazy enough to want to do it that way). But what my friend and I were discussing was an AJAX-based emulation of Eclipse in the browser. Not actual Eclipse with all its features, but a kind of Eclipse-Light facade over a reasonably powerful set of online dev tools, which would of course include an editor with typeahead and all the rest. But that would just be the start.

We talked about things like:
  • Static analysis as a service
  • Compilation as a service
  • Import-latest-packages as a service
  • JAR-it-all-up as a service
  • Sanity checking as a service
  • Wrap-my-crappy-Javascript-code-blocks in try/catches as a service
  • Format my fugly code as a service
  • Help me oh dear God create OSGi bundles as a service
  • Translate my Java code to some-other-language as a service
  • Give me expanded tooltip help for Dojo [or other library] as a service
  • Infinite Undo as a service
And so on. (Well, maybe we didn't discuss every one of those ideas. But we should have!)

Maybe the Bespin guys can add to that list. Or better yet, implement some of it. In any case, let's not stop with just a code editor. That's way too simple. We need more than that. Lots more.

Friday, February 13, 2009

A script to add Unfollow buttons to Twitter.com/home




I've mentioned on Twitter that I have been working on a bunch of Twitter scripts (I'm up to a dozen so far) designed to do different things, running either from Greasemonkey or within OpenOffice. One of the scripts I wrote is something called AddUnfollowButtons (source code here).

Call me old-fashioned, but I still use twitter.com/home quite a lot (rather than a special Twitter client) to view tweets and add followees. I'm always adding new follows, checking them out for a while, then unfollowing the ones that spend too much time talking about their pet canary or whatever. Trouble is, it's easier to follow someone than to unfollow them, and my "following" list (people I follow) gets bigger and bigger, but hardly ever smaller.

When I'm deciding whether to follow someone, I inevitably navigate to that person's page and check out that person's last 20 or so tweets. If I like what I see, I click the Follow button, then head back to my home page.

But when I want to stop following someone, I usually know immediately. I don't need to navigate to the person's home page and check his or her last 20 tweets, because I've already seen enough of that person's tweets on my own home page to know I don't want to follow them any more.

So I needed some way to "unfollow" people one by one on my twitter.com/home timeline without leaving the page. I decided to try to write a Greasemonkey script to do that. And it works!

The script puts an "unfollow" button under each user-thumbnail beside each status update. All you do if you don't want to follow that person any more is click the button. The script does an AJAX call to Twitter (per the Twitter REST API) to remove that person from your "following" list, then refreshes the page. (It doesn't have to refresh the page ... this is AJAX, after all ... but I want it to, so I can see my "following" count decrement -- and wipe the page clean of that person's tweets.)
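For the curious, the core of the idea boils down to something like the sketch below. This is not the real script (grab that from the link above); the thumbnail markup, the alt-attribute convention, and the friendships/destroy endpoint shown here are my assumptions about the 2009-era twitter.com page and REST API.

// ==UserScript==
// @name     Add Unfollow Buttons (sketch only)
// @include  http://twitter.com/home*
// ==/UserScript==

// NOTE: a sketch, not the real script. The endpoint URL, the alt-attribute
// convention, and the credentials handling are all assumptions.
var USER = "yourname", PASS = "yourpassword"; // your Twitter credentials

function unfollow( screenName ) {
    GM_xmlhttpRequest( {
        method:  "POST",
        url:     "http://twitter.com/friendships/destroy/" + screenName + ".json",
        headers: { "Authorization": "Basic " + window.btoa( USER + ":" + PASS ) },
        onload:  function( ) { window.location.reload( ); } // refresh so the count updates
    } );
}

// Put an "unfollow" button under every thumbnail whose alt attribute
// (assumed to hold the username) is non-empty.
var images = document.getElementsByTagName( "img" );
for ( var i = 0; i < images.length; i++ ) {
    ( function( img ) {
        if ( !img.alt ) return;
        var button = document.createElement( "button" );
        button.innerHTML = "unfollow";
        button.onclick = function( ) { unfollow( img.alt ); };
        img.parentNode.appendChild( button );
    } )( images[i] );
}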

Let me know if you like the script, and if you modify it, point me to the new version so I can try it out.

Please note that you need to insert your own Twitter username and password in the code if you want to avoid a credentials challenge at runtime. (I've clearly commented the line where you need to do this. It's about two-thirds of the way down.)

Retweet this.

Wednesday, February 11, 2009

Data URLs to TinyURLs, and vice versa

Last night, I made an exciting discovery.

I discovered that you can convert data URLs (RFC 2397) to TinyURLs, which means you can poke a small GIF or PNG image -- or anything else that can be made into a data URL -- into the TinyURL database for later recovery. That means you can poke text, XML, HTML, or anything else that has a discrete MIME type, into TinyURL (and do it without violating their Terms of Service; read on).

If you're not familiar with how TinyURLs work: The folks at TinyURL.com have a database. When you send them a long URL (via the HTML form on their web site), they store that long URL in their database and hand you back a short URL. Later, you can point your browser at the tiny URL. The TinyURL folks take your incoming request, look at it, fetch the corresponding long URL from the database, and redirect your browser to the long-URL address.

Think about what this means, though. In essence, you're getting database storage for free, courtesy of TinyURL.com. Of course, you can't just poke anything you want into their database: According to the Terms of Service, the TinyURL service can only be used for URLs.

But according to IETF RFC 2397, "data:" is a legitimate scheme and data URLs are bona fide URLs. And the HTML 4 spec (Section 13.1.1) specifically mentions data URLs. I take this to mean data URLs are in fact URLs, and can therefore be stored at TinyURL.com without violating the TinyURL Terms of Service.

This leads to an interesting use case or two. Traditionally, people have talked about data URLs in the context of encoding small GIF images and such. Data URLs never caught on, because IE7 and earlier provided no support for them, and even today, IE8 (which does support some data URLs) imposes security constraints that make it hard for IE users to deal with all possible varieties of data URL. But IE is the exception. All other modern browsers have built-in support for data URLs.

It's important to understand, you aren't limited to using data URLs to express just tiny images. Anything that can be URL-encoded (and that has a well-known MIME type) can be expressed as a data URL. Here is a JavaScript function for converting HTML markup to a data URL:

function toDataURL( html ) { // convert markup to a data URL
    var preamble = "data:text/html;charset=utf-8,";
    // encodeURIComponent (rather than escape) encodes UTF-8 characters correctly
    var escapedString = encodeURIComponent( html );
    return preamble + escapedString;
}
Try this simple experiment. Run the above code in the Firebug console (if you use the Firebug extension for Firefox), passing it an argument of

"<html>" + document.documentElement.innerHTML + "</html>"

which will give you the data URL for the currently visible page. Of course, if you try to navigate to the resulting data URL, it may not render correctly if the page contains references to external resources (scripts, CSS, etc.) using relative URLs, because now the "host" has changed and the relative URLs won't work. Even so, you should at least be able to see all the page's text content, with any inlined styles rendered correctly.
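In other words, the whole experiment, run from the console, is just a couple of lines (my consolidation of the steps above):

// Snapshot the current page as a data URL and navigate to it
var snapshot = toDataURL( "<html>" + document.documentElement.innerHTML + "</html>" );
console.log( snapshot.length + " characters" );
window.location = snapshot;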

Still not getting it? Try going to the following URL (open it in a new window):

http://tinyurl.com/c7ug9a

(Note to Internet Explorer users: Don't expect this to work in your browser.)

You should see the web page for IETF's RFC 2119. However, note carefully, you're not visiting the IETF site. (Look in your browser's address bar. It's a data URL.) The entire page is stored at TinyURL.com and is being delivered out of their database.

Obviously I don't advocate storing other people's web content at TinyURL.com; this was just a quick example to illustrate the technique.

One thing that's quite interesting (to me) is that unlike other "URL rewriting" services, the TinyURL folks don't seem to mind if your URL is quite long. I haven't discovered the upper limit yet. What you'll find, I think, is that the practical upper limit is set by your browser. I seem to recall that Mozilla has a hard limit of 8K on data-URL length (someone please correct me). It's browser-implementation dependent.

Here are some possible use-cases for TinyURL data-URL usage:
  • Encode a longer-than-140-characters comment that you want to point Twitter followers to. Storing it at TinyURL means you don't have to host the comment on your own site.

  • You could create simple blog pages that only come from TinyURL's database. Domainless hosting.

  • You could encode arbitrary XML fragments as data-URLs and store them in TinyURL, then retrieve them as needed via Greasemonkey AJAX calls. (This would be a cool way to store SVG images.)

  • You could passivate JavaScript objects as JSON, convert the JSON to data-URLs, and store them in the TinyURL database for later use. (A minimal sketch of this idea follows right after this list.)
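Here's a minimal sketch of that last idea (mine, and untested against any particular length limit), assuming a JSON implementation is available -- native in newer browsers, or via Crockford's json2.js:

// Round-trip a JavaScript object through a data URL.
// Shortening the result is then just a matter of handing it to TinyURL.
function objectToDataURL( obj ) {
    return "data:application/json;charset=utf-8," +
           encodeURIComponent( JSON.stringify( obj ) );
}

function dataURLToObject( dataURL ) {
    var payload = dataURL.substring( dataURL.indexOf( "," ) + 1 );
    return JSON.parse( decodeURIComponent( payload ) );
}

var original = { user: "kasthomas", tags: [ "tinyurl", "data-url" ] };
var url = objectToDataURL( original );
console.log( url );                          // data:application/json;charset=utf-8,%7B%22user%22...
console.log( dataURLToObject( url ).user );  // "kasthomas"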

I'm sure there are many other possibilities. (Maybe you can post some in a comment on this blog?)

Someone will say "But isn't TinyPaste or ShortText.com designed for exactly this sort of thing? Why use TinyURL?" The answer is, with TinyURL, you get back the actual resource, not a web page containing a bunch of ads and CSS and other cruft wrapping your content. With data URLs, the URL is the content.

Please retweet this if you find the idea interesting, and let me know what you decide to build with it. (Of course, after this blog, TinyURL folks may decide to modify their Terms of Service. But let's hope not.)

Thursday, February 05, 2009

OpenCalais-OpenOffice mashup

OpenCalais is one of the most innovative and potentially disruptive online services to hit the Web in recent memory. To understand its importance, you have to be a bit of a geek, preferably a text-analytics or computational-linguistics geek, maybe an information-access or "search" geek, or a reasonably technical content-technology freak who understands the potential uses of metadata. It's not easy to sum up OpenCalais in a few words. Suffice it to say, though, if you haven't heard of OpenCalais before, you should visit http://www.opencalais.com. It's an interesting undertaking, to be sure.

One of the services OpenCalais exposes is automatic extraction of entity metadata from text. If you call the OpenCalais service using the proper arguments, you can essentially pass it any kind of text content you want (an article, a blog, a Wikipedia entry, an Obama speech, whatever) and the service will hand you back an itemized list of the entities it detected in the text. "Entities" means things like names of persons, cities, states or provinces, countries, prices, e-mail addresses, industry terms -- almost anything that would qualify as a "proper noun" or a term with special significance (not just keywords).

The OpenCalais service brings back more than a list of terms. It also reports the number of occurrences of the terms and a relevancy score for each term. The latter is a measure of the relative semantic importance of the term in question to the text in question. This score can help in determining cut-offs for automatic tagging, ordering of metadata in tag clouds, and other purposes. It's a way to get at the "aboutness" of a document.
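As a small illustration of how you might use those scores (this is my own sketch, not anything OpenCalais ships), here's a browser-side function that parses the "simple format" XML shown further down and keeps only the entities above a relevance cut-off:

// Filter OpenCalais "simple format" entities by relevance (sketch).
// calaisXML is assumed to hold a CalaisSimpleOutputFormat response as a string.
function entitiesAboveCutoff( calaisXML, cutoff ) {
    var doc = new DOMParser( ).parseFromString( calaisXML, "text/xml" );
    var nodes = doc.getElementsByTagName( "*" );
    var keepers = [ ];
    for ( var i = 0; i < nodes.length; i++ ) {
        var rel = parseFloat( nodes[i].getAttribute( "relevance" ) );
        if ( !isNaN( rel ) && rel >= cutoff )
            keepers.push( { type: nodes[i].nodeName,
                            name: nodes[i].textContent,
                            relevance: rel } );
    }
    return keepers;
}

// With the sample output below, entitiesAboveCutoff( response, 0.6 ) keeps
// Bernard Madoff (0.771), USD (0.686), and Alan English (0.606).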

OpenCalais does many, many things beyond entity extraction. But you should already be able to imagine the many downstream disruptions that could occur, for example, in enterprise search if Text-Analytics-as-a-Service (or heaven forbid, Machine-Learning-as-a-Service) were to catch on bigtime.

The OpenCalais API is still growing and evolving (it's at an early stage), but it's already amazingly powerful, yet easy to use. Writing a semantic AJAX app is a piece of cake.

My first experiment with OpenCalais involved OpenOffice. I use OpenOffice intensively (as a direct replacement for the Microsoft Office line of shovelware), and although OpenOffice (like Office) has more than its fair share of annoyances, it also has some features that are just plain crazy-useful, such as support for macros written in any of four languages (Python, Basic, BeanShell, and JavaScript). The JavaScript binding is particularly useful, since it's implemented in Java and allows you to tap the power of the JRE. But I'm getting ahead of myself.

What I decided to try to do is create an OpenOffice macro that would let me push a button and have instant entity-extraction. Here's the use case: I've just finished writing a long business document using OpenOffice, and now I want to develop entity metadata for the document so that it's easier to feed into my company's Lucene-based search system and shows up properly categorized on the company intranet. To make it happen, I want (as a user) to be able to highlight (select) any portion of the document's text, or all of it, then click a button and have OpenOffice make a silent AJAX call to the OpenCalais service. Two seconds later, the metadata I want appears, as if by magic, at the bottom of the last page of the document, as XML. (Ideally, after reviewing the XML, I would be able to click an "Accept" button and have the XML vanish into the guts of the .odf file.)

I wrote a 160-line script that does this. The source code is posted at http://sites.google.com/site/snippetry/Home/opencalais-macro-for-openoffice. Please note that the code won't work for you until you get your own OpenCalais license key and plug it into the script at line No. 145. For space reasons, I'm not going to explain how to install an OpenOffice macro (or create a toolbar button for it after it's installed). That's all standard OpenOffice stuff.

The key to understanding the OpenCalais macro is that all we're doing is performing an HTTP POST programmatically using Java called from JavaScript. Remember that the JavaScript engine in OpenOffice is actually the same Rhino-based engine that's part of the JRE. This means you can instantiate a Java object using syntax like:

var url = new java.net.URL( "http://www.whatever.url" );

Opening a connection and POSTing data to a remote site over the wire is straightforward, using standard Java conventions. The only tricky part is crafting the parameters expected by OpenCalais. It's all well-documented on the OpenCalais site, fortunately. Lines 105-120 of the source code show how to query OpenCalais for entity data. You have to send a slightly ungainly chunk of XML in your POST. No big deal.
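Stripped of the OpenCalais-specific parameters, the POST mechanics look roughly like this (a sketch of my own, not the macro itself; the exact request body OpenCalais expects is in the full source linked above):

// Generic HTTP POST from OpenOffice JavaScript (Rhino), using Java classes.
function httpPost( urlString, body, contentType ) {
    var url = new java.net.URL( urlString );
    var conn = url.openConnection( );
    conn.setDoOutput( true );
    conn.setRequestMethod( "POST" );
    conn.setRequestProperty( "Content-Type", contentType );

    // send the request body
    var writer = new java.io.OutputStreamWriter( conn.getOutputStream( ) );
    writer.write( body );
    writer.flush( );
    writer.close( );

    // read the response back as one string
    var reader = new java.io.BufferedReader(
        new java.io.InputStreamReader( conn.getInputStream( ) ) );
    var line, response = "";
    while ( ( line = reader.readLine( ) ) != null )
        response += line + "\n";
    reader.close( );
    return response;
}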

For testing purposes, I ran my script against text that I cut and pasted into OpenOffice from an Associated Press news story about Bernard Madoff's customer list (the investment advisor who showed his clients how to make a small fortune out of a large one). OpenCalais generated the following metadata in roughly three seconds:

<!-- Use of the Calais Web Service is governed by the Terms of Service located at http://www.opencalais.com. By using this service or the results of the service you agree to these terms of service. -->

<!--City: NEW YORK, Danbury, Brookline, Oceanside, Pembroke Pines, West Linn, Company: Associated Press, CNN, Country: Switzerland, Kenya, Cayman Islands, Currency: USD, Event: Person Communication and Meetings, Facility: Wall Street, World Trade Center, Hall of Fame, IndustryTerm: Internet support group, MedicalCondition: brain injury, Movie: World Trade Center, NaturalFeature: San Francisco Bay, Long Island, Organization: U.S. Bankruptcy Court, Person: Bernard Madoff, Alan English, Patricia Brown, Bob Finkin, Bonnie Sidoff, Evelyn Rosen, Teri Ryan, Lynn Lazarus Serper, Sharon Cohen, Sandy Koufax, Jordan Robertson, Neill Robertson, Samantha Bomkamp, ADAM GELLER, John Malkovich, Bernie Madoff, Allen G. Breed, Larry King, Rita, Mike, Nancy Fineman, Larry Silverstein, ProvinceOrState: Florida, Oregon, New York, Massachusetts, Connecticut, Technology: ADAM, -->

<OpenCalaisSimple>
  <Description>
    <calaisRequestID>a1d28b3b-4ef7-4aa6-b293-8df46ea5e988</calaisRequestID>
    <id>http://id.opencalais.com/KG8hyw2LGKjgRnJnRN86FQ</id>
    <about>http://d.opencalais.com/dochash-1/6010f15f-bb32-3e59-9b55-c8fef29d38ed</about>
  </Description>
  <CalaisSimpleOutputFormat>
    <Person count="38" relevance="0.771">Bernard Madoff</Person>
    <Person count="20" relevance="0.606">Alan English</Person>
    <Person count="16" relevance="0.370">Patricia Brown</Person>
    <Currency count="11" relevance="0.686">USD</Currency>
    <Person count="10" relevance="0.524">Bob Finkin</Person>
    <Person count="9" relevance="0.586">Bonnie Sidoff</Person>
    <Person count="8" relevance="0.574">Evelyn Rosen</Person>
    <Person count="6" relevance="0.378">Teri Ryan</Person>
    <Person count="5" relevance="0.373">Lynn Lazarus Serper</Person>
    <ProvinceOrState count="4" relevance="0.578" normalized="Florida,United States">Florida</ProvinceOrState>
    <City count="3" relevance="0.428" normalized="New York,New York,United States">NEW YORK</City>
    <Facility count="2" relevance="0.165">Wall Street</Facility>
    <NaturalFeature count="2" relevance="0.398">San Francisco Bay</NaturalFeature>
    <ProvinceOrState count="2" relevance="0.305" normalized="Oregon,United States">Oregon</ProvinceOrState>
    <Event count="2">Person Communication and Meetings</Event>
    <City count="1" relevance="0.051" normalized="Danbury,Connecticut,United States">Danbury</City>
    <City count="1" relevance="0.078" normalized="Brookline,Massachusetts,United States">Brookline</City>
    <City count="1" relevance="0.058" normalized="Oceanside,New York,United States">Oceanside</City>
    <City count="1" relevance="0.134" normalized="Pembroke Pines,Florida,United States">Pembroke Pines</City>
    <City count="1" relevance="0.283" normalized="West Linn,Oregon,United States">West Linn</City>
    <Company count="1" relevance="0.031" normalized="Associated Press">Associated Press</Company>
    <Company count="1" relevance="0.286" normalized="Time Warner Inc.">CNN</Company>
    <Country count="1" relevance="0.249" normalized="Switzerland">Switzerland</Country>
    <Country count="1" relevance="0.249" normalized="Kenya">Kenya</Country>
    <Country count="1" relevance="0.249" normalized="Cayman Islands">Cayman Islands</Country>
    <Facility count="1" relevance="0.286">World Trade Center</Facility>
    <Facility count="1" relevance="0.286">Hall of Fame</Facility>
    <IndustryTerm count="1" relevance="0.104">Internet support group</IndustryTerm>
    <MedicalCondition count="1" relevance="0.078">brain injury</MedicalCondition>
    <Movie count="1" relevance="0.286">World Trade Center</Movie>
    <NaturalFeature count="1" relevance="0.141">Long Island</NaturalFeature>
    <Organization count="1" relevance="0.289">U.S. Bankruptcy Court</Organization>
    <Person count="1" relevance="0.031">Sharon Cohen</Person>
    <Person count="1" relevance="0.286">Sandy Koufax</Person>
    <Person count="1" relevance="0.031">Jordan Robertson</Person>
    <Person count="1" relevance="0.031">Neill Robertson</Person>
    <Person count="1" relevance="0.031">Samantha Bomkamp</Person>
    <Person count="1" relevance="0.297">ADAM GELLER</Person>
    <Person count="1" relevance="0.286">John Malkovich</Person>
    <Person count="1" relevance="0.279">Bernie Madoff</Person>
    <Person count="1" relevance="0.031">Allen G. Breed</Person>
    <Person count="1" relevance="0.286">Larry King</Person>
    <Person count="1" relevance="0.141">Rita</Person>
    <Person count="1" relevance="0.260">Mike</Person>
    <Person count="1" relevance="0.083">Nancy Fineman</Person>
    <Person count="1" relevance="0.286">Larry Silverstein</Person>
    <ProvinceOrState count="1" relevance="0.058" normalized="New York,United States">New York</ProvinceOrState>
    <ProvinceOrState count="1" relevance="0.078" normalized="Massachusetts,United States">Massachusetts</ProvinceOrState>
    <ProvinceOrState count="1" relevance="0.051" normalized="Connecticut,United States">Connecticut</ProvinceOrState>
    <Technology count="1" relevance="0.297">ADAM</Technology>
    <Topics>
      <Topic Score="0.403" Taxonomy="Calais">Business_Finance</Topic>
    </Topics>
  </CalaisSimpleOutputFormat>
</OpenCalaisSimple>

Think about it: With the push of a button, a person creating text in a word processor can generate semantically rich metadata in real time without leaving the word-processor environment, using no IT resources. And at no cost. (OpenCalais is free to all.) With very little work, the entire process could be made to happen 100% transparently. The author doesn't even have to know anything's happening over the wire. At check-in time, his or her document is already a good semantic citizen, by magic.

This is just one tiny example of what can be done with the technology. Maybe you can come up with others? If so, please keep me in the loop. I'm interested in knowing what you're doing with OpenCalais.

Retweet this.

OpenCalais: Metadata-as-a-Service


By now, you've probably heard of OpenCalais, the free (as in free) online metadata-extraction service. If you haven't heard of it (my God, man, where have you been?) you should drop what you're doing right now (yes, now; I'll wait) and go immediately to the OpenCalais web site and drink from the fire hydrant. This is a game-changer in the making. You owe it to yourself to be on this bus with both feet.

Basically, what we're dealing with here is Metadata-as-a-Service: OpenCalais is a text analytics web service (created by Thomson Reuters) with SOAP and REST APIs. It's "open" not in the sense of source code, but of open access. Anyone can use the service for any reason (commercial or personal). Calais point man Thomas Tague explains the motivations behind OpenCalais in a Rob McNealy podcast interview, remarkable for (among other things) Tague's surprising explanation of why a megalith like Reuters would offer such a powerful, valuable online service for free.

I've been experimenting with the API (more on that tomorrow) and I have to say, I'm impressed. You can query the OpenCalais service, sending it text in any of several formats, and receive metadata back in your choice of RDF, "text/simple", or Microformats (great if you're wanting big lists of rel-tags). The "text/simple" format is great for entity extraction, and I'll give source code for how to do that tomorrow.

Response data comes back in your choice of XML or JSON. So yeah, it means what you think it means: Text analytics is now the province of AJAX. And it's free. Mash away.

Like I say, I've been experimenting with the APIs, and I've come up with a fairly impressive (if I may say so) little demo, written in JavaScript, that will erase any residue of doubt in anyone's mind as to how powerful the Metadata-as-a-Service metaphor is. Return here tomorrow for the demo, with source code.

Meanwhile, Twitter me at @kasthomas if you have questions. And please retweet this.

Wednesday, February 04, 2009

A Semantic Web Crash Course

I finally found a wonderfully terse, easy-to-follow "executive summary" (not exactly short, though) explaining How to publish Linked Data on the Web: In other words, how to make your site Semantic-Web-ready.

If you've struggled with trying to visualize how the various pieces of the Semantic Web fit together (all the RDF-based standards, for example), and you still feel as though you aren't quite grasping the big picture, go read How to publish Linked Data on the Web. It'll bring you up to speed fast. The authors deserve special mention (join me in a polite round of applause, if you will):
Chris Bizer (Web-based Systems Group, Freie Universität Berlin, Germany)
Richard Cyganiak (Web-based Systems Group, Freie Universität Berlin, Germany)
Tom Heath (Knowledge Media Institute, The Open University, Milton Keynes, UK)
I hope vendors in the content-management space will get to work producing tools aimed at helping people implement "highly semantic sites" (tm). Search 2.0-and-up will rely heavily on linked data, and the advantages of a linked-data-driven Web (in which thousands of one-off Web APIs could be consolidated down to scores or hundreds) will become apparent quickly once the ball starts rolling.