Wednesday, December 09, 2009

NoSQL Required Reading

I've been trying to follow the fast-moving world of NoSQL lately, and -- like a visit to the carnival funhouse -- it has left me with double vision, queasy stomach, and a staggering gait. (And it's not even Saturday morning...) Yet I find myself coming back for more.

If you're new to NoSQL, you'll want to do a bit of background reading. I'll keep this quick and limit my recommendations to just the essentials:

1. The Amazon Dynamo paper is classic. Almost everyone in the NoSQL world has read this paper.

2. Google's Bigtable paper. Again, very widely read.

3. Werner Vogels's "Eventually Consistent" (originally published in ACM Queue) is absolutely the one article you should read if you're not clear on the rationale behind "eventual consistency."

4. Brewer's CAP Theorem (a foundational bit of scalability theory) is well-explained here. Also see Brewer's original slides from his famous July 2000 PODC keynote.

5. The slideshows from the June 11, 2009 NoSQL meetup in SFO bring to mind adjectives like classic, influential, seminal, pivotal, memorable. Ignore these decks at your peril.

6. SQL Databases Don't Scale is short, basic, and to-the-point. Essential background info if you're not already a battle-scarred DBA with scalability wounds.

7. For a tabular overview of major distributed databases and how they compare with each other, see NoSQL Ecosystem by Jonathan Ellis. A similar effort is the Quick Reference to Alternative data storages page. Ellis's post is noteworthy for its clueful, concise, helpful narrative (in addition to the tables). The Quick Reference page is mainly tables -- but the tables are more complete than Ellis's.


Other Essential Resources

http://nosql-databases.org/ -- This site bills itself as "Your Ultimate Guide to the Non-Relational Universe!", and also self-assuredly calls itself "the biggest nosql link archiv in the web." It's worth knowing about, certainly.

IMHO, all fully conformant NoSQL geeks MUST follow @nosqlupdate on Twitter.

Conformant geeks SHOULD follow @al3xandru (creator of the excellent MyNoSQL blog and NoSQL Week in Review). NoSQL Week in Review is new. I'm hoping it will be updated regularly. It's excellent.

You MAY want to read recent blog posts by Ricky Ho that aptly summarize key aspects of distributed data-store technology. Two noteworthy examples: Query Processing for NoSQL Databases, and his widely read NoSQL Design Patterns post.

That SHOULD be enough to get you started. ;)

Hadoop and Solr popularity continue to scale well

Blue lines: Hadoop
Red lines: Solr




A quick check of Google Trends shows that Apache Solr (the search server based on Lucene) and Hadoop (the open-source implementation of MapReduce) are popular query terms -- and becoming more popular by the day. (For links to the news stories labelled with flags 'A', 'B', 'C', etc., go to this Google Trends page.)

Likewise: Job trend data from Indeed.com leaves little doubt that Hadoop and Solr skills are increasingly in demand:
Bottom line? If you're a developer, enrichening your Hadoop and/or Lucene+Solr skills can only be considered a good investment.

Wednesday, December 02, 2009

Will HTML5 be SQL-free?

The Los Angeles Times story about Google deprecating Gears in favor of "HTML5" got quite a bit of attention in the Twitterverse yesterday, and has the blogosphere abuzz now, as well.

There are several interesting aspects to the story. One is that the name Microsoft doesn't come up at all. Instead, Apple figures rather prominently in the Times story. In fact, the Times's depiction of Google, Apple, and W3C deciding the fate of the post-2.0 Web evokes images of the Big Three debating Europe's postwar reorganization at the Yalta Conference. One gets the (fanciful) impression that Microsoft's future is, to some extent, being decided without anyone from Redmond being present. Of course, that's not quite true. ;)

Another interesting aspect of the Times story is that it talks about HTML5 wrapping the various technologies that will (ostensibly, soon) make Gears superfluous, when technically speaking, many of the functionalities being attributed to HTML5 in the Times story are, in fact, not part of the HTML5 specification at all. They are part of various other WebApps Working Group specs.

Be that as it may, the decision facing the browser-makers at this point is what kind of offline storage to use for browser-mediated web apps. Specifically, will the underlying store support SQL, or not?

This is (trust me) a Huge Hairy Issue -- HHI(tm) -- and don't let the Times or anybody else tell you otherwise: It's far from being settled yet.

HTML5 talks about SQL quite openly. And it appears Opera, Safari, and (soon) Chrome are implementing WebDB, which is a SQL database in the spirit of the (emerging) Web SQL Database spec. But that's not to say WebDB is a traditional SQL database. It implements SQLite, which is another beast entirely.

Know well, though, not everyone wants SQLite -- or SQL, for that matter. In fact,
Microsoft's Adrian Bateman has stated that Redmond probably will not go that route. In a WebApp WG teleconference, Bateman said:
Microsoft's position is that WebSimpleDB is what we'd like to see
... we don't think we'll reasonably be able to ship an interoperable version of WebDB
... trying to arrive at an interoperable version of SQL will be too hard
WebSimpleDB, also known as the Nikunj proposal (in deference to the author, Nikunj R. Mehta, of Oracle Corporation), proposes a key-value store of the NoSQL variety. And interestingly enough, this approach is getting serious consideration not only from Microsoft but from Mozilla as well. (In the aforementioned teleconference, Mozilla's Jonas Sicking said: "We’ve talked to a lot of developers, the feedback we got is that we really don’t want SQL...")

It's too early to know how it will all play out. About the only thing that's certain at this point is that Google has (thankfully) decided it's more important to back-burner proprietary approaches to web-app infrastructure than to stay on board with mainstream industry standards, even if those standards are (in some cases) still quite fluid and ill-formed. One hopes Microsoft will learn this lesson too. Otherwise? Yalta will decide.


Where Google's power goes


Ever wonder where all the electrical power ends up being used in a Google data center? This is the approximate breakout according to a recent book by Google engineers Luiz André Barroso and Urs Hölzle‌.

Tuesday, December 01, 2009

Unexpected relationship between hard-drive life and temperature


Today, I was reading Failure Trends in a Large Disk Drive Population [PDF], a February 2007 paper by Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso of Google, containing lots of great data on hard-drive failures and the difficulty of predicting same. The above graph depicts one of the more interesting findings, which is that the effect of operating temperature on disk reliability appears to vary with disk age, such that younger drives tend to be more susceptible to low-temperature failures, whereas older drives tend to be more susceptible to failure at elevated temperatures. Results are grouped by age of disk at failure, then broken out into subgroups (histogram bars) based on their operating temperature. So for example, among disks that failed at 3 years of age, the Annualized Failure Rate (AFR) was about 15% for those disks that had had operating temps of 45 deg. C or more, versus a fail rate of 5% for those that had seen temps of less than 30 deg. C.

Many people have assumed that high temps are bad for disks. And indeed maybe they are bad for 3-year-old disks, but disks that fail at younger ages tend to be much more traumatized by cold than by heat. Pinheiro et al. give additional data for this, and it's pretty convincing. For example, if you have a look at Fig. 4 of the paper, you'll see a bathtub curve, showing that extremes of temperature are deleterious to disk life expectancy. Ironically, the bathtub curve reaches its lowest point at around 38 deg. C -- very close to human body temperature.

Wednesday, November 25, 2009

Image editing in Firefox via Jetpack and Pixlr

Jetpack Menu API Tutorial from Aza Raskin on Vimeo.


I have to admit this is pretty cool from a couple of standpoints. First, an easy Menu API is something Jetpack has needed from the beginning, and this API strikes me as 100% spot-on: logical, intuitive, powerful. Secondly, Pixlr itself is just kick-ass. (That's my nomination for understatement of the year.) And the measly 14 lines of integration code needed to get Pixlr operating on a web image from a right-mouse menu command is (dare I say) one of the nicer parlor tricks I've seen this year. In The Year of the Parlor Trick, that's saying something.

Tuesday, November 24, 2009

Hadoop for Bioinformatics

Protein Alignment - Paul Brown from Cloudera on Vimeo.


About 15 minutes into this video, there's an interesting 3D visualization of a running Hadoop job, showing processor nodes as cubes in a spinning pyramid: green nodes are working normally; a node turns black and falls down to the bottom, signalling a failed job on that processor. I thought it was an interesting visualization. But I also found the presentation interesting overall, since I studied molecular biology in grad school and have an interest in bioinformatics. Beyond that, I have an interest, lately, in all things related to scalability. (Let that be a hint of things to come in future blog posts!)

Friday, November 20, 2009

SixthSense Augmented Reality R&D

Thursday, November 19, 2009

SalsaDev

Stop Searching: Find! from salsadev on Vimeo.


The killer UI experience here is:
  1. Highlight an arbitrary piece of content on a Web page. (Select some text in your browser window.)
  2. Let go of the mouse.
  3. A panel appears automagically, containing contextually appropriate webfinds.
The user has essentially created a "query" through the simple gesture of click-dragging across some text on a page.

This ought to give anyone in the Search business an awful lot to think about.