blogorrhea: October 2008

Thursday, October 30, 2008

What's the strangest thing in Java?

There's an interesting discussion going on at TheServerSide.com right now. Someone asked "What’s the strangest thing about the Java platform?"

I can think of a lot of strange things about Java (space precludes a full enumeration here). Offhand, I'd say one of the more disturbing aspects of Java is its ill-behaved (unpredictable) System.gc( ) method.

According to Sun, System.gc( ) is not 100% reliable: "When control returns from the method call, the virtual machine has made its best effort to recycle all discarded objects." Notice the wording ("best effort"). There is absolutely no guarantee that gc() will actually force a garbage collection. This is well known to anybody who has actually tried to use it in anger.

The problem is, in the rare case when you actually do need to use gc(), you really do need it to work (or at least behave in a well-understood, deterministic way). Otherwise you can't make any serious use of it in a mission-critical application. Not to put too fine a point on it, but: If a method is not guaranteed to do what you expect it to do, then it seems to me the method becomes quite dangerous. I don't know about you, but I rely on System calls to work. If you can't rely on a System call, what can you rely on?

Suppose you've written a reentry program for a spacecraft, and you have an absolute need for a particular routine (e.g., to fire retro-rockets) to execute, without interruption, starting at a particular point in time. The spacecraft will be lost and the mission will fail (at a cost to taxpayers of $300 million) if the retro-rockets don't fire on time or don't shut off on time.

Now imagine that just as your program's fireRetroRockets() method is entered, the JVM decides to "stop the world" and do a garbage-collect.

Houston, we have a . . . well, you know.

The point is, if you could call System.gc( ) ahead of time, and count on it doing exactly what you want it to do (collect garbage immediately, so that an uncommanded GC won't happen at the wrong moment), you could save the mission. (Arguably.)

Obviously, this example is somewhat academic. No one in his right mind would actually use Java to program a spacecraft, in real life.

And that, I think, says a great deal about the Java platform.

Wednesday, October 29, 2008

Chaos in query-land

I wrote a micro-rant the other day at CMSWatch.com on the need for an industry-standard syntax for plain-language keyword search. I, for one, am tired of learning a different search syntax for every site I go to. I find myself naively assuming (like an idiot) that every search engine obeys Google syntax. Not true, of course. It's a free-for-all out there. For example, not every search engine "ANDs" keywords together by default. Even at this simple level (a two-keyword search!) users are blindsided by products that behave unpredictably.

At any rate, Lars Trieloff pointed out to me yesterday that Apache Jackrabbit (the Java Content Repository reference implementation, which underpins Apache Sling) implements something called GQL, which is colloquially understood to mean Google Query Language, although in fact it means GQL. It does not implement Google's actual search syntax in comprehensive detail. It merely allows Jackrabbit to support plaintext queries in a Google-like way, so that if you are one of those people (like me) who automatically assumes that any given search widget will honor Google grammar, you won't be disappointed.

It turns out, the source code for GQL.java is remarkably compact, because really it's just a thin linguistic facade over an underlying XPath query facility. GQL.java does nothing more than transliterate your query into XPath. It's pretty neat, though.

I'm all for something like GQL becoming, say, an IETF RFC, so that vendors and web sites can begin implementing (and advertising) support for Google-like syntax. First there will need to be a name change, though. Google already uses "GQL" to describe a SQL-like language used in the Google App Engine. There's also a Graphical Query Language that has nothing to do with Jackrabbit nor Google.

See what I mean? It's chaos out there in query-land.

Tuesday, October 28, 2008

Pixel Bender plug-in for Photoshop

When I first heard about Adobe's Pixel Bender technology, I became very excited. An ActionScript-based pixel shader API? What could be more fun that than? (By now you know what my social life must be like.)

When I saw that PB was a Flash-only technology, my enthusiasm got tamped down a bit. Later, I learned that PB would be supported in After Effects, which had me scratching my chin again. (I've written AE plug-ins before. It's much less punishing than writing Photoshop plug-ins.)

Now it turns out there will be a Pixel Bender plug-in for the next version of Photoshop. According to Adobe's John Nack, "Pixel Bender won't be supported in the box in the next version of Photoshop, but we plan to offer a PB plug-in as a free download when CS4 ships. Therefore it's effectively part of the release."

This is great news for those of us who like to peek and poke pixels but can't be bothered to use the Byzantine C++ based Photoshop SDK.

In case you're wondering what you can do with Pixel Bender, some nice sample images and scripts can be found here. The image shown above was created with this 60-line script.

Nice.

Monday, October 27, 2008

Java 7 gets "New" New I/O package

I've always hated Java I/O with all its convoluted, Rube-Goldbergish special classes with special knowledge of special systems, and the legacy readLine( ) type of garbage that brings back so many bad memories of the Carter years.

With JSR 203 (to be implemented in Java SE 7), we get a new set of future legacy methods. This is Sun's third major attempt in 13 years to get I/O right. And from what I've seen, it doesn't look good. (Examples here.) My main question at this point is where they got that much lipstick.

The main innovation is the new Path object, which seems to be a very slightly more abstract version of File. (This is progress?) You would think any new I/O library these days would make heavy use of URIs, URLs, and Schemes (file:, http:, etc.) and lessons learned in the realization of concepts like REST, AJAX, and dependency injection. No such luck. Instead we have exotic new calls like FileSystem.getRootDirectories()and DirectoryEntry.newSeekableByteChannel(). It's like we've learned nothing at all in the last 20 years.

When I want to do I/O, I want to be able to do something like

dataSrc = new DataGetter( );
dataSrc.setPref( DataGetter.EIGHTBITBYTES );
dataSrc.setPref( DataGetter.SLURPALL );
data = dataSrc.getData( uri );

and be done with it. (And by the way, let me pass a string for the URI, if I want to. Don't make me create a special object.)

I don't want to have to know about newlines, buffering, or file-system obscurata, unless those things are terribly important to me, in which case I want to be able to inject dependencies at will. But don't make me instantiate totally different object types for buffered vs. non-buffered streams, and all the rest. Don't give me a million flavors of special objects. Just let me pass hints into the DataGetter, and let the DataGetter magically grok what I'm trying to do (by making educated guesses, if need be). If I want a special kind of buffering, filtering, encoding, error-handling, etc., let me craft the right cruftball of flags and constants, and I'll pass them to the DataGetter. Otherwise, there should be reasonable defaults for every situation.

I would like a file I/O library that is abstract enough to let me read one bit at a time, if I want; or 6 bits at a time; or 1024 bits, etc. To me, bits are bits. I should be able to hand parse them if I want, in the exact quantities that I want. If I'm doing some special type of data compression and I need to write 13 bits to output, then 3 bits, then 12, then 10, and so on, I should be able to do that with ease and elegance. I shouldn't have to stand on my head or instantiate exotic objects for reading, buffering, filtering, or anything else.

I could write a long series of articles on what's wrong with Java I/O. But I don't look forward to revising that article every few years as each "new" I/O package comes out. Like GUI libraries and 2D graphics, this is something Sun's probably never going to get right. It's an area that begs for intervention by fresh talent, young programmers who are self-taught (not infected by orthodoxies acquired in college courses) and have no understanding at all of legacy file systems, kids whose idea of I/O is HTTP GET. Until people with "beginner's mind" get involved, there's no hope of making Java I/O right.

Friday, October 24, 2008

Enterprise Software Feared Overpriced

I'm being sardonic with that headline, obviously, but I have to agree with Tim Bray, who said in passing the other day: "I just don’t believe that Enterprise Software, as currently priced, has much future, in the near term anyhow."

I take this to mean that the days of the seven-figure software deal (involving IBM, Oracle, EMC, Open Text, etc.) may not exactly be over, but certainly those kinds of sales are going to be vanishingly rare, going forward.

I would take Bray's statement a step further, though. He's speaking to the high cost of enterprise software itself (or at least that's how I interpret his statement). Enterprise systems take a lot of manpower to build and maintain. The budget for a new system rollout tends to break out in such a way that the software itself represents only 10 to 50 percent of the overall cost. In other words, software cost is a relatively minor factor.

Therefore I would extend Bray's comment to say that old-school big-budget Enterprise Software projects involving a cast of thousands, 12 months of development and testing, seven-figure software+services deals, etc., are on the way out. In its place? Existing systems! Legacy systems will be maintained, modified, built out as necessary (and only as necessary) using agile methodologies, high-productivity tools and languages (i.e., scripting), RESTful APIs, and things that make economic sense.

There's no room any more for technologies and systems that aren't provably (and majorly) cost-effective. IBM, Oracle, EMC, listen up: Million-dollar white elephants are on the endangered species list.

Wednesday, October 22, 2008

Lucene site powered by Google!

What's wrong with this picture?

Flash-drive RAID

I stumbled upon the floppy-drive RAID story (see previous blog) as part of a Google search to see if any such thing as a memory stick (Flash-drive) RAID array is available for Vista. No such luck, of course. But there are quite a few blogs and articles on the Web by Linux users who have successfully created ad-hoc Flash RAIDs from commodity USB hubs and memory sticks. (I recommend this June 2008 article from the Linux Gazette and this even more entertaining, not to mention better-illustrated, piece by Daddy Kewl. Definitely do not fail to read the latter!) Linux supports this kind of madness natively.

MacOS is even better for this. Evidently you can plug two sticks into a PowerBook's USB ports and configure them as a RAID array with native MacOS dialogs. (Details here.) How I envy Mac users!

Tuesday, October 21, 2008

Floppy-disk RAID array

This has got to be the funniest thing I've seen all year. And trust me, this has been a funny year.

Daniel Blade Olson, a man after my own heart (even if that phrase doesn't translate well into foreign languages...), has rigged a bunch of floppy drives to form a RAID array. His disturbing writeup is here.

Saturday, October 18, 2008

Fast pixel-averaging

I don't know why it took me so long to realize that there's an easy, fast way to obtain the average of two RGB pixel values. (An RGB pixel is commonly represented as a 32-bit integer. Let's assume the top 4 bits aren't used.)

To ensure proper averaging of red, green, and blue components of two pixels requires parsing those 8-bit values out of each pixel and adding them together, then dividing by two, and crafting a new pixel out of the new red, green, and blue values. Or at least that's the naive way of doing things. In code (I'll show it in JavaScript, but it looks much the same in C or Java):


   // The horribly inefficient naive way:

   function average( a,b ) {

      var REDMASK = 0x00ff0000;
      var GREENMASK = 0x0000ff00;
      var BLUEMASK = 0x000000ff;
      var aRed = a & REDMASK;
      var aGreen = a & GREENMASK;
      var aBlue = a & BLUEMASK;
      var bRed = b & REDMASK;
      var bGreen = b & GREENMASK;
      var bBlue = b & BLUEMASK;

      var aveRed = (aRed + bRed) >> 1;
      var aveGreen = (aGreen + bGreen) >> 1;
      var aveBlue = (aBlue + bBlue) >> 1;
   
      return aveRed | aveGreen | aveBlue;
   }

That's a lot of code to average two 32-bit values, but remember that red, green, and blue values (8 bits each) have to live in their own swim lanes. You can't allow overflow.

Here's the much cleaner, less obvious, hugely faster way:


   // the fast way:

   MASK7BITS = 0x00fefeff; 

   function ave( a,b ) {
      
      a &= MASK7BITS;
      b &= MASK7BITS;  
      return (a+b)>>1;
   }

The key intuition here is that you want to clear the bottom bit of the red and green channels in order to make room for overflow from the green and blue "adds."

Of course, in the real world, you would inline this code rather than use it as a function. (In a loop that's processing 800 x 600 pixels you surely don't want to call a function hundreds of thousands of times.)

Similar mask-based techniques can be used for adding and subtracting pixel values. Overflow is handled differently, though (left as an exercise for the reader).

Friday, October 17, 2008

Loading an iframe programmatically

This is a nasty hack. It's so useful, though. So useful.

Suppose you want to insert a new page (a new window object and DOM document) into your existing page. Not a new XML fragment or subtree on your current page; I'm talking about a whole new page within a page. An iframe, in other words.

The usual drill is to create an <iframe> node using document.createElement( ) and attach it to the current page somewhere. But suppose you want to populate the iframe programmatically. The usual technique is to start building DOM nodes off the iframe's contentDocument node using DOM methods. Okay, that's fine, but it's a lot of drudgery. (I'm sweating already.) At some point you're probably going to start assigning string values to body.innerHTML (or whatever). But then you're into markup-stringification hell. (Is there a JavaScript programmer among us who hasn't frittered away major portions of his or her waking life escaping quotation marks and dealing with line-continuation-after-line-continuation in order to stringify some hellish construction, whether it's a piece of markup or an argument to RegExp( ) or whatever?)

Well. All of that is best left to Internet Explorer programmers. If you're a Mozilla user, you can use E4X as your "get out of stringification-jail FREE" card, and you can use a data URL to load your iframe without passing through DOM hell.

Suppose you want your iframe to contain a small form. First, declare it as an XML literal (which you can do as follows, using E4X):


myPage = <html>
<body>
  <form action="">
     ... a bunch of markup here
  </form>
</body>
</html>;

Now create an iframe to hold it:

   iframe = top.document.createElement( "iframe" );

Now (the fun part...) you just need to populate the iframe, which you can do in one of two ways. You can attach the iframe node to the top.document, then assign myPage.toXMLString() to iframe.contentDocument.body, or (much more fun) you can convert myPage to a data URL and then set the iframe's src attribute to that URL:


    // convert XML object to data URL
 function xmlToDataURL( theXML ) {   

         var preamble = "data:text/html;charset=utf-8,";
         var octetString = escape( theXML.toXMLString( )  );
         return preamble + octetString;
 }

 dataURL = xmlToDataURL( myPage );

 iframe.setAttribute( "src", dataURL ); // load frame

 // attach the iframe to your current page
 top.document.body.insertBefore( iframe ,
       top.document.body.firstChild );

A shameless hack, as I say. It works fine in Firefox, though, even with very large data URLs. I don't recall the exact size limit on data URLs in Mozilla, but I seem to remember that it's megabytes. MSIE, of course, has some wimpy limit like 4096 characters (maybe it's changed in IE8?).

In my opinion, all browsers SHOULD support unlimited-length data URLs, just like they SHOULD support E4X and MUST support JavaScript. Notwithstanding any of this, Microsoft MAY go to hell.

Saturday, October 11, 2008

Russians use graphics card to break WiFi encryption

The same Russians who got in a lot of trouble a few years ago for selling a small program that removes password protection from locked PDF files (I'm talking about the guys at Elcomsoft) are at it again. It seems this time they've used an NVidia graphics card GPU to crack WiFi WPA2 encryption.

They used the graphics card, of course, for sheer number-crunching horsepower. The GeForce 8800 GTX delivers something like 300 gigaflops of crunch, which I find astonishing (yet believable). Until now, I had thought that the most powerful chipset in common household use was the Cell 8-core unit used in the Sony Playstation 3 (which weighs in at 50 to 100 gigaflops). Only 6 of the PS/3's processing units are available to programmers, though, and the Cell architecture is meant for floating-point operations, so for all I know the GeForce 8800 (or its relatives) might be the way to go if you need blazing-fast integer math.

Even so, it would be interesting to know what you could do with, say, an 8-box cluster of overclocked PS/3s. Simulate protein-ribosome interactions on an atom-by-atom basis, perhaps?

Decimal to Hex in JavaScript

There's an easy way to get from decimal to hexadecimal in JavaScript:

  function toHex( n ) { return n.toString( 16 ); }

The string you get back may not look the way you want, though. For example, toHex(256) gives "100", when you're probably wanting "0x0100" or "0x00000100". What you need is front-padding. Just the right amount of front-padding.

// add just the right number of 'ch' characters
// to the front of string to give a new string of
// the desired final length 'dfl'

  function frontPad( string, ch, dfl ) {
     var array = new Array( ++dfl - string.length );
     return array.join( ch ) + string;
  }

Of course, you should ensure that 'dfl' is not smaller than string.length, to prevent a RangeError when allocating the array.

If you're wondering why "++dfl" instead of plain "dfl", stop now to meditate. Or run the code until enlightenment occurs.

At this point you can do:

  function toHex( n ) {
    return "0x" + frontPad( n.toString( 16 ), 0, 8);
  }

  toHex( 256 )  // gives "0x00000100"

If you later need to use this value as a number, no problem. You can apply any numeric operation except addition on it with perfect safety. Addition will be treated as string concatenation whenever any operand is a string (that's the standard JS intepreter behavior), so if you need to do "0x00000100" + 4, you have to cast the hex-string to a number.

  n = toHex( 256 );  // "0x00000100"
  typeof n  // "string"
  isNaN( n )  // false
  x = n * n;  // 65536
  x = n + 256  // "0x00000100256"
  x = Number( n ) + 256   //  512

Wednesday, October 08, 2008

$20 touchscreen, anyone?

Touchless is one of those ideas that's so obvious, yet so cool, that after you hear it, you wonder why someone (such as yourself) didn't think of it ages ago. Aim a webcam at your screen; have software that follows your fingers around; move things around in screen space in response to your finger movements. Voila! Instant touch-screen on the cheap.

Mike Wasserman came up with Touchless as a college project while attending Columbia University. He's now with Microsoft. The source code is free.

Awesome.

Saturday, October 04, 2008

Accidental assignment

People sometimes look at my JavaScript and wonder why there is so much "backwards" notation:

   if ( null == arguments[ 0 ] )
     return "Nothing to do";

   if ( 0 == array.length )
     break;

And so on, instead of putting the null or the zero on the right side of the '==' the way everyone else does.

The answer is, I'm a very fast typist and it's not uncommon for me to type "s" when I meant to type "ss," or "4" when I meant to type "44," or "=" when I meant to type "==".

In JavaScript, if I write the if-clause in the normal (not backwards) way, and I mistakenly type "=" for "==", like so...

   if ( array.length = 0 )
  break;

... then of course I'm going to destroy the contents of the array (because in JavaScript, you can wipe out an array by setting its length to zero) and my application is going to behave strangely or throw an exception somewhere down the line.

This general type of programmer error is what I call "accidental assignment." Note that I refer to it as a programmer error. It is not a syntactical error. The interpreter will be only too happy to assign a value to a variable inside an if-clause, if you tell it to. And it may be quite some time before you are able to locate the "bug" in your program, because at runtime the interpreter will dutifully execute your code without putting messages in the console. If an exception is eventually thrown, it could be in an operation that's a thousand lines of code away from your syntactical blunder.

So the answer is quite simple. If you write the if-clause "backwards," with zero on the left, an accidental assignment will be caught right away by the interpreter, and the resulting console message will tell you the exact line number of the offending code, because you can't assign a value to zero (or to null, or to any other baked-in constant).

In an expression like "null == x" we say that null is not Lvaluable. The terms "l-value" and "r-value" originally meant left-hand value and right-hand value. But when Kernighan and Ritchie created C, the meaning changed, to become more precise. Today an Lvalue is understood to be a locatable value, something that has an address in memory. A compiler will allocate an address for each named variable at compile-time. The value stored in this address (its r-value) is generally not known until runtime. It's impossible, in any case, to refer to an r-value by its address if it hasn't been assigned to an l-value, hence the compiler won't even try to do so and you'll get an error if you try to compile "null = x".

On the other hand, "x = null" is perfectly legal, and in K&R days a C-compiler would obediently compile such a statement whether it was in an if-clause or not. This actually resulted in some horrendously costly errors in the real world, and as a result, today no modern compiler will accept a bare assignment inside an if-clause. (Actually I can think of an exception. But let's save that for another time.) If you really mean to do an assignment inside an if, you must encapsulate it in parentheses.

Not so with JavaScript, a language that (like K&R C) assumes that the programmer knows what he or she is doing. People unwittingly create accidental assignments inside if-clauses all the time. It's not a syntactical error, so the interpreter doesn't complain. Meanwhile you've got a very difficult situation to debug, and the language itself gets blamed. (A poor craftsman always blames his tools.)

As a defensive programming technique, I always put the non-Lvaluable operand on the left side of an equality operator, and that way if I make a typing mistake, the interpreter slaps me in the face at the earliest opportunity rather than spitting in my general direction some time later. It's a defensive programming tactic that has served me well. I'm surprised more people don't do it.

Thursday, October 02, 2008

How this blog looks to Wordle

click for larger version (make your own at wordle)

Wednesday, October 01, 2008

Serialize any POJO to XML

Ever since Java 1.4.2 came out, I've been a big fan of java.beans.XMLEncoder, which lets you serialize runtime objects (including the values of instance variables, etc.) as XML, using just a few lines of code:


   XMLEncoder e = new XMLEncoder(
                      new BufferedOutputStream(
                          new FileOutputStream("Test.xml")));
   e.writeObject(new JButton("Hello, world"));
   e.close();

This is an extraordinarily useful capability. You can create an elaborate Swing dialog (for example) containing dozens of nested widgets, then serialize the whole thing as a single XML file, capturing its state, using XMLEncoder (then deserialize it later, in another time and place, perhaps).

A favorite trick of mine is to serialize an application's key objects ahead of time, then JAR them up and instantiate them at runtime using XMLDecoder. With a Swing dialog, this eliminates a ton of repetitive container.add( someWidget) code, and similar Swing incantations (you know what I'm talking about). So it cleans up your code incredibly. It also makes Swing dialogs (and other objects) declarative in nature; they become static XML that you can edit separately from code, using XML tools. At runtime, of course, you can use DOM and other XML-manipulation technologies to tweak serialized objects before instantiating them. (Let your imagination run.)

As an aside: I am constantly shocked at how many of my Java-programming friends have never heard of this class.

If there's a down side to XMLEncoder, it's that it will only serialize Java beans, or so the documentation says, but actually the documentation is not quite right. (More on that in a moment.) With Swing objects, for example, XMLEncoder will serialize widgets but not any event handlers you've set on them. At runtime, you end up deserializing the Swing object, only to have to hand-decorate it with event handlers before it's usable in your application.

There's a solution for this, and again it's something relatively few Java programmers seem to know anything about. In a nutshell, the answer is to create your own custom persistence delegates. XMLEncoder will call the appropriate persistence delegate when it encounters an object in the XML graph that has a corresponding custom delegate.

This is (need I say?) exceptionally handy, because it provides a transparent, interception-based approach to controlling XMLEncoder's behavior, at a very fine level of control. If you have a Swing dialog that contains 8 different widget classes (some of them possibly containing multiple nested objects), many of which need special treatment at deserialization time, you can configure an XMLEncoder instance to serialize the whole dialog in just the fashion you need.

The nuts and bolts of this are explained in detail in this excellent article by Philip Milne. The article shows how to use custom persistence delegates to make XMLEncoder serialize almost any Java object, not just beans. Suffice it to say, you should read that article if you're as excited about XMLEncoder as I am.