Thursday, December 6, 2012

Elephant in the Room

**** Looking BACK to 2012 ****


Full disclosure: Earlier in my career I developed two different Real-Time Operating System (RTOS) kernels. I worked with many others as well, including my all-time favorite RTOS, QNX. When I look at a modern cell phone, tablet or laptop, I can't help but think of the real-time events, signals and messages that converge to provide the great user experiences we know and love today.

Which brings me to elephants. Hadoop is a marvellous slice of technology. It implements, among other things:
  • A fully distributed, fault-tolerant file system known as the Hadoop Distributed File System (HDFS).
  • An environment for distributing code over a network and then executing that code efficiently.
  • An elegant parallel programming framework called MapReduce.
  • Support for many different programming languages, including a boatload of utility functions to make MPP (massively parallel processing) programming easier.
  • A wide choice of data input, output, transfer and serialization options.
  • A robust and active open source development community.
What concerns me about Hadoop is its lack of real-time focus. Code and data elements take time to migrate through the Hadoop deployment environment, which means a Hadoop cluster first needs to warm up when starting an analysis. Data is collected, sharded (partitioned) and then processed. Processing code is loaded, pushed out to available processors and then executed. Results are then collected and piped to a final destination. Having lots of processors and bandwidth is great, but Hadoop can be slow to get up to speed.
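To make that pipeline concrete for the Big Data Novice, here is a minimal sketch of the map → shuffle → reduce phases in plain Python. The function names are my own illustrations, not Hadoop APIs; a real cluster distributes each phase across machines, which is exactly where the warm-up cost comes from.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input record."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a final count."""
    return {word: sum(counts) for word, counts in groups.items()}

# A tiny batch of input "documents"
docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # → 3
```

In Hadoop the same three steps run only after input is loaded into HDFS and the job is scheduled out to the cluster, which is the batch latency I'm describing above.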

This brings me full-circle to stream programming. The ability to map, execute, reduce and analyze data right now has a very bright future. Here's what I think will happen:
  • Java and C++ will support real-time and other distributed functions natively.
  • New 4GL languages will emerge to leverage distributed technology. Coding the same high-level functions over and over is tedious. Vendors like Aster Data, Algorithmics and SPSS are building out these languages today.
  • Data models will continue evolving to make real-time distributed access to data easier.  It's not about normalization.  It's about speed.
  • Hadoop lives! I see it moving toward building dynamic data marts that take advantage of its great throughput characteristics.
  • Businesses will eventually demand real-time streams so their users can get answers now.
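To preview what I mean by answers now, here is a minimal stream-style counterpart to the batch example: it is a plain-Python sketch of my own (no streaming framework implied), updating its counts incrementally as each record arrives instead of waiting for a whole batch to load.

```python
from collections import Counter

def stream_word_counts(records):
    """Consume records one at a time, yielding an up-to-date count after
    each arrival -- results are available immediately, not after a batch."""
    counts = Counter()
    for record in records:
        counts.update(record.lower().split())
        yield dict(counts)

# Simulated stream of arriving records
stream = iter(["error disk full", "info ok", "error timeout"])
for snapshot in stream_word_counts(stream):
    print(snapshot)  # each snapshot reflects everything seen so far
```

After the first record you already have a usable answer; a batch job would still be loading data.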

Coming Soon: I will provide some stream programming examples for the Big Data Novice (i.e., no RTOS experience necessary).



