Wednesday, December 5, 2012

Hadoop to the Rescue

**** Looking BACK to 2012 ****



You can't talk about big data for very long without hearing about Apache Hadoop and its related MapReduce programming model. You often hear about its complex deployment strategies and abstruse programming environment. What's going on here?

First, a single-line summary for both:
    • Hadoop - Software framework (programming libraries) for data-centric parallel applications.
    • MapReduce - Software framework to run large-scale parallel functions.

    Hadoop and MapReduce are fairly easy to get your head around: "Hadoop is where you run applications that use MapReduce libraries to process high volumes of data."
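
To make the MapReduce idea concrete, here is a minimal sketch in plain Python. It does not use Hadoop at all; it only imitates, in memory on one machine, the three steps a MapReduce job goes through (map, group by key, reduce). The function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for this illustration and are not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `word_count(["big data", "Big Hadoop data"])` returns `{"big": 2, "data": 2, "hadoop": 1}`. On a real cluster the map and reduce functions look much the same; the difference is that the framework runs many copies of them on different machines and handles the grouping step across the network.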

    Some other clarifications:
    • Hadoop is not an operating system. It runs on top of a network of computers that run Unix and, more recently, Windows.
    • Hadoop is not a programming language. Although it is written in Java, it makes its library of functions available to popular languages like C++, Ruby and, of course, Java.
    • Hadoop is not a filesystem. Hadoop is a programming environment that interacts primarily with distributed data stores like the Hadoop Distributed File System (HDFS). Hadoop systems need data, so it is common to simply say Hadoop
    • Hadoop is not a database. Data stores and query engines like Hive, Cassandra and Cloudera's Impala build on or integrate with a Hadoop infrastructure to provide SQL-like query access over distributed data.
    • Hadoop is not a solution. In the same way that Windows 8 or database stored procedures are not standalone solutions, neither is Hadoop.
    Hadoop is important. Hadoop helps programmers write distributed applications. Hadoop is cool in many geek circles. But unless you write code for a living, Hadoop is not something you should ever see.
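
For readers who do write code for a living, here is a rough sketch of how a non-Java language plugs in. Hadoop Streaming runs any executable that reads lines from standard input and writes lines to standard output, with a tab separating key from value by default, and it sorts the mapper output by key before handing it to the reducer. The functions below follow that text protocol but work on plain Python iterables so they can be tried without a cluster. The log format (status code as the last field) and the names `mapper`, `reducer` and `run_locally` are assumptions made up for this example.

```python
from itertools import groupby


def mapper(lines):
    """Turn each log line into a 'status<TAB>1' record."""
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            yield f"{fields[-1]}\t1"


def reducer(sorted_records):
    """Sum the counts for each key; input must already be sorted by key."""
    keyed = (record.split("\t") for record in sorted_records)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Stand-in for the cluster: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))
```

Running `run_locally(["GET /a 200", "GET /b 404", "GET /c 200"])` gives `["200\t2", "404\t1"]`. In an actual Streaming job the mapper and reducer would be two separate scripts that read `sys.stdin` and print their records, and the sort in the middle would be done by Hadoop rather than by `sorted`.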

    In late 2012, we are evolving big data analysis solutions to meet the needs of line-of-business users. Tools are still immature. Many require you to write code in the Hadoop world to get answers about your large datasets. Clearly, we need to raise the bar for big data analysis to become mainstream.

    Next Time: Bridging Streams and Hadoop to predict the future of big data computing.




    2 comments:

    1. Love the simple explanation of Hadoop and its relationship with databases and file systems, but most of all love your statement that if you don't write code for a living, you should never "see" Hadoop! But as you say, we need to raise the bar so that the results of Hadoop applications can be more easily surfaced to drive business value.

      Great blog as usual!

    2. Mario: I think Hadoop has a bright future - maybe in a different area than expected... And that's today's blog topic :)
