SmarterData.ca: December 2012

Monday, December 10, 2012

Unstructured Text End Game

**** Looking BACK to 2012 ****

A long-standing adage about corporate data sources is: "80% of business data is unstructured." This quote has been referenced so many times that it is hard to find the original attribution. Some say IBM quoted this number first in conjunction with the initial release of DB2 in 1983. Others say Merrill Lynch in 1998. IDC, Gartner and Forrester quote this number so often, I suspect it is etched on their office walls.

Do a Google search for "big data unstructured" and you will find that text data is the key to solving all your big data woes.

I certainly agree that managing unstructured data is essential. But is that all there is?

Big data is about correlating vast amounts of information. Up to 95% of business data is locked in data silos, making it inaccessible to other business segments. Security is a concern, but incompatible data keys remain the most common barrier.
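
To make the data-key barrier concrete, here is a minimal sketch in Python. It assumes two hypothetical silos (`crm` and `billing`) that identify the same customers with differently formatted keys, and a hypothetical `normalize_key` rule. Real key reconciliation usually takes far more than string cleanup; this only illustrates the idea.

```python
import re

def normalize_key(raw: str) -> str:
    """Reduce a customer key to a canonical form: digits only, no leading zeros.

    This rule is an assumption made for illustration, not a general standard.
    """
    digits = re.sub(r"\D", "", raw)
    return digits.lstrip("0") or "0"

# Two hypothetical silos holding the same customers under different key formats.
crm = {"CUST-00123": "Acme Corp", "CUST-00456": "Globex"}
billing = {"123": 1500.00, "0456": 320.50, "789": 75.25}

def correlate(names: dict, amounts: dict) -> dict:
    """Join the two silos on the normalized key (an inner join)."""
    by_key = {normalize_key(k): v for k, v in names.items()}
    joined = {}
    for raw_key, amount in amounts.items():
        key = normalize_key(raw_key)
        if key in by_key:
            joined[by_key[key]] = amount
    return joined

result = correlate(crm, billing)
print(result)
```

Without the normalization step, none of the raw keys match and the join comes back empty, which is the silo problem in miniature.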

Governments report to constituents as frequently as an SEC-registered company. Infrastructure details, population statistics, transportation data and demographics are examples of the valuable information provided by various levels of government. After asking over 100 partners and customers, I found only one who uses government data regularly. In this small sample, over 99% of users didn't leverage government data. Most admitted they had never even thought of it.

Next Time: Leveraging structured data

Thursday, December 6, 2012

Elephant in the Room

Full disclosure: Earlier in my career I developed two different Real-Time Operating System (RTOS) kernels. I worked with many others, including my all-time favorite RTOS, QNX. When I look at a modern cell phone, tablet or laptop, I can't help but think of the real-time events, signals and messages that converge to provide the great user experiences we know and love today.

Which brings me to elephants. Hadoop is a marvellous slice of technology. It implements, among other things:

A fully distributed, fault-tolerant file system known as the Hadoop Distributed File System (HDFS).
An environment for distributing code over a network and then executing that code efficiently.
An elegant parallel programming framework called MapReduce.
Support for many different programming languages, including a boatload of utility functions to make MPP programming easier.
A wide choice of data input, output, transfer and serialization options.
A robust and active open source development community.

What concerns me about Hadoop is its lack of real-time focus. Code and data elements take time to migrate through the Hadoop deployment environment. This means a Hadoop cluster needs to first warm up when starting an analysis. Data is collected, sharded (partitioned) and then processed. Processing code is loaded, pushed out to available processors and then executed. Results are then collected and piped to a final destination. Having lots of processors and bandwidth is great, but Hadoop can be slow to get up to speed.
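
The batch flow described above can be sketched in a few lines of Python. This is an in-process simulation of the shard, map, shuffle and reduce steps using a word count. It is not the Hadoop API, and the function names (`shard`, `map_phase`, `shuffle`, `reduce_phase`) are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def shard(records, n_shards):
    """Partition records round-robin into n_shards lists."""
    shards = [[] for _ in range(n_shards)]
    for i, record in enumerate(records):
        shards[i % n_shards].append(record)
    return shards

def map_phase(shard_records):
    """Emit a (word, 1) pair for every word in one shard."""
    return [(word.lower(), 1) for line in shard_records for word in line.split()]

def shuffle(mapped_pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "data streams", "Big elephants"]
shards = shard(lines, n_shards=2)
mapped = chain.from_iterable(map_phase(s) for s in shards)
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Note that no result appears until every shard has been mapped and shuffled. On a real cluster, each of these steps also crosses the network, which is where the warm-up time comes from.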

This brings me full circle to stream programming. The ability to map, execute, reduce and analyze data right now has a very bright future. Here's what I think will happen:

Java and C++ will support RTOS and other distributed functions natively.
New 4GL languages will emerge to leverage distributed technology. Coding the same high-level functions over and over is tedious. Vendors like Aster Data, Algorithmics and SPSS are building out these languages today.
Data models will continue evolving to make real-time distributed access to data easier. It's not about normalization. It's about speed.
Hadoop lives! I see it moving toward building dynamic data marts to take advantage of Hadoop's great throughput characteristics.
Business will eventually demand real-time streams so its users can get answers now.

Coming Soon: I will provide some stream programming examples for the Big Data Novice (i.e., no RTOS experience necessary).

Wednesday, December 5, 2012

Hadoop to the Rescue

You can't talk about big data for very long without hearing about Apache Hadoop and its related MapReduce programming model. You often hear about its complex deployment strategies and obscure programming environment. What's going on here?

First, a single line summary for both:

Hadoop - Software framework (programming libraries) for data-centric parallel applications.
MapReduce - Software framework to run large-scale parallel functions.

Hadoop and MapReduce are fairly easy to get your head around: "Hadoop is where you run applications that use MapReduce libraries to process high volumes of data."

Some other clarifications:

Hadoop is not an operating system. It runs on top of a network of computers that run Unix and, more recently, Windows.
Hadoop is not a programming language. While written in Java, it provides its library of functions to popular languages like C++, Ruby and of course Java.
Hadoop is not a filesystem. Hadoop is a programming environment that interacts primarily with distributed data stored in a filesystem such as the Hadoop Distributed File System (HDFS). Hadoop systems need data, so it is common to simply say "Hadoop" for both.
Hadoop is not a database. Distributed data stores like Hive and HBase run on a Hadoop infrastructure, others like Cassandra integrate with it, and vendors like Cloudera package Hadoop into distributions. Hive, for example, provides SQL-like query access over data held in distributed filesystems.
Hadoop is not a solution. In the same way that Windows 8 or database stored procedures are not standalone solutions, neither is Hadoop.

Hadoop is important. Hadoop helps programmers write distributed applications. Hadoop is cool in many geek circles. But unless you write code for a living, Hadoop is not something you should ever see.

In late 2012, we are evolving big data analysis solutions to meet the needs of line-of-business users. Tools are still immature. Many require you to write code in the Hadoop world to get answers about your large datasets. Clearly, we need to raise the bar for big data analysis to become mainstream.

Next Time: Bridging Streams and Hadoop to predict the future of big data computing.

Tuesday, December 4, 2012

Streaming Data - Time is Money

Streaming data is rapidly becoming an important piece of the big data puzzle. It encompasses data events produced over short time intervals that live or have value for an even shorter period of time. Typically, maximum return is realized before the next event occurs. Streaming analysis turns this information into instantaneous actions that often provide feedback into the analytic algorithm itself.
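
One way to picture that feedback loop is a detector that scores each event against a running baseline and then folds the event back into the baseline. The sketch below uses an exponentially weighted moving average. The class name `EwmaDetector` and the `alpha` and `threshold` values are assumptions chosen for illustration, not a recommended configuration.

```python
class EwmaDetector:
    """Flag events that deviate from a running baseline, then fold them into it."""

    def __init__(self, alpha=0.2, threshold=0.5):
        self.alpha = alpha          # weight given to the newest event (assumed value)
        self.threshold = threshold  # relative deviation that triggers an alert (assumed value)
        self.baseline = None

    def observe(self, value):
        """Return True if value deviates from the baseline by more than threshold."""
        if self.baseline is None:
            # The first event only seeds the baseline.
            self.baseline = value
            return False
        alert = abs(value - self.baseline) > self.threshold * abs(self.baseline)
        # Feedback step: every event, flagged or not, updates the baseline.
        self.baseline = self.alpha * value + (1 - self.alpha) * self.baseline
        return alert

detector = EwmaDetector()
events = [100.0, 102.0, 98.0, 180.0, 101.0]
alerts = [detector.observe(v) for v in events]
print(alerts)  # [False, False, False, True, False]
```

Each call does a constant amount of work and keeps only one number of state, so the decision is available before the next event arrives.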

Streaming Example

Consider a stock market trading system that evaluates the risk of each transaction from several points of view. Different market players have different perspectives on each transaction:

Regulators need to monitor the number of sell versus buy orders to ensure liquidity. They also need to watch trading block size, historic activity and price spreads to avoid what is now called a flash crash.
Traders need to do their job and execute transactions rapidly. At the same time, their brokerages must guard against human error, system failures and the occasional rogue trader.
Banks and financial institutions must also execute rapid trades while maintaining absolute compliance with regulations like reserve requirements.

In all of the above cases - and many more - time is a crucial element in the value equation. The time to execute or stop a trade is now. Even one minute is too long to wait for a result. The value of analysis at the moment a transaction is made cannot be overstated. The cost of fixing bad trades often rises exponentially with time.
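
As a rough sketch of the regulator's view, the code below tracks the share of sell orders inside a sliding time window and raises a flag the moment a new order pushes that share over a limit. The class name `OrderImbalanceMonitor`, the window length and the `max_sell_ratio` threshold are all hypothetical. A real surveillance system would also weigh block size, price spreads and historic activity.

```python
from collections import deque

class OrderImbalanceMonitor:
    """Track buy and sell orders seen within the last `window_seconds`."""

    def __init__(self, window_seconds=60.0, max_sell_ratio=0.8):
        self.window_seconds = window_seconds
        self.max_sell_ratio = max_sell_ratio  # hypothetical alert threshold
        self.orders = deque()                 # (timestamp, side) pairs, oldest first
        self.sells = 0

    def on_order(self, timestamp, side):
        """Record one order and return True if sells dominate the window."""
        self.orders.append((timestamp, side))
        if side == "sell":
            self.sells += 1
        # Evict orders that have aged out of the window.
        while self.orders and self.orders[0][0] <= timestamp - self.window_seconds:
            _, old_side = self.orders.popleft()
            if old_side == "sell":
                self.sells -= 1
        return self.sells / len(self.orders) > self.max_sell_ratio

monitor = OrderImbalanceMonitor(window_seconds=10.0, max_sell_ratio=0.6)
stream = [(0.0, "buy"), (1.0, "sell"), (2.0, "sell"),
          (3.0, "buy"), (12.0, "sell"), (13.0, "sell")]
flags = [monitor.on_order(t, side) for t, side in stream]
print(flags)  # [False, False, True, False, False, True]
```

The check runs as part of handling each order, so the answer exists at the moment of the transaction rather than after a batch job finishes.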

Next Time: How can a typical business benefit from this type of analysis?

Monday, December 3, 2012

Realtime Matters

Data streams and real-time analysis are new disciplines inspired by big data.

But what about business? They don't need real-time processing, right?

Business users are moving to mobile platforms faster than almost any segment. Desktop analysis and data correlation require many iterations to get things right.

No one wants to wait. That's why adapted real-time methodologies, driven by the desire to get results now, are the most important trend in business-oriented big data.

This week: we will look at methods for getting answers faster and more accurately than we ever have.