Monday, December 10, 2012

Unstructured Text End Game

**** Looking BACK to 2012 ****

A long-standing adage about corporate data sources is: "80% of Business Data is Unstructured". This quote has been referenced so many times that it is hard to find the original attribution. Some say IBM quoted this number first in conjunction with the initial release of DB2 in 1983. Others say Merrill Lynch in 1998. IDC, Gartner and Forrester quote this number so often, I suspect it is etched on their office walls.

Do a Google search for "big data unstructured" and you will find that text data is the key to solving all your big data woes.

I certainly agree that managing unstructured data is essential.  But is that all there is?

Big data is about correlating vast amounts of information. Up to 95% of business data is locked in data silos, making it inaccessible to other business segments. Security is a concern, but incompatible data keys remain the most common barrier.

Governments report to their constituents as frequently as an SEC-regulated company reports to its shareholders. Infrastructure details, population statistics, transportation data and demographics are examples of the valuable information provided by various levels of the public sector. After asking over 100 partners and customers, I found only one who uses government data regularly. In this small sample, 99% of users didn't leverage government data.  Most admitted they had never even thought of it.

Next Time: Leveraging structured data

Thursday, December 6, 2012

Elephant in the Room

**** Looking BACK to 2012 ****

Full disclosure: Earlier in my career I developed two different Real-Time Operating System (RTOS) kernels. I worked with many others, including my all time favorite RTOS called QNX. When I look at a modern cell phone, tablet or laptop, I can't help but think of the real-time events, signals and messages that converge to provide the great user experiences we know and love today.

Which brings me to elephants. Hadoop is a marvellous slice of technology. It implements, among other things:
  • A fully distributed fault tolerant file system known as the Hadoop File System (HDFS).
  • An environment for distributing code over a network and then executing that code efficiently.
  • An elegant parallel programming framework called MapReduce.
  • Support for many different programming languages including a boat load of utility functions to make MPP programming easier.
  • A wide choice of data input, output, transfer and serialization options.
  • A robust and active open source development community.
What concerns me about Hadoop is its lack of real-time focus. Code and data elements take time to migrate through the Hadoop deployment environment. This means a Hadoop cluster needs to warm up before starting an analysis. Data is collected, sharded (partitioned) and then processed. Processing code is loaded, pushed out to available processors and then executed. Results are then collected and piped to a final destination. Having lots of processors and bandwidth is great, but Hadoop can be slow to get up to speed.
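To make that pipeline concrete, here is the map, shuffle, reduce flow sketched in plain Python. This is a toy illustration of the programming model only - the shard data and function names are made up, and it is not Hadoop's actual API:

```python
from collections import defaultdict

# Toy sharded input standing in for data spread across HDFS (made-up data).
shards = [
    ["big data is big", "streams move fast"],
    ["big answers now"],
]

def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in a shard."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle step: group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Run the three phases over all shards and collect the final result.
mapped = [pair for shard in shards for pair in map_phase(shard)]
counts = reduce_phase(shuffle(mapped))
```

A real Hadoop job distributes each phase across machines, which is exactly where the warm-up cost comes from; the point here is only the shape of the computation.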

This brings me full circle to stream programming. The ability to map, execute, reduce and analyze data right now has a very bright future.  Here's what I think will happen:
  • Java and C++ will support RTOS and other distributed functions natively.  
  • New 4GL languages will emerge to leverage distributed technology. Coding the same high-level functions over and over is tedious. Vendors like Aster Data, Algorithmics and SPSS are building out these languages today.
  • Data models will continue evolving to make real-time distributed access to data easier.  It's not about normalization.  It's about speed.
  • Hadoop lives! I see it moving toward building dynamic datamarts to take advantage of Hadoop's great throughput characteristics.
  • Businesses will eventually demand real-time streams so their users can get answers now.

Coming Soon: I will provide some stream programming examples for the Big Data Novice (i.e., no RTOS experience necessary).

Wednesday, December 5, 2012

Hadoop to the Rescue

**** Looking BACK to 2012 ****

You can't talk about big data for very long without hearing about Apache Hadoop and its related MapReduce programming model. You often hear about its complex deployment strategies and obtuse programming environment.  What's going on here?

First, a single line summary for both:
  • Hadoop - Software framework (programming libraries) for data-centric parallel applications.
  • MapReduce - Software framework to run large-scale parallel functions.

    Hadoop and MapReduce are fairly easy to get your head around. "Hadoop is where you run applications that use MapReduce libraries to process high volumes of data".
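    One way to see "applications that use MapReduce libraries" concretely is Hadoop Streaming, where the map and reduce steps are small programs that read and write tab-separated lines. A minimal word-count sketch in Python - hypothetical code, since a real job runs each function as a separate script with Hadoop sorting between them:

```python
def mapper(lines):
    """Map step: emit one "word<TAB>1" line per word, Streaming-style."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Reduce step: sum counts per word. Input must arrive sorted by key,
    which Hadoop's shuffle phase guarantees between mapper and reducer."""
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# Simulate the job in-process: Hadoop would run these as separate programs,
# with its own distributed sort standing in for sorted() here.
result = list(reducer(sorted(mapper(["big data is big", "big answers now"]))))
```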

    Some other clarifications:
    • Hadoop is not an operating system. It runs on top of a network of computers running Unix and, more recently, Windows.
    • Hadoop is not a programming language.  While written in Java, it provides its library of functions to popular languages like C++, Ruby and of course Java.
    • Hadoop is not a filesystem. Hadoop is a programming environment that interacts primarily with distributed data stores like the Hadoop File System (HDFS). Hadoop systems need data, so it is common to simply say "Hadoop" when referring to the environment and its filesystem together.
    • Hadoop is not a database. Distributed data stores like Hive, HBase and Cassandra can use a Hadoop infrastructure to provide SQL-like query access over their distributed filesystems.
    • Hadoop is not a solution. In the same way that Windows 8 or database stored procedures are not standalone solutions, neither is Hadoop.
    Hadoop is important. Hadoop helps programmers write distributed applications. Hadoop is cool in many geek circles. But unless you write code for a living, Hadoop is not something you should ever see.

    In late 2012, we are evolving big data analysis solutions to meet the needs of line-of-business users. Tools are still immature. Many require you to write code in the Hadoop world to get answers about your large datasets. Clearly, we need to raise the bar for big data analysis to become mainstream.

    Next Time: Bridging Streams and Hadoop to predict the future of big data computing.

    Tuesday, December 4, 2012

    Streaming Data - Time is Money

    **** Looking BACK to 2012 ****

    Streaming data is rapidly becoming an important piece of the big data puzzle. It encompasses data events produced over short time intervals that live, or have value, for an even shorter period of time. Typically, maximum return is realized before the next event occurs. Streaming analysis turns this information into instantaneous actions that often provide feedback into the analytic algorithm itself.

    Streaming Example

    Consider a stock market trading system that evaluates the risk of each transaction from several points of view. Different market players have a different perspective on each transaction:
    • Regulators need to monitor the number of sell versus buy orders to ensure liquidity. They also need to watch trading block size, historic activity and price spreads to avoid what is now called a flash crash. 
    • Traders need to do their job and execute transactions rapidly. At the same time, their brokerages must guard against human error, system failures and the occasional rogue trader.
    • Banks and financial institutions must also execute rapid trades while maintaining absolute compliance with regulations like reserve requirements. 
    In all of the above cases - and many more - time is a crucial element in the value equation. The time to execute or stop a trade is now. Even one minute is too long to wait for a result.  The value of analysis at the moment a transaction is made cannot be overstated. The cost of fixing bad trades often rises exponentially with time.
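    As a sketch of what "the time to stop a trade is now" looks like in code, here is a sliding-window monitor that flags a sell-order imbalance the moment it appears. The window size and threshold are invented for illustration, not real regulatory values:

```python
from collections import deque

class OrderFlowMonitor:
    """Flags when sell orders dominate a recent time window."""

    def __init__(self, window_seconds=60, max_sell_ratio=0.8):
        self.window = window_seconds
        self.max_sell_ratio = max_sell_ratio
        self.orders = deque()  # (timestamp, side) pairs inside the window

    def observe(self, timestamp, side):
        """Record an order ('buy' or 'sell'); return True if the window's
        sell ratio breaches the threshold at this instant."""
        self.orders.append((timestamp, side))
        # Evict orders that have aged out of the window.
        while self.orders[0][0] < timestamp - self.window:
            self.orders.popleft()
        sells = sum(1 for _, s in self.orders if s == "sell")
        return sells / len(self.orders) > self.max_sell_ratio

monitor = OrderFlowMonitor(window_seconds=60, max_sell_ratio=0.8)
events = [(0, "buy"), (1, "sell"), (2, "sell"),
          (3, "sell"), (4, "sell"), (5, "sell")]
alerts = [monitor.observe(t, side) for t, side in events]
# Only the sixth order tips the window past the 80% sell threshold.
```

    The alert fires as the triggering order arrives, rather than minutes later in a batch report - which is the whole point of streaming analysis.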

    Next Time: How can a typical business benefit from this type of analysis?  

    Monday, December 3, 2012

    Realtime Matters

    **** Looking BACK to 2012 ****

    Data streams and real-time analysis are new disciplines inspired by big data.

    But what about businesses?  They don't need real-time processing, right?

    Business users are moving to mobile platforms faster than almost any other segment. Desktop analysis and data correlation require many iterations to get things right.

    No one wants to wait.  And that's why adopting real-time methodologies - the desire to get results now - is the most important trend in business-oriented big data.

    This week: we will look at methods for getting answers faster and more accurately than we ever have.

    Friday, November 30, 2012

    Friday (Off) Topic - Burning Platform

    A long held strategy for dealing with institutional resistance to change is: Create a Burning Platform.  

    The idea is: People resist change.  Staying where they are is generally easier than moving from the status quo.

    A burning platform is a way to challenge people.  In business, it says a market, product or way of doing things has evolved to the point that staying where you are is the least comfortable place to be.  When a platform becomes hot enough, it is easier to jump or embrace change than live with the status quo.

    Established businesses often land on burning platforms. Newspapers, travel agents, libraries, print magazines, and broadcast television have all had to "jump" to meet the challenges of their changing markets.

    Big data creates a large number of burning platforms. We will review many of them in the weeks to come. In the meantime, consider defining your own burning platforms to institute change around you.

    Bonus Tip #1

    Here's a burning platform that exists in most organizations:
        "If we don't do this now, then <insert competitor name here> will!

    This line of reasoning motivates colleagues just about every time - and more often than not, it is also true.

    Wednesday, November 28, 2012

    Volume Strategies

    **** Looking BACK to 2012 ****

    Working with large data volumes is never easy.  Here are three strategies that can help you.
    1. Keep your eye on context. Where data comes from, who produced it and when it was created are examples of attributes that assist many types of analysis. Traditional Extract, Transform and Load (ETL) systems take pride in their ability to remove some of these noisy attributes. But any data attribute could eventually be the key to unlocking hidden data correlations. Expect to add data as you process incoming streams, not remove it.
    2. Some attributes are better than others! Time and location are examples of broad filters that can be used to reduce datasets to more manageable numbers that can be subsequently analyzed deeper than is possible across a larger domain. In terms of volume, think reduce first with broad filters and then apply deep analysis with as many attributes as you can tolerate.
    3. Big data analysis gets better the more you work it. You start with a good hypothesis. By retesting your analysis with both supporting and conflicting data, you arrive at a more complete solution.
    Let's look at a very simple example that pulls all these ideas together.

    You look at a set of customer feedback comments. Attributes include rating, date, product, comments, and author name. Assume you have many hundreds of thousands of items and you are keeping all attributes.

    • The first filter you apply is by date, to remove comments about obsolete products. This reduces your data to thousands of items. A good broad filter gets you down to workable datasets fast.
    • You look at low ratings first. Scanning product names and authors, you see a rather wide range. But on a hunch, you sort the comment text. You see 700 comments are the same (you use a Bayesian filter to group text that is identical apart from minor differences). You have identified an ad hoc discriminator: identical comments from different people about different products likely come from the same source and need to be discounted.
    • Finally you look at the referenced product name. You apply a simple text extractor to look for product names you care about. A comparison of remaining items shows that 50% of comments refer to products that differ from the product name attribute associated with the text. You quickly infer that there could be a problem with the ETL process.  This simple revisiting of data shows how your ETL may need some work.  
    Just remember: this scenario is simple. More tools and techniques are needed to work with large datasets.  
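    The broad-filter-then-deep-analysis flow above can be sketched in a few lines. The records and field names are hypothetical, and exact string matching stands in for the Bayesian near-duplicate filter:

```python
from collections import Counter
from datetime import date

# Hypothetical feedback records (a real set would have hundreds of thousands).
feedback = [
    {"date": date(2012, 11, 1), "rating": 1, "product": "X100",
     "author": "alice", "comment": "stopped working after a week"},
    {"date": date(2012, 11, 2), "rating": 1, "product": "X200",
     "author": "bob", "comment": "stopped working after a week"},
    {"date": date(2009, 5, 5), "rating": 2, "product": "OLD1",
     "author": "carol", "comment": "too slow"},
]

# 1. Broad filter first: drop comments about obsolete products by date.
recent = [r for r in feedback if r["date"] >= date(2012, 1, 1)]

# 2. Deep analysis second: identical comments from different authors about
#    different products likely share a single source, so flag them.
comment_counts = Counter(r["comment"] for r in recent)
suspect = [r for r in recent if comment_counts[r["comment"]] > 1]
```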

    Stay tuned :)

    Tuesday, November 27, 2012

    Closing the Understanding Gap

    **** Looking BACK to 2012 ****

    Yesterday I showed how our ability to understand new sources of data is being hampered by a lack of programming and analysis resources.  Today, we start looking at solutions to close the understanding gap.

    Borrowing from my colleagues who work with artificial intelligence and machine learning, we see a set of curves that are remarkably similar to those shown previously to contrast the growth of data with our ability to process it.

    Consider the effort required to successively correlate data when you have correct (green) versus different or conflicting attributes (red).  It becomes easier to confirm you have the same entity as you successively acquire more consistent and correct attributes.  This type of logical funnelling occurs as the number of attribute variations ultimately converge. Consider a web search for an old acquaintance. Compare the search results for their name alone and then after you add keyword terms like city and employer. You can often zero in on an answer that is consistent and correct in a few clicks.

    With incorrect attributes, the opposite behavior is common. Conflicting attributes (or no attributes!)  lead you to successively higher numbers of conflicting elements. Think of a web search for 'bill smith'.  What's the likelihood you will use the "I'm feeling lucky" button?  

    Banks rely on this "incorrect assertion" behaviour when they ask you security questions like "where were you born?", "what's your mother's maiden name?" and "when did you graduate from university?".  They assume a Google search will yield conflicting results for any person's name.  Further drilling on attributes will lead to more inconsistent attributes that are not easily resolved.


    Get it right early!  When associating data elements by key, name or other semantic attribute, investing in a disambiguation algorithm that gets a better answer early will significantly reduce the amount of processing and increase the likelihood of converging on a correct result.
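    A disambiguation algorithm can be as simple as a weighted attribute-agreement score, where matching attributes add evidence and conflicting ones subtract it. The records, attributes and weights below are all invented for illustration:

```python
def match_score(candidate, reference, weights):
    """Score how likely two records describe the same entity.
    Agreeing attributes add their weight; conflicting ones subtract it."""
    score = 0.0
    for attr, weight in weights.items():
        if attr in candidate and attr in reference:
            score += weight if candidate[attr] == reference[attr] else -weight
    return score

reference = {"name": "bill smith", "city": "Ottawa", "employer": "Acme"}
weights = {"name": 1.0, "city": 2.0, "employer": 2.0}

# Consistent attributes converge toward a confident match...
converging = match_score(
    {"name": "bill smith", "city": "Ottawa", "employer": "Acme"},
    reference, weights)
# ...while conflicting attributes pile up evidence against one.
conflicting = match_score(
    {"name": "bill smith", "city": "Boston", "employer": "Initech"},
    reference, weights)
```

    Accepting a candidate only when the score clears a high threshold early keeps conflicting attributes from multiplying downstream.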

    Monday, November 26, 2012

    More Data Solved with More Processing

    **** Looking BACK to 2012 ****

    Let's start digging into big data by examining some myths that can impede early adopters.

    A common belief from people with a technical background: More data can be handled by simply using more processors and programs.  That's scalability!

    But let's look a bit closer at data growth patterns with respect to the resources we typically have.

    We have an intuitive understanding of how data continues to accumulate at near-exponential rates (the red curve above).  Please insert your favorite big data volume statistic here.

    A sad reality: The number of programmers, developers and analysts is not growing anywhere near as fast (shown by the green curve above)!

    Our ability to comprehend information is the gap between the two curves. The shocking truth: our understanding is actually shrinking relative to the amount of data available.  Or, in other words, organizations potentially understand less than they used to - and less than many of us expect.
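    The shape of that gap is easy to demonstrate with made-up growth rates (the numbers below are illustrative only, not real statistics):

```python
# Data accumulating near-exponentially vs. analysts growing linearly.
years = range(10)
data_volume = [100 * (1.4 ** y) for y in years]  # ~40% growth per year
analysts = [100 + 10 * y for y in years]         # steady linear hiring

# Data per analyst: the ratio widens every single year.
gap = [d / a for d, a in zip(data_volume, analysts)]
widening = all(b > a for a, b in zip(gap, gap[1:]))
```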

    Tomorrow: We'll discuss a key strategy for bridging this ever growing knowledge gap.

    Friday, November 23, 2012

    Friday (Off) Topic - Shorter Takes Longer

    "If you want me to speak for an hour, I can start now.  If you want me to talk for five minutes, it will take awhile"  -- Paraphrased from Winston Churchill, UK Prime Minister WWII 

    I believe we are losing the art of public speaking.  Web conferencing and PowerPoint are the unintended weapons in this battle. "Let's have a quick one hour meeting (where I will read you my slides)".  Is your head hurting yet?

    You know the drill: You can read slides as well as the presenter.  So... What do you do? You work on your next presentation or read email while the presenter drones on with virtually no one listening.  On a one hour call with many 'listeners', the lost productivity is immense.

    Contrast this scenario with a good TED talk.  20 minutes maximum and EVERYONE gets it.  You may not agree with the speaker but you remember just about everything that was said.  Can you really say that about your last 60 minute teleconference?

    Here are my 4 simple tips for better presentations: 
    1. Tell a story.  Have an introduction, body and compelling ending. People notice when you take them on a journey that ends with a thought provoking conclusion.
    2. Don't just read your slides.  It puts everyone to sleep and gets them doing other things rather than paying attention to you. The best way to avoid this trap: don't make your slides into a document.  Some people will complain because they will be forced to listen. But I think that's a good thing.
    3. Engage your audience.  Ask questions, tell stories and add commentary that is specific to your listeners. Don't change your slides. Just talk about specifics that your audience is interested in.
    4. Be responsible.  A 1 hour presentation to 40 people consumes one week of collective time!  Preparing and presenting good content is one of the best ways to get maximum return on this investment of human capital.  The common time slot is often an hour. Try using half that. Trust me. No one will complain that you finished early with a compelling and concise presentation.  In fact, they will love you for it! 

    About Friday (Off) Topics: To celebrate TGIF, I dedicate Fridays to random topics that I hope you are interested in.

    Thursday, November 22, 2012

    Big Data Types

    **** Looking BACK to 2012 ****

    We intuitively know Big Data is about new data sources.  It is data that comes from new places, in many formats.

    Here are the types of information often mentioned:
    • Transactional - HIGH VOLUME operational data, often in its pre-warehoused state.  This data runs the business but doesn't necessarily roll up into summary form without a fair bit of help (i.e., lots of cleansing, normalizing and correlating).
    • Machine Data - FAST MOVING real-time data from automated sources.  Often very messy.  Can be difficult to relate to warehouse data without plenty of extra semantic muscle.
    • Social Data - The INFINITELY VARIABLE source of all knowledge - and often the source of nothing at all.  Finding the relevant needles in this giant haystack is challenging.
    • Enterprise - THE VALUABLE STUFF THAT RUNS YOUR BUSINESS.  Many of my colleagues say this data has 'veracity'.  I just say, 'This is the data that business trusts'.  
    Next week I want to start talking about data volume, velocity, variety, and value.  I'll explode a few myths - and have some fun too.

    Tomorrow will be (Off) Topic Friday - where I raise an unrelated but hopefully enlightening topic to celebrate TGIF.

    Wednesday, November 21, 2012

    What You Know + What You Need

    **** Looking BACK to 2012 ****

    SmarterData is about associating the data you know and trust with the wide range of other sources available outside your current line of business. Finding alternate sources is not very difficult. Cutting through the noise to get the best correlation with the data you trust is the hard part.

    Be prepared to do a couple of things:
    • Iterate. Iterate. Iterate. Your first try is never your best.  Expect to learn as you go.
    • Understand the value of your existing data. Curated, IT-cleansed data is sometimes hard to get. Yet it tends to reflect how your business is run.

      If your product delivery times are slow, your IT reports will likely have an associated metric. If your customers are unhappy about it, you will often need to look outside your current data sources to find out in a timely manner.

      Don't wait for your IT reports to show dissatisfaction through a lower sales metric.  Find out now, so you can do something about it. 

    Tuesday, November 20, 2012

    Welcome to SmarterData

    **** Looking BACK to 2012 ****

    I am unusually excited about Big Data. It is not your father's SQL. Big Data will revolutionize the way we use Business Analytics and do reporting.

    We had an old adage in the Business Intelligence world. We claimed that data warehouses provided the "single version of the truth". I have always been a little uncomfortable with that assertion. Today we are beginning to learn how naive that ideal was.

    Fast forward to 2012. I want to show how Big Data helps you find the "most widely supported version of the truth".  That sounds a whole lot more convincing.

    Over the next little while, I will share thoughts and insights that you won't find anywhere else.  Let's see if we can change the world, one large dataset at a time.