Tuesday, November 27, 2012

Closing the Understanding Gap


**** Looking BACK to 2012 ****


Yesterday I showed how our ability to understand new sources of data is being hampered by a lack of programming and analysis resources. Today, we start looking at solutions to close the understanding gap.




Borrowing from my colleagues who work in artificial intelligence and machine learning, I find a set of curves that are remarkably similar to those shown previously to contrast the growth of data with our ability to process it.

Consider the effort required to successively correlate data when you have correct attributes (green) versus different or conflicting attributes (red). It becomes easier to confirm you have the same entity as you successively acquire more consistent and correct attributes. This type of logical funnelling occurs as the number of attribute variations ultimately converges. Consider a web search for an old acquaintance. Compare the search results for their name alone with the results after you add keyword terms like city and employer. You can often zero in on an answer that is consistent and correct in a few clicks.
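The funnelling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the attribute names, and the `narrow` helper are all invented for the example, and exact string matching stands in for whatever matching a real search engine would do.

```python
# Hypothetical candidate records; every name and value here is invented.
CANDIDATES = [
    {"name": "bill smith", "city": "ottawa", "employer": "acme"},
    {"name": "bill smith", "city": "ottawa", "employer": "globex"},
    {"name": "bill smith", "city": "toronto", "employer": "acme"},
    {"name": "bill smith", "city": "boston", "employer": "initech"},
    {"name": "ann jones", "city": "ottawa", "employer": "acme"},
]


def narrow(candidates, known_attributes):
    """Apply each known attribute in turn and record how many candidates survive."""
    remaining = list(candidates)
    counts = []
    for key, value in known_attributes:
        # Keep only the candidates that agree with the attribute we trust.
        remaining = [c for c in remaining if c.get(key) == value]
        counts.append(len(remaining))
    return remaining, counts


matches, counts = narrow(
    CANDIDATES,
    [("name", "bill smith"), ("city", "ottawa"), ("employer", "acme")],
)
print(counts)   # [4, 2, 1]
print(matches)  # the single record consistent with all three attributes
```

Each consistent attribute shrinks the candidate set, which is the converging (green) behaviour: the name alone leaves four candidates, adding the city leaves two, and adding the employer leaves one.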

With incorrect attributes, the opposite behavior is common. Conflicting attributes (or no attributes!) lead you to successively higher numbers of conflicting elements. Think of a web search for 'bill smith'. What's the likelihood you will use the "I'm feeling lucky" button?

Banks rely on this "incorrect assertion" behaviour when they ask you security questions like "where were you born?", "what's your mother's maiden name?" and "when did you graduate from university?". They assume a Google search will yield conflicting results for any person's name. Further drilling on attributes will lead to more inconsistent attributes that are not easily resolved.

BIG DATA AFFINITY TIP #1

Get it right early! When associating data elements by key, name or other semantic attribute, investing in a disambiguation algorithm that gets a better answer early will significantly reduce the amount of processing and increase the likelihood of converging on a correct result.
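One way to read this tip is that the order in which attributes are applied matters: a discriminating attribute applied first leaves far fewer records for the later steps to examine. The sketch below is an assumption-laden illustration, not a real disambiguation algorithm. The synthetic records, the attribute cardinalities (2 names, 5 cities, 7 employers), and the `comparisons` helper are all made up for the example.

```python
# Synthetic records: 'name' is a weak discriminator (2 values),
# 'city' is moderate (5 values), 'employer' is the strongest (7 values).
# All values are invented for illustration.
RECORDS = [
    {"name": f"n{i % 2}", "city": f"c{i % 5}", "employer": f"e{i % 7}"}
    for i in range(700)
]
TARGET = {"name": "n0", "city": "c0", "employer": "e0"}


def comparisons(records, attribute_order, target):
    """Filter by each attribute in order; count one comparison per record examined."""
    remaining = list(records)
    total = 0
    for key in attribute_order:
        total += len(remaining)
        remaining = [r for r in remaining if r[key] == target[key]]
    return total, remaining


weak_first, result_a = comparisons(RECORDS, ["name", "city", "employer"], TARGET)
strong_first, result_b = comparisons(RECORDS, ["employer", "city", "name"], TARGET)

print(weak_first)    # 1120 comparisons
print(strong_first)  # 820 comparisons
print(len(result_a), len(result_b))  # 10 10
```

Both orders reach the same 10 records, but starting with the most discriminating attribute does it with fewer comparisons. On real data the gap depends on how selective each attribute actually is.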
