Monday, June 24, 2013

Case Study: "What are Important Big Data Concepts?"

**** Looking BACK to 2013 ****
Much of this stuff is still true :)

The State of Text

Natural Language Processing (NLP) is a core technology used in text analytics to understand the meaning of unparsed text passages. This type of analytics has come a long way in the past decade. Deloitte does a nice job describing its value in Text Analytics - The three minute guide. Just beware: The accuracy and reliability of many text mining technologies diminish as text passages get shorter [Bland et al, Furlan et al]. Problems related to short text samples remain important since social networks, including Twitter, Facebook and Reddit, routinely create text excerpts that are too short to be reliably understood by traditional NLP algorithms [Iorica et al].

I wholeheartedly expect this situation to improve. Statistical text analytics is viable today. And deeper analysis in vertical domains, like healthcare, is quickly becoming a big data success story.  Just set your expectations accordingly.  Betting on text analytics across multiple business domains may leave you wanting in 2013.    

The State of Structured and Semi-Structured Text

Readers of this blog know: The more structured the data, the more I like it. Not because of quantity but because of quality. Data that has been curated and prepared in structured or semi-structured form is routinely superior [Hanig et al]. Balancing your analysis to include all types of data remains a very good practice for the foreseeable future.

Simple Case Study:
"Finding Important Big Data Concepts"

A few months ago, I wanted to build a list of the top ten most useful big data concepts.

Unstructured keyword searches using Google, Bing and WolframAlpha yielded a 'diverse' set of results that described hardware provisioning and Hadoop, along with various solution providers offering services and consulting. Conclusion: These simple search-oriented queries produced somewhat underwhelming results.

Inconclusive search engine results are a common problem that I see when trying to dig into a broad subject area like "big data". To get more definitive results, I built a data set using various 'value proposition' statements from public websites. I pulled data for 50 Big Data startups found through web searches, including lists from various technology sites. The result: concise data that was more interesting too.

Using topic clusters, I was able to build a collection of about 30 concepts. I then reduced that list to a 'top 10 ranking' (a process I will describe in an upcoming post). 
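
To give a feel for how candidate concepts can be pulled out of value proposition text, here is a minimal sketch in Python. It is not the topic clustering used for this post: it swaps in a much simpler technique, counting how many statements mention each two-word phrase. The sample statements, the stopword list, and the function names are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-ins for vendor value proposition statements.
# These are NOT the statements collected for the post.
STATEMENTS = [
    "Real-time analytics on streaming data for faster decisions",
    "Predictive analytics and machine learning at scale",
    "Data visualization that makes streaming data easy to explore",
    "Machine learning platform for predictive analytics",
]

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "and", "at", "for", "in", "of", "on", "that", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and return its words (hyphenated words stay whole)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def candidate_concepts(statements, min_docs=2):
    """Return (phrase, count) pairs for two-word phrases that appear
    in at least min_docs statements, most frequent first."""
    doc_freq = Counter()
    for statement in statements:
        tokens = tokenize(statement)
        # Use a set so each phrase counts once per statement.
        phrases = {
            f"{first} {second}"
            for first, second in zip(tokens, tokens[1:])
            if first not in STOPWORDS and second not in STOPWORDS
        }
        doc_freq.update(phrases)
    # Sort by descending count, then alphabetically, for a stable order.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [(phrase, count) for phrase, count in ranked if count >= min_docs]


if __name__ == "__main__":
    for phrase, count in candidate_concepts(STATEMENTS):
        print(f"{count}  {phrase}")
```

Counting statements rather than raw occurrences keeps one wordy vendor from dominating the list. Grouping related phrases into broader concepts would still need a real clustering step on top of this.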

Here is my list of Top 10 Big Data Concepts:

Why is the list potentially better? The intriguing aspect of this result set is the data sources. Each source is curated in some fashion. Data from public sites gives me links to other sites worthy of investigation. Publicly disclosed mission statements and value propositions from these referenced vendors give me definitive text that is easy to parse and organize into related topics. It is not structured data - just data that is easier to structure.
Is the list actionable? When I present this kind of list in public presentations, I see that people are very interested. They copy down the results, thinking they offer great insight.

My opinion: This list might be very significant. But until we do more analysis, I would say this list contains more speculation than actionable fact.