SmarterData.ca: 2013

Monday, June 24, 2013

Case Study: "What are Important Big Data Concepts?"

**** Looking BACK to 2013 ****
Much of this stuff is still true :)

The State of Text

Natural Language Processing (NLP) is a core technology used in text analytics to understand the meaning of unparsed text passages. This type of analytics has come a long way in the past decade. Deloitte does a nice job describing the value in Text Analytics - The three minute guide. Just beware: The accuracy and reliability of many text mining technologies diminish as text passages get shorter [Bland et al, Furlan et al]. Problems related to short text samples remain important since social networks, including Twitter, Facebook and Reddit, routinely create text excerpts that are too short to be reliably understood by traditional NLP algorithms [Iorica et al].

I wholeheartedly expect this situation to improve. Statistical text analytics is viable today. And deeper analysis in vertical domains, like healthcare, is quickly becoming a big data success story. Just set your expectations accordingly. Betting on text analytics across multiple business domains may leave you wanting in 2013.

The State of Structured and Semi-Structured Text

Readers of this blog know: The more data is structured, the more I like it. Not because of quantity but because of quality. Data that has been curated and prepared in structured or semi-structured form is routinely superior [Hanig et al]. Balancing your analysis to include all types of data remains a very good practice for the foreseeable future.

Simple Case Study:
"Finding Important Big Data Concepts"

A few months ago, I wanted to build a list of the top ten most useful big data concepts.

Unstructured keyword searches using Google, Bing and WolframAlpha yielded a 'diverse' set of results that described hardware provisioning and Hadoop, along with various solution providers offering services and consulting. Conclusion: These simple search-oriented queries produced somewhat underwhelming results.

Inconclusive search engine results are a common problem that I see when trying to dig into a broad subject area like "big data". To get more definitive results, I built a data set using various 'value proposition' statements from public websites. I pulled data for 50 Big Data Startups based on web searches, including lists from various technology sites. The results: concise data that was more interesting too.

Using topic clusters, I was able to build a collection of about 30 concepts. I then reduced that list to a 'top 10 ranking' (a process I will describe in an upcoming post).
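The clustering method itself is not described here, so the following is only a minimal sketch of one simple way to surface candidate concepts from short value-proposition statements: count how many statements mention each term. The statements, the stopword list and the `rank_terms` helper are all hypothetical and stand in for the real dataset and process.

```python
import re
from collections import Counter

# Hypothetical value-proposition statements; the original dataset is not shown.
STATEMENTS = [
    "Real-time analytics for streaming data",
    "Predictive analytics on Hadoop",
    "Data visualization for business users",
    "Real-time data visualization and dashboards",
]

# A tiny, illustrative stopword list.
STOPWORDS = {"for", "on", "and", "the", "of", "a", "to"}

def rank_terms(statements, top_n=3):
    """Count how many statements mention each term and return the top_n terms."""
    counts = Counter()
    for text in statements:
        # Use a set so a term counts once per statement.
        terms = set(re.findall(r"[a-z][a-z\-]+", text.lower())) - STOPWORDS
        counts.update(terms)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

TOP_TERMS = rank_terms(STATEMENTS)
```

Term counting is much cruder than real topic clustering, but it shows why short, curated statements are easy to organize: almost every remaining word is a candidate concept.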

Here is my list of Top 10 Big Data Concepts:

Why is the list potentially better? The intriguing aspect of this result set is the data sources. Each source is curated in some fashion. Data from public sites gives me links to other sites worthy of investigation. Publicly disclosed mission statements and value propositions from these referenced vendors give me definitive text that is easy to parse and organize into related topics. It is not structured data - just data that is easier to structure.

Is the list actionable? When I present this kind of list in public presentations, I see that people are very interested. They copy down the results, thinking it offers great insight.

My opinion: This list might be very significant. But until we do more analysis, I would say this list contains more speculative than actionable facts.

Friday, April 19, 2013

Use Case Decomposition for Big Data

As I begin to describe the value of a successful ASSESS, ALIGN, ASSERT and ACHIEVE big data methodology, I am reminded about a fundamental prerequisite: Use Cases.

From Wikipedia:
"a list of steps, typically defining interactions between a role or actor and a system, to achieve a goal."

I am convinced that Big Data benefits from Use Case Decomposition. It's a predecessor - and now an integral part of - Agile Software Construction. I am continually surprised at how well Use Cases assist in the analysis of uncertain data.

The Decomposition part is also worth mentioning. Moving from top to bottom, decomposition lets you break complex tasks into smaller, more understandable elements. When reviewing results, you get the chance to assemble all the composite pieces of information to ensure fundamental requirements have been met. It is also a really nice way for every member of a development organization to say: 'I understand what you are requesting and here's what I plan to do about it'.

Tuesday, April 9, 2013

Business Analytics for Big Data in 4 Words

Readers of this blog probably know the 4 V's of Big Data:

Volume, Velocity, Variety and Veracity (trust)

I would like to offer the 4 A's for Analytics:

Assess - Figure out what you need and where the data resides. Start with visualizations to gain understanding of large datasets. Look for opportunities to connect data. Having lots of data and attributes at this stage is a good thing.

Align - Align data with existing dimensions, metrics and measures to start building better sources of trusted data. Early association of key attributes improves the accuracy of text and entity analytics.

Assert - Here comes the statistics part: Create analytic data from sources and attributes identified in the previous steps. Find new connections with exploration and discovery. Achieve quantitative insight as you create new columns of data from old. Revisit previous assumptions to ensure you have consistent data. The more the data aligns with your assertions, the more trusted your new data becomes. But don't worry: inconsistent data is also good. It helps you find outliers that suggest your assertions might need to be revisited.

Achieve - Take action using the new data you created (obviously). But you also need to share analytic discoveries including data and procedures. Reuse these assets within your enterprise to build more advanced analytics. Continue revisiting and applying analytics to new data to build greater accuracy and trust.
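The four steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not a prescribed implementation: the records, the `KNOWN_REGIONS` dimension and all four function names are hypothetical.

```python
from statistics import mean

# Hypothetical raw records and a hypothetical trusted dimension.
RAW = [
    {"region": "east", "sales": 120.0},
    {"region": "EAST", "sales": 130.0},
    {"region": "west", "sales": 90.0},
    {"region": "north", "sales": 400.0},
]
KNOWN_REGIONS = {"east", "west"}

def assess(records):
    """Assess: list the attributes that are available across the records."""
    return sorted({key for rec in records for key in rec})

def align(records, dimension):
    """Align: normalize the key and mark whether it matches a trusted dimension."""
    aligned = []
    for rec in records:
        region = rec["region"].lower()
        aligned.append({**rec, "region": region, "aligned": region in dimension})
    return aligned

def assert_step(records):
    """Assert: create a new column from old (each sale relative to the mean)."""
    avg = mean(rec["sales"] for rec in records)
    return [{**rec, "vs_mean": round(rec["sales"] / avg, 2)} for rec in records]

def achieve(records):
    """Achieve: keep the rows that are trusted enough to share and act on."""
    return [rec for rec in records if rec["aligned"]]

ATTRIBUTES = assess(RAW)
RESULT = achieve(assert_step(align(RAW, KNOWN_REGIONS)))
```

In this sketch the unaligned "north" row is dropped only at the last step, so its unusually high `vs_mean` value is still visible during Assert as an outlier worth a second look.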

My goal in the next few weeks is to continue differentiating Analytics from more traditional Business Intelligence while showing how solutions that blend both disciplines can offer some of the best insight available.

Tuesday, April 2, 2013

What is an Analytic?

I work in the rapidly growing field of Business Analytics (BA). It is the natural evolution of Business Intelligence (BI) that we started two decades ago. Follow the links above and you see that BA is roughly described as:

"the skills, technologies, applications and practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning"

and

"Business analytics makes extensive use of data, statistical and quantitative analysis, explanatory and predictive modeling, and fact-based management to drive decision making."

That's a great start, but I still have trouble understanding the value of BA in the context of the reporting, spreadsheets and databases that I have used for many years. In other words, what do I get with Business Analytics that I didn't get from Business Intelligence? And how does Big Data change things?

Here is a series of slides I have been working on. I hope they help.

We use BI built on top of an IT infrastructure to prepare, cleanse and conform raw data into curated SQL (rows & columns) and OLAP (cubes and slices) tables. These warehouses and datamarts provide aligned data that is generally accurate, understandable and relevant within the confines of the applications they were meant to serve.

Here's a fundamental element to remember: BI organizes, summarizes and ultimately reduces the original volume of data.

Business Analytics adds the functions and procedures to augment BI data to reflect a wider domain of understanding. It literally expands data volumes to reflect greater amounts of inferred knowledge.

We explore data to find new associations. We extend data using various quantitative, qualitative and predictive functions to create new data from existing BI data. Together we get the insight that hopefully lets us make better decisions.
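A minimal sketch of the contrast, assuming a hypothetical table of order rows: the BI-style step reduces many rows to a summary, while the BA-style step adds a derived column to every original row. The data and the `summarize` and `extend` helpers are illustrative only.

```python
from collections import defaultdict

# Hypothetical order rows standing in for curated BI data.
ORDERS = [
    {"customer": "acme", "amount": 100.0},
    {"customer": "acme", "amount": 300.0},
    {"customer": "zenith", "amount": 50.0},
]

def summarize(rows):
    """BI-style reduction: many rows become one total per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)

def extend(rows, totals):
    """BA-style extension: add a derived column to every original row."""
    return [
        {**row, "share_of_customer": row["amount"] / totals[row["customer"]]}
        for row in rows
    ]

TOTALS = summarize(ORDERS)         # fewer rows than the input
EXTENDED = extend(ORDERS, TOTALS)  # same rows, one more column each
```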

At least that's how things worked before the arrival of Big Data.

The players are similar but the overall process often changes in a few important ways.

Data volume is the same or higher. Aligned and curated IT data is replaced by the different sources that are Big Data. We still prepare, cleanse and conform data, although actual sources are less likely to be moved to physical warehouse files. Data stays in place while virtual warehouses and datamarts are created on demand and live only as long as necessary.

Big Data introduces a new wrinkle: Previously, IT built most of the data used by BA. Now much of the data is referenced directly by Line-of-Business users. This adds great flexibility and speed. But the data can be suspect. More on this in a moment.

Analytic functions extend virtual data to reflect better understanding. The parameters to these functions are more diverse since data is less structured than before.

I define Business Analytics for Big Data as an extension to traditional analytics that offers an important new feature: Virtual data is related to existing BI data to improve alignment, accuracy, understanding and relevance. This step cannot be overemphasized. Without relating new data to old, it becomes much harder to trust diverse but uncurated big data sources.
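One way to picture relating new data to old is a lookup against a trusted dimension, with the match rate serving as a rough trust signal. This is a sketch only; the customer dimension, the incoming records and the `relate` helper are hypothetical.

```python
# Hypothetical trusted BI dimension: customer id -> customer name.
BI_CUSTOMERS = {"C001": "Acme Corp", "C002": "Zenith Ltd"}

# Hypothetical uncurated records pulled from an external source.
NEW_RECORDS = [
    {"customer_id": "C001", "sentiment": 0.8},
    {"customer_id": "c002", "sentiment": -0.2},
    {"customer_id": "X999", "sentiment": 0.5},
]

def relate(new_records, dimension):
    """Split new records into those that match the trusted dimension and those that do not."""
    matched, unmatched = [], []
    for rec in new_records:
        key = rec["customer_id"].strip().upper()  # light normalization before matching
        if key in dimension:
            matched.append({**rec, "customer_id": key, "customer_name": dimension[key]})
        else:
            unmatched.append(rec)
    return matched, unmatched

MATCHED, UNMATCHED = relate(NEW_RECORDS, BI_CUSTOMERS)
MATCH_RATE = len(MATCHED) / len(NEW_RECORDS)
```

A low match rate does not prove the new source is wrong, but it is a cheap warning that the data needs more alignment work before anyone relies on it.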

Wednesday, February 20, 2013

Where ETL is Headed

Extract, Transform, Load (ETL) has been with us since the beginning of Business Intelligence.

For decades ETL has been the IT capability responsible for producing ODS/EDW/MOLAP datasources for line-of-business consumption. This important function creates the tables and datamarts that enable traditional reporting and analysis. ETL includes functions like data cleansing, normalization and validation.
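As a rough illustration of the cleansing, normalization and validation functions mentioned above, here is a minimal ETL sketch using only the Python standard library. The CSV content, the column names and the rejection rules are assumptions made for the example.

```python
import csv
import io

# Hypothetical source extract with typical quality problems.
RAW_CSV = """name,country,revenue
 Acme ,ca,1200
Zenith,CA,not available
,us,300
Orion,US,450
"""

def extract(text):
    """Extract: read the raw rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cleanse, normalize and validate each row."""
    clean, rejected = [], []
    for row in rows:
        name = row["name"].strip()                # cleansing: trim stray whitespace
        country = row["country"].strip().upper()  # normalization: one country format
        try:
            revenue = float(row["revenue"])       # validation: revenue must be numeric
        except ValueError:
            rejected.append(row)
            continue
        if not name:                              # validation: name is required
            rejected.append(row)
            continue
        clean.append({"name": name, "country": country, "revenue": revenue})
    return clean, rejected

def load(rows, target):
    """Load: append the validated rows to the target table (a list here)."""
    target.extend(rows)
    return len(rows)

WAREHOUSE = []
CLEAN, REJECTED = transform(extract(RAW_CSV))
LOADED = load(CLEAN, WAREHOUSE)
```

Keeping the rejected rows, rather than silently dropping them, gives whoever owns the source something concrete to fix.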

The horizontal (wide ranging) domain of big data means associated information has less consistency and alignment than traditional corporate data. So much so, that many pundits declare ETL will become extinct as we build out more big data solutions.

I gently suggest that ETL isn't going away. Instead, many related activities, like finding, filtering, organizing and categorizing, will be pushed up to the line-of-business user. Armed with new tools and techniques, this domain expert will enter the realm of self-service ETL.

The role of IT changes too. Large horizontal information sources are now provisioned in a "Data Landing Area" that stages unstructured, semi-structured and unmodelled data. Users pull data from these sources to facilitate their specific analysis requirements.

Does this mean IT is out of the ETL business? Not at all. When a Line-of-Business user has identified topics of interest, outliers or new data correlations, they push the metadata description of their analysis back to IT, which prioritizes requests based on actual use and then creates the most needed enterprise-ready (cleansed, normalized and validated) datasets for more general use.

The result: An efficient workflow serving a larger number of users with greater agility and accuracy.

Tuesday, February 12, 2013

The Rise of the Column Store

NoSQL and related column-oriented databases are offered as a solution for datasets with requirements for high volume, velocity, variety and veracity.

The underlying principles are sound. Database design follows function, which is to say, data is organized optimally for the problems at hand.

Perhaps the greatest advantage: Data is left 'as-is' - or at least - minimally processed. Contrary to conventional wisdom, this uncategorized and partially cleansed data becomes usable in a wider variety of applications.

The downside: Data is not always properly normalized, meaning records can overlap, be duplicated or be missing altogether.
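A quick profiling pass can surface these problems before analysis begins. The sketch below is an assumption-laden illustration: the records, the `id` key and the `profile` helper are hypothetical, and a real store would need a check for overlapping records as well.

```python
from collections import Counter

# Hypothetical minimally processed records.
RECORDS = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 4, "email": None},
]

def profile(records, key, required):
    """Return the duplicated key values and the keys of records missing a required field."""
    key_counts = Counter(rec[key] for rec in records)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)
    incomplete = [
        rec[key] for rec in records
        if any(rec.get(field) is None for field in required)
    ]
    return duplicates, incomplete

DUPLICATES, INCOMPLETE = profile(RECORDS, key="id", required=["email"])
```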

In the next post, I will examine the NoSQL behaviors that make this type of datasource a winner for Big Data analysis.

Monday, February 11, 2013

Traditional Business Intelligence

This simple methodology has served the needs of Business Intelligence for decades.

As datasets increase in size, these traditional BI systems are challenged by:

Performance - Many relational operations require all records to be processed. Joining or summing a billion records is problematic even with Hadoop.
Relevance - ETL often defines how data will be used. Ad hoc or new types of analysis will mean the data needs to be repurposed and transformed.
Storage - Storage is cheap. But saving several copies of a dataset each time it is repurposed is simply not possible.

Tomorrow, I'll compare these traditional BI methodologies with those that are energizing the Big Data world. Can they coexist? And what are the strategies for BI as Big Data evolves further? The answer may surprise you.

Friday, January 25, 2013

Authoritative to Actionable

It's Magic Bullet time! If you remember one thing that I write this year, let it be this:

Business Runs on Columns

Columns equal refinement.
Columns equal clarity.
Columns equal understanding.

Our love for reports, data cubes and particularly spreadsheets is a reflection of this business reality.

A primary goal for anyone doing big data analysis: Transform some of your conclusions into tables and columns. Such data is more reusable. It is also very convincing.
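As a small illustration of moving findings into columns, the sketch below turns loosely shaped records into a table with one consistent set of columns. The findings and the `to_table` helper are hypothetical, and this is only one simple way to do it.

```python
# Hypothetical findings with inconsistent attributes.
FINDINGS = [
    {"topic": "pricing", "mentions": 42},
    {"topic": "support", "mentions": 17, "sentiment": "negative"},
    {"topic": "shipping", "sentiment": "positive"},
]

def to_table(records):
    """Return (columns, rows) where every row has a value or None for every column."""
    columns = sorted({key for rec in records for key in rec})
    rows = [[rec.get(col) for col in columns] for rec in records]
    return columns, rows

COLUMNS, ROWS = to_table(FINDINGS)
```

The `None` cells are useful in their own right: they show exactly which attributes are still missing for each finding.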

The challenge: Big Data doesn't seem to fit nicely into columns. In fact, most unstructured data seems to be the polar opposite.

Over the next few weeks, I'll discuss methods to help you effectively promote your brilliant conclusions using good old fashioned columns!

Just remember: Columns show you understand your data.

Monday, January 7, 2013

Big Data is so Last Year

Let's all agree: Last year's Big Data is this year's Better Understood Data.

Volume, velocity, variety and veracity (value) are as important as ever. The tools needed to handle these types of data will simply become more mainstream. But how much has really changed?

Let me illustrate by comparing traditional Business Analytics with the new world of Big Data:

Traditional Approach	New Big Data Approach!
SQL Query	SQL Query
Tables	Tables
Columns	Columns
Rows	Rows
Filters	Filters
Calculations	Calculations
Search	Search
Load	Load
Transform	Transform

I present this somewhat whimsical comparison to make a point: Big Data does not have to change our fundamental organization of data. We can leverage existing principles to work with data as we always have. The devil is in the details.
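To make the point concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for whatever engine holds the data. The `events` table and its rows are hypothetical; the query itself uses only tables, columns, rows, a filter and a calculation.

```python
import sqlite3

def clicks_by_source(rows, min_clicks):
    """Load rows into an in-memory table, then filter and total them with plain SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE events (source TEXT, clicks INTEGER)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)  # Load
        query = (
            "SELECT source, SUM(clicks) AS total "  # Calculation
            "FROM events "
            "WHERE clicks > ? "                     # Filter
            "GROUP BY source "
            "ORDER BY source"
        )
        return conn.execute(query, (min_clicks,)).fetchall()
    finally:
        conn.close()

# Hypothetical event rows.
EVENTS = [("web", 10), ("mobile", 25), ("web", 5), ("mobile", 1)]
TOTALS = clicks_by_source(EVENTS, min_clicks=2)
```

Many Big Data engines accept SQL or a close dialect of it, which is why this style of query tends to carry over even when the storage layer changes.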

There are more than 50 Big Data management products on the market today. We can expect that number to at least double in 2013. Each vendor offers different setup, query and storage methods, resulting in yet more implementation options. Given the choices, which schema, query and storage strategy will win?

I predict the Big Data market will eventually coalesce around relatively few schemas and processing architectures. NoSQL and Hadoop (both batch and real-time) will remain. But the safest prediction I can make is: SQL Query, Tables, Columns, Rows, Filters, Calculations, Search, Load and Transform will remain in the Business Analyst's Big Data toolkit.

I know my focus for 2013 too: Help clients and colleagues become comfortable with Big Data knowing that their current skills and methodologies will serve them well as they strive to understand their data better than ever before.