SmarterData.ca: February 2013

Wednesday, February 20, 2013

Where ETL is Headed

**** Looking BACK to 2013 ****

Extract, Transform, Load (ETL) has been with us since the beginning of Business Intelligence.

For decades ETL has been the IT capability responsible for producing ODS/EDW/MOLAP datasources for line-of-business consumption. This important function creates the tables and datamarts that enable traditional reporting and analysis. ETL includes functions like data cleansing, normalization and validation.
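
To make those three functions concrete, here is a minimal sketch of an ETL pass in Python. Everything in it is an assumption made for illustration: the sample CSV, the column names, the `sales` table and the helper names `extract`, `transform` and `load` are hypothetical and not taken from any particular ETL product.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; the column names and values are illustrative only.
RAW_CSV = """customer,region,amount
 Alice ,ON,100.50
BOB,on,20
carol,QC,not-a-number
"""

def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cleanse, normalize and validate each row."""
    clean, rejected = [], []
    for row in rows:
        name = row["customer"].strip().title()   # cleansing: trim whitespace, fix case
        region = row["region"].strip().upper()   # normalization: one code per region
        try:
            amount = float(row["amount"])        # validation: amount must be numeric
        except ValueError:
            rejected.append(row)                 # keep bad rows aside for review
            continue
        clean.append((name, region, amount))
    return clean, rejected

def load(rows):
    """Load: write validated rows into a reporting table (in-memory SQLite here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return conn

clean_rows, rejected_rows = transform(extract(RAW_CSV))
conn = load(clean_rows)
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

The row with a non-numeric amount is set aside rather than loaded, which is the kind of decision traditional ETL makes on the consumer's behalf.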

The horizontal (wide-ranging) domain of big data means the associated information is less consistent and less aligned than traditional corporate data, so much so that many pundits declare ETL will become extinct as we build out more big data solutions.

I gently suggest that ETL isn't going away. Instead, many related activities, like finding, filtering, organizing and categorizing, will be pushed up to the line-of-business user. Armed with new tools and techniques, this domain expert will enter the realm of self-service ETL.

The role of IT changes too. Large horizontal information sources are now provisioned in a "Data Landing Area" that stages unstructured, semi-structured and unmodelled data. Users pull data from these sources to facilitate their specific analysis requirements.

Does this mean IT is out of the ETL business? Not at all. When a line-of-business user has identified topics of interest, outliers or new data correlations, they push the metadata description of their analysis back to IT. IT prioritizes these requests based on actual use, and then creates the most needed enterprise-ready (cleansed, normalized and validated) datasets for more general use.
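
One way to picture "prioritizes requests based on actual use" is a simple tally. The sketch below is only an illustration under stated assumptions: the shape of a metadata description, the source names and the `prioritize` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical metadata descriptions pushed back by line-of-business users.
# Each entry names the landing-area sources that one analysis pulled from.
analysis_requests = [
    {"user": "marketing", "sources": ("web_logs", "crm")},
    {"user": "finance", "sources": ("crm", "invoices")},
    {"user": "support", "sources": ("web_logs", "crm")},
    {"user": "sales", "sources": ("web_logs", "crm")},
]

def prioritize(requests):
    """Rank source combinations by how many analyses actually used them."""
    usage = Counter(tuple(sorted(r["sources"])) for r in requests)
    return usage.most_common()  # most-used combination first

backlog = prioritize(analysis_requests)
```

Here the `crm` plus `web_logs` combination rises to the top of the backlog, so it would be the first candidate for an enterprise-ready dataset.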

The result: An efficient workflow serving a larger number of users with greater agility and accuracy.

Tuesday, February 12, 2013

The Rise of the Column Store

NoSQL and related column-oriented databases are offered as a solution for datasets with requirements for high volume, velocity, variety and veracity.

The underlying principles are sound. Database design follows function, which is to say, data is organized optimally for the problem at hand.
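
As a rough illustration of design following function, the sketch below holds the same three records in a row-oriented and a column-oriented layout. The field names and values are made up, and plain Python lists and dicts stand in for real storage engines.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"order_id": 1, "region": "ON", "amount": 100.0},
    {"order_id": 2, "region": "QC", "amount": 250.0},
    {"order_id": 3, "region": "ON", "amount": 75.0},
]

# Column-oriented layout: each field is stored together.
columns = {
    "order_id": [1, 2, 3],
    "region": ["ON", "QC", "ON"],
    "amount": [100.0, 250.0, 75.0],
}

# Summing one field in the row layout visits every whole record.
total_from_rows = sum(record["amount"] for record in rows)

# The same sum in the column layout reads only the one list it needs.
total_from_columns = sum(columns["amount"])
```

Both layouts give the same answer. The difference is how much data has to be touched to get it, which is why a column layout suits analytical questions that aggregate a few fields across many records.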

Perhaps the greatest advantage: data is left 'as-is' or, at least, minimally processed. Contrary to conventional wisdom, this uncategorized and partially cleansed data becomes usable in a wider variety of applications.

The downside: data is not always properly normalized, meaning records can overlap, be duplicated or be missing altogether.
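
A small sketch of how those three problems might be surfaced before analysis follows. The records, the `expected_ids` set and the choice of `email` as the overlap key are all hypothetical, chosen only to show one case of each problem.

```python
from collections import Counter

# Hypothetical minimally processed records; the ids and fields are illustrative.
records = [
    {"id": 101, "email": "a@example.com"},
    {"id": 102, "email": "b@example.com"},
    {"id": 102, "email": "b@example.com"},   # exact duplicate
    {"id": 104, "email": "a@example.com"},   # overlaps with id 101 on email
]
expected_ids = {101, 102, 103, 104}          # assumed to be known from the source system

# Duplicated: the same id appears more than once.
id_counts = Counter(r["id"] for r in records)
duplicated_ids = sorted(i for i, n in id_counts.items() if n > 1)

# Overlapping: different ids share the same email.
emails = {}
for r in records:
    emails.setdefault(r["email"], set()).add(r["id"])
overlapping = {e: sorted(ids) for e, ids in emails.items() if len(ids) > 1}

# Missing: an expected id never shows up.
missing_ids = sorted(expected_ids - set(id_counts))
```

With as-is data, checks like these move from the load step to the point of analysis.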

In the next post, I will examine the NoSQL behaviors that make this type of datasource a winner for Big Data analysis.

Monday, February 11, 2013

Traditional Business Intelligence

This simple methodology has served the needs of Business Intelligence for decades.

As datasets increase in size, these traditional BI systems are challenged by:

Performance - Many relational operations require all records to be processed. Joining or summing a billion records is problematic even with Hadoop.
Relevance - ETL often defines how data will be used. Ad hoc or new types of analysis mean the data needs to be repurposed and transformed.
Storage - Storage is cheap. But saving another full copy of a dataset each time it is repurposed is simply not practical at this scale.

Tomorrow, I'll compare these traditional BI methodologies with those that are energizing the Big Data world. Can they coexist? And what are the strategies for BI as Big Data evolves further? The answer may surprise you.