**** Looking BACK to 2012 ****
Working with large data volumes is never easy. Here are three strategies that can help you.
- Keep your eye on context. Where the data came from, who produced it, and when it was created are attributes that assist many kinds of analysis. Traditional Extract, Transform, and Load (ETL) systems take pride in stripping out these "noisy" attributes, but any attribute could turn out to be the key that unlocks a hidden correlation. Expect to add context as you process incoming streams, not remove it.
- Some attributes are better than others! Time and location are examples of broad filters that can reduce a dataset to a manageable size, which can then be analyzed more deeply than is possible across the full domain. In terms of volume: reduce first with broad filters, then apply deep analysis with as many attributes as you can tolerate. The sketch after this list shows both of these ideas.
- Big data analysis gets better the more you work it. You start with a good hypothesis. By retesting your analysis with both supporting and conflicting data, you arrive at a more complete solution.
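To make the first two points concrete, here is a minimal Python sketch. All field and function names are hypothetical, not from any particular toolkit: `ingest` keeps every context attribute alongside the payload, and a cheap date filter cuts volume before anything expensive runs.

```python
from datetime import datetime

def ingest(raw_record, source, received_at):
    """Keep the payload AND its context; do not strip 'noisy' attributes."""
    record = dict(raw_record)               # the data itself
    record["_source"] = source              # where it came from
    record["_received_at"] = received_at    # when it arrived
    return record

def broad_filter(records, newer_than):
    """Cheap, broad filter first: cut the volume before any deep analysis."""
    return [r for r in records if r["_received_at"] >= newer_than]

# Hypothetical usage: reduce first, then analyze the survivors deeply.
stream = [ingest({"rating": 2, "comments": "..."}, "web-form", datetime(2012, 3, 1))]
recent = broad_filter(stream, newer_than=datetime(2012, 1, 1))
```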
Let's look at a very simple example that pulls all these ideas together.
You look at a set of customer feedback comments. Attributes include rating, date, product, comments, and author name. Assume you have many hundreds of thousands of items and you are keeping all attributes.
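A single feedback item might look something like this; the values are made up for illustration, and the later sketches assume a `feedback` list of such records:

```python
feedback = [
    {
        "rating": 1,                     # 1-5 stars
        "date": "2012-02-14",
        "product": "WidgetPro 3000",     # hypothetical product name
        "comments": "Stopped working after two days. Very disappointed!",
        "author": "jsmith42",
    },
    # ... hundreds of thousands more items
]
```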
- The first filter you apply is by date, to remove comments about obsolete products. This reduces your data to thousands of items. A good broad filter gets you down to a workable dataset fast.
- You look at low ratings first. Scanning product names and authors, you see a rather wide range. But on a hunch, you sort the comment text and find that 700 comments are essentially the same (using a Bayesian filter to group text that is identical apart from minor differences). You have identified an ad hoc discriminator: identical comments from different people about different products likely come from the same source and should be discounted. A sketch of this date-then-duplicates pass follows the list.
- Finally, you look at the referenced product names. You apply a simple text extractor to pull out the product names you care about. Comparing the remaining items shows that 50% of comments refer to a product that differs from the product-name attribute attached to the text. You quickly infer that there may be a problem with the ETL process; this simple revisiting of the data shows that your ETL pipeline may need some work. A sketch of this check also follows the list.
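Here is one way the first two steps might look in Python. The post mentions a Bayesian filter; as a simpler stand-in, this sketch groups comments by a normalized form of their text, which only catches the most literal near-duplicates. It assumes the hypothetical `feedback` list from above:

```python
import re
from collections import defaultdict
from datetime import date

def normalize(text):
    """Collapse case, punctuation, and whitespace so near-identical
    comments map to the same key (a crude stand-in for the Bayesian
    filter mentioned above)."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

# Step 1: broad date filter to drop comments about obsolete products.
recent = [r for r in feedback if date.fromisoformat(r["date"]) >= date(2012, 1, 1)]

# Step 2: group low-rated comments by their normalized text.
groups = defaultdict(list)
for r in recent:
    if r["rating"] <= 2:
        groups[normalize(r["comments"])].append(r)

# Identical comments from different authors about different products
# likely share one source and should be discounted.
for text, items in groups.items():
    authors = {r["author"] for r in items}
    products = {r["product"] for r in items}
    if len(items) >= 100 and len(authors) > 1 and len(products) > 1:
        print(f"Suspect block: {len(items)} near-identical comments "
              f"across {len(authors)} authors and {len(products)} products")
```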
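And a sketch of the final check, again under made-up names: a naive extractor looks for known product names in the comment text and counts how often they disagree with the record's `product` attribute. It reuses `recent` from the previous sketch:

```python
KNOWN_PRODUCTS = {"WidgetPro 3000", "WidgetLite", "GadgetMax"}  # hypothetical

def mentioned_products(text):
    """Naive extractor: which known product names appear in the text?"""
    lowered = text.lower()
    return {p for p in KNOWN_PRODUCTS if p.lower() in lowered}

checked = mismatched = 0
for r in recent:
    mentioned = mentioned_products(r["comments"])
    if mentioned:                        # only count comments that name a product
        checked += 1
        if r["product"] not in mentioned:
            mismatched += 1

if checked:
    rate = 100 * mismatched / checked
    print(f"{rate:.0f}% of comments name a product other than their "
          f"'product' attribute; time to check the ETL pipeline.")
```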
Just remember: this scenario is deliberately simple. Working with truly large datasets calls for more tools and techniques than shown here.
Stay tuned :)