This content, written by Scott Hoover, was initially posted in Looker Blog on Feb 25, 2014. The content is subject to limited support.
The difference between big data and smart data comes down to size versus application. Big data is a buzzword for "lots of data." However, what one does with all of that data isn't entirely obvious from the term. Smart data, on the other hand, is about what one does with the data, irrespective of size.
Taking a smart-data approach, one might recognize that bigger isn't always better. For a predictive model, will a simple random sample suffice? What's the marginal impact on our model's accuracy when relying on 5 million rows versus 10 billion rows? Any worthwhile statistician will tell you that, for most models, the marginal impact is negligible: sampling error shrinks only with the square root of the sample size, and at 5 million rows it is already tiny.
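To make the sampling argument concrete, here is a minimal sketch in Python. It assumes the quantity of interest is a simple rate (say, a conversion rate) estimated from a simple random sample. The function names and the 30% rate are hypothetical, chosen only for illustration, and a model with many parameters may benefit from more data than this simple case suggests.

```python
import math
import random


def standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion from a simple random sample of size n."""
    return math.sqrt(p * (1.0 - p) / n)


def simulate_estimate(true_rate: float, n: int, rng: random.Random) -> float:
    """Estimate a rate from n independent draws that succeed with probability true_rate."""
    hits = sum(1 for _ in range(n) if rng.random() < true_rate)
    return hits / n


# Theoretical sampling error at the two sizes mentioned in the text.
for rows in (5_000_000, 10_000_000_000):
    print(f"{rows:>14,} rows: standard error = {standard_error(0.5, rows):.7f}")

# Small simulation: the estimate settles near the true rate well before "big" sizes.
rng = random.Random(42)  # fixed seed so the run is repeatable
TRUE_RATE = 0.30  # hypothetical rate, for illustration only
for n in (1_000, 10_000, 100_000):
    estimate = simulate_estimate(TRUE_RATE, n, rng)
    print(f"n = {n:>7,}: estimate = {estimate:.4f}")
```

In this setting, going from 5 million to 10 billion rows multiplies the data by 2,000 but divides the standard error by only about 45 (the square root of 2,000), and that error is already in the fourth decimal place at 5 million rows.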
Smart also means considering cost. While the cost of the hardware behind parallel computing solutions continues to go down, the time (and therefore the cost) associated with writing a job for them using Hadoop can be considerable, particularly as model complexity increases.
For companies turning to smart data, accessibility of data and the ability to quickly execute in the face of complexity are important and perhaps competing factors. If an emerging company recognizes that it can build a great recommender platform relying on little more than an SQL backend and Python, then it may have saved itself a lot of time and money compared with the big-data alternative.
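As a rough illustration of how little machinery such a recommender can need, the sketch below uses Python's built-in sqlite3 module and a single SQL query to suggest items that were bought by the same users. Everything here is an assumption made for the example: the `purchases` table, its columns, the sample rows, and the co-purchase counting approach, which is one simple technique among many and not a description of any particular company's system.

```python
import sqlite3

# Hypothetical purchase log, kept in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [
        (1, "tent"), (1, "stove"), (1, "lantern"),
        (2, "tent"), (2, "stove"),
        (3, "tent"), (3, "boots"),
        (4, "stove"), (4, "lantern"),
    ],
)


def recommend(conn: sqlite3.Connection, item: str, limit: int = 3):
    """Return (item, shared_buyers) pairs for items most often bought alongside `item`."""
    query = """
        SELECT b.item, COUNT(DISTINCT b.user_id) AS shared_buyers
        FROM purchases AS a
        JOIN purchases AS b
          ON a.user_id = b.user_id AND a.item <> b.item
        WHERE a.item = ?
        GROUP BY b.item
        ORDER BY shared_buyers DESC, b.item ASC
        LIMIT ?
    """
    return conn.execute(query, (item, limit)).fetchall()


print(recommend(conn, "tent"))
```

The SQL does the heavy lifting (a self-join on user, then a count), and Python only passes parameters and reads results, which is the kind of setup that runs comfortably on a single database server.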
Does this mean big data is dead? Hardly. Getting a full picture of user behavior is critical, and big data plays a key role. If I wanted summaries regarding user behavior broken out by some demographic or geographic attribute, why discard useful data? Go big! If, however, my machine learning algorithm can tell me everything I need using a modest data set without having to write a MapReduce job, then why not save myself time and money?
Any economist will tell you that, in most decisions, there are tradeoffs. However, approaching data science intelligently doesn't necessarily mean jettisoning the notion of big data. It just means knowing when to pull out a Swiss Army knife instead of a chainsaw.