Welcome to Mosaic Data Science, and thanks for reading our blog! We’ll frequently opine here about various technical and managerial data science topics, so visit often.
The phrase ‘big data’ has become enormously popular in the business press. Like many business buzz phrases, it has lost much of its original meaning. These days, when a business writer says “big data,” they more often mean data science, or data science applied to a large data set. Some traditional-BI vendors try to capitalize on the buzz by describing new features of their offerings as supporting “big data,” even though those features work within the traditional relational-database paradigm, which big data by definition does not fit.
The phrase does have a clear (and useful) original definition. Big data is data that is too big to be stored economically in a relational database. Just what that means depends on whose budget we’re talking about, and what year. Regardless, many new data-storage technologies have been invented out of the need to store data that’s too expensive to manage with a relational database. There’s just too much of it.
Big data storage technologies fall into two groups. The noSQL systems eschew an SQL-based query interface, and typically forgo many of the transaction-management guarantees that traditional databases provide. The newSQL systems embrace SQL, but try to reinvent the storage behind it without giving up most of the traditional guarantees. Both kinds of system start as R&D efforts, either at companies such as Google and Facebook, whose business strategies depend squarely on effective management of (very) big data, or in academic very large database (VLDB) research programs.
Inevitably the software vendors come along later and try to bolt SQL back onto some noSQL systems, doing violence to the original design in the process. Cloudera’s Impala and Hadapt are examples. They provide SQL interfaces for Hadoop, which was designed around the MapReduce programming model for embarrassingly parallel workloads. These bolt-ons turn noSQL into newSQL. One wonders if the point is to sell a cheap relational database (and less cheap services to go with it), rather than to make Hadoop more “flexible.” If a cheap relational database is the point, why not PostgreSQL? In 2008 Yahoo ran the world’s largest (relational) data warehouse (2 PB) on a hacked version of PostgreSQL. That’s pretty big.
Which brings us to the point. Many companies (most, perhaps) don’t have strong use cases for noSQL or newSQL technologies. They wish they did: traditional BI is an expensive business, and big data sure looks fun. Over time, as businesses accrue more data and the non-traditional storage technologies mature, more of us will probably learn to say “big data” when we really mean it. But right now, the common use cases are far more likely to center on deriving big value from small data: data a company already has, tucked neatly away in relational databases (or even mere file systems). Even companies that have, or could have, big data will not magically create value just by capturing the data or by spinning up a Hadoop cluster. Data delivers value only when it produces actionable insight and someone acts properly on that insight. Data is not an end in itself. And storing data, using whatever technology, is the easy part.
If there is gnosis in the world of data management, it lies in producing actionable insight, and acting properly on it. Those imperatives are the stuff that data science is made of. Data science is more than statistics, and more than creating insight. It’s about communicating the insight in a way that decision makers will embrace, and making sure those decision makers have the means, motive, and opportunity to act on it. Only then does your data become an asset. Knowing what your data opportunities are, and knowing how to capitalize on them to produce big ROI: that’s the secret. If you can do it with small data, so much the better!