posted by Mosaic Data Science
No one pours new wine into old wineskins. Otherwise, the wine will burst the skins, and both the wine and the wineskins will be ruined. No, they pour new wine into new wineskins.
Executives considering how to apply data science to their organizations often ask Mosaic about “relevant industry experience.” Historically this has been a legitimate question to aim at a management consultant. Each industry has had its own set of best practices. A consultant’s responsibility has generally been to provide expertise about these practices and guide the customer in applying them profitably. For example, two decades ago a fashion retailer might reasonably ask a business consultant about her or his expertise with the Quick Response method, then a best practice for fashion retail.[i] Posing the same sort of question now to a data scientist assumes that industry experience continues to play the same role in data science that it has historically in management consulting.
In this post we explain why the assumption about industry experience is outdated—why often industry experience detracts from the best possible application of data science to business decisions that merit scientific analysis. The convergence of four trends has resulted in data science expertise displacing industry expertise as the critical qualification, even in highly technical industries such as healthcare. Those trends are
The result of the convergence of these trends is that a data scientist can very quickly
We now consider each trend in turn, and illustrate its impact with recent examples from several industries.
Over the last two decades, most peer-reviewed research and professional publications have gone online. While online journals frequently require subscriptions, it is no longer necessary to visit a library to access journals. Moreover, most journals make article abstracts freely available. Google Scholar tracks how often each article is cited, which is a quick and fairly reliable indicator of article importance (at least when time since publication is accounted for!). Likewise, Google has now accumulated over 10% of the world’s literature online. All U.S. patents, as well as federal and state case law, are available on Google. Many technical books are available in softcopy, and Amazon sells more e-books than printed books.
All of these changes, coupled with search engine technology, mean a competent researcher can very quickly survey the literature relevant to a specific data science challenge. Such a survey reveals which variables and algorithms appear in state of the art methods. For example, one of Mosaic’s data scientists recently surveyed in a single afternoon the entire literature (about 100 articles) related to a specific social-science problem. The survey discovered the variables and model structures in the state of the art approach to the problem, as well as the areas of active research. In another case a Mosaic data scientist surveyed the hurricane-modeling literature, to discover how weather researchers currently model hurricane behavior. That survey took a couple of days, and the researcher soon improved on the state of the art by discovering a hitherto unrecognized variable with far more predictive power than variables in published models.
The importance of the universal availability of research and professional literature extends beyond merely educating the data scientist about a given domain’s variables and models. Researchers now regularly mine research using text-analysis models to discover hidden domain knowledge. Examples abound. For instance, data scientists built a relational dependency network (RDN) model based on over a half-million disclosure documents in the National Association of Securities Dealers (NASD) Central Registration Depository. The RDN model discovered 11 variables that reliably identify problem brokers and disclosure types, as well as re-discovering important patterns of risk (past problems predict future problems, and problems such as fraud and malfeasance tend to run in branches of brokerages) known to securities experts.[ii] Likewise, data scientists now use literature mining regularly to discover previously unknown relationships among biomedical concepts such as proteins, genes, drugs, and diseases.[iii]
In short, the general availability of research and professional literature lets a data scientist quickly learn which variables an industry considers important for a given problem, and also mine the relevant literature for previously undiscovered relationships. These activities combine with the other trends we outline below to create new knowledge when the data scientist approaches the problem as an industry novice, aware that he or she must review and perhaps mine the literature. Subject-matter experts, on the other hand, must first become willing to set aside their expertise in order to leverage the research and professional literature in hopes of improving on the state of the art. And subject-matter experts often lack the technical expertise necessary to use text mining to extract knowledge from a domain literature.
In the early days of expert systems, an expert system was considered a success if it performed as well as a human expert in a field requiring years of training. The classic example is MYCIN, an artificial intelligence medical application developed in the 1970s at Stanford University. MYCIN recommended appropriate treatment in 69% of cases, slightly better than human experts facing the same diagnostic and therapeutic tasks. Current-generation expert systems significantly improve on expert human judgment. For example, decision support systems (DSSs) created by Mosaic to guide certain air-traffic control decisions improve on the baseline unguided decisions by reducing delays up to 78%, even though Mosaic’s data scientists have never served as air traffic controllers. Another expert system designed by a Mosaic data scientist reduced two years of expert programmer work to a single person-day of non-technical business analyst effort, even though the data scientist had only a general knowledge of the underlying business domain.
Even in MYCIN’s era, other classes of algorithms already far surpassed human judgment in their areas of application. This was (and remains) especially true in optimization, where the number of possible solutions to many business problems is practically infinite. For example, there are 50! ≈ 3×10^64 possible ways to route a delivery truck that must visit 50 stops in a day. A human expert cannot hope to sort through the alternatives to determine which is most cost effective, especially if the relevant costs go beyond mere mileage to include wait times at stoplights, penalties for late deliveries of priority packages, the risk of accidents in high-risk areas, etc. This is the problem that UPS’ ORION decision-support system (DSS) tackles. It improves on experienced truck drivers’ intuitions enough to save UPS 35 million miles per year. (And it contains some 85 pages of mathematical formulas!)[iv]
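To make the scale of the routing problem concrete, here is a minimal Python sketch. It first computes 50!, then compares exhaustive search with a simple nearest-neighbor heuristic on a toy problem. The seven stop coordinates and the function names are hypothetical, chosen only for illustration, and straight-line distance stands in for the richer costs described above. This is not how ORION works; it only shows why enumeration stops being an option long before 50 stops.

```python
import itertools
import math

# Number of distinct visiting orders for 50 stops.
print(f"{math.factorial(50):.2e}")  # 3.04e+64

# Hypothetical (x, y) stop coordinates, few enough to enumerate exhaustively.
STOPS = [(0, 0), (2, 6), (5, 1), (6, 5), (8, 2), (1, 3), (7, 7)]


def route_length(order, stops):
    """Straight-line length of an open route visiting stops in this order."""
    return sum(math.dist(stops[a], stops[b]) for a, b in zip(order, order[1:]))


def brute_force(stops):
    """Try every ordering (n! of them) and keep the shortest."""
    best = min(itertools.permutations(range(len(stops))),
               key=lambda order: route_length(order, stops))
    return list(best), route_length(best, stops)


def nearest_neighbor(stops, start=0):
    """Greedy heuristic: always drive to the closest unvisited stop."""
    order = [start]
    unvisited = set(range(len(stops))) - {start}
    while unvisited:
        here = stops[order[-1]]
        nxt = min(unvisited, key=lambda j: math.dist(here, stops[j]))
        order.append(nxt)
        unvisited.remove(nxt)
    return order, route_length(order, stops)


exact_order, exact_len = brute_force(STOPS)      # 7! = 5,040 orderings
greedy_order, greedy_len = nearest_neighbor(STOPS)
print("exhaustive:", exact_order, round(exact_len, 2))
print("greedy:    ", greedy_order, round(greedy_len, 2))
```

With seven stops the exhaustive search finishes instantly. Each additional stop multiplies the work, which is why practical systems rely on optimization algorithms rather than enumeration or intuition.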
Likewise, even decades ago human experts could not quantify the risk related to random phenomena such as severe weather events or fatal diseases nearly as well as the best scientific models. Optimization in the presence of multiple sources of uncertainty has always been a practical impossibility for domain experts. Mosaic data scientists have contributed to the state of knowledge in this area for many years.
Over the last two decades there has been a striking maturation of data science techniques generally. These techniques are not only published in the literature, but are widely available in prepackaged tools or libraries, often at no charge. (The R language’s CRAN repository has about 5,400 packages, for example.) The result is a repertoire of analytical techniques available to today’s data scientist that are far more advanced than those available to researchers—let alone consultants—in the recent past. Here are some prominent examples:
Text mining leaped forward in the ‘90s when statistical models displaced or combined with grammatical models to create powerful search, information extraction, and sentiment analysis tools capable of extracting knowledge from millions of documents—far more literature than any expert might read in a lifetime.[v]
Dimensionality reduction techniques have recently evolved to support datasets containing thousands of variables.[vi] These techniques let data scientists evaluate far more possible combinations of variables than any expert might contemplate, let alone evaluate.
Recently developed analytical techniques that accurately model nonlinearities let a data scientist create more powerful prediction and classification models than were possible in the near past. These methods include
Nonlinearities are difficult for subject-matter experts to intuit or recognize, and yet nonlinearities are often a key to obtaining high model power.
Computational Bayesian sampling techniques (such as Gibbs sampling) have recently developed enough to make Bayesian inference (which has clear advantages over classical statistical methods that form the basis for much traditional industry expertise) a practical reality.[xii]
Ensemble methods based on techniques such as boosting[xiii], bagging[xiv], and Bayesian model averaging[xv] let a data scientist combine many (perhaps hundreds of) weak models into a single powerful ensemble model.
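The intuition behind combining weak models can be sketched with a short calculation. The Python below computes how often a simple majority vote is correct when each model is right with the same probability and the models err independently. Both of those are idealizing assumptions, and the function name is ours. Real boosting, bagging, and model averaging produce correlated models and weight them differently, so treat this as an illustration of the principle rather than an implementation of any of those methods.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent models,
    each correct with probability p, votes for the correct label."""
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Each hypothetical weak model is right only 60% of the time.
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions the ensemble's accuracy climbs steadily as models are added, even though no individual model improves.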
The result of the recent proliferation and broad publication of advanced analytical techniques, combined with the other trends this document cites, is that a data scientist can build a more powerful analytical model tailored to a specific organization’s exact place in the data landscape, including its customer base, market niche, cost structure, technologies, etc. Until recently the best a consultant could do was bring to bear a far less analytically powerful and far less specific model developed by a subject-matter expert such as a business-school professor. Models that account for your organization’s specifics will always be at least as powerful as canned, general-purpose models, and usually will be substantially better.
Computers continue to obey Moore’s law, which predicts that in a certain sense, computing power doubles every two years. Computer networks have also become faster by several orders of magnitude, and computer storage systems likewise have increased their capacity exponentially. The result of these trends is that even modestly sized businesses can afford to purchase or rent (in the cloud) enough computing power to run sophisticated predictive and prescriptive analytics, decision automation systems (DASs), and telematics systems in real time, to model very specific phenomena (such as the behavior of individual customers, motor vehicles, or stocks). These systems frequently far surpass in economic productivity the decisions of domain experts.
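As a rough piece of arithmetic, a doubling every two years compounds as follows. This sketch takes the two-year doubling period at face value as an idealized assumption; the function name is ours.

```python
def growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Multiplicative growth after `years` of steady doubling."""
    return 2.0 ** (years / doubling_period_years)


print(growth_factor(20))  # 1024.0
```

Two decades of such doubling amounts to roughly a thousandfold increase.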
For example, Amazon.com used to employ professional book reviewers, who wrote book recommendations for wide audiences of Amazon customers. Eventually Amazon developed a DAS that recommends books to every Amazon.com visitor, based on patterns of previous purchases, page visits, behavioral similarity to other visitors, etc. The DAS easily outperformed the reviewers, and now accounts for one third of all Amazon book sales. Netflix operates a similar DAS that accounts for 75% of new orders.[xvi] Likewise, fleet operators such as UPS and the City of St. Louis, Missouri use the combination of telematics and predictive analytics to predict and prevent mechanical failures on specific vehicles far more economically than is possible by relying on the general-purpose knowledge of expert human mechanics and vehicle inspectors.[xvii]
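The idea of recommending from patterns of previous purchases can be sketched in a few lines of Python. The example below counts how often pairs of titles appear in the same purchase history and suggests the titles most often bought alongside what a visitor already owns. The purchase histories, titles, and function names are hypothetical, and this is not a description of Amazon's or Netflix's actual systems, which are far more elaborate.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(histories):
    """Count how often each pair of items shares a purchase history."""
    co = defaultdict(Counter)
    for history in histories:
        for a, b in combinations(sorted(set(history)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def recommend(co, owned, k=3):
    """Rank unowned items by total co-occurrence with the owned items."""
    scores = Counter()
    for item in owned:
        for other, count in co.get(item, Counter()).items():
            if other not in owned:
                scores[other] += count
    return [item for item, _ in scores.most_common(k)]


# Hypothetical purchase histories, one list per customer.
histories = [
    ["dune", "foundation", "hyperion"],
    ["dune", "foundation"],
    ["dune", "neuromancer"],
    ["foundation", "hyperion"],
    ["cookbook", "gardening"],
]

co = build_cooccurrence(histories)
print(recommend(co, {"dune"}, k=1))  # ['foundation']
```

Nothing in the sketch knows anything about books. The recommendation comes entirely from case-specific behavioral data.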
In these and similar cases, the combination of highly case-specific data and models, powerful computing and telecommunications technology, and advanced analytical techniques lets data science far surpass domain experts possessing more general knowledge and data. Domain experts simply can’t collect, sort through, and analyze data as quickly and accurately as a state of the art model armed with case-specific data.
The final trend that favors analytical expertise over industry knowledge is exploding data availability. Organizations can now collect, download, or purchase an array of incredibly detailed historical datasets. The most notorious example is data describing the online behavior of individual consumers—but there are many others. For example, the computer scientist Oren Etzioni founded Farecast (now part of the Bing search engine), which scrapes the Web for various predictors of airline price data, and uses over 100 billion data points to recommend the optimal time to purchase an airline ticket. Many sentiment analysis tools now on the market likewise scrape the Web to learn the public’s attitudes about specific products, services, and organizations. Several firms sell sentiment-analysis results to stock traders; another sells motor-vehicle traffic data (for traffic near retail stores) to investors, who use it to predict sales levels (and hence stock prices) of the retailers.[xviii] FICO (the inventor of credit scores) recently created its Medication Adherence Score, which uses consumer behavior data to predict an individual patient’s propensity to adhere to a medication regimen. Data science consultants employed by Aviva used consumer lifestyle data to identify consumers at high risk for a variety of serious illnesses.[xix]
The competitive advantage of these and similar firms lies with their data collection and analysis expertise—not their familiarity with traditional industry models. Mosaic’s list of online data sources suggests how frequently a data scientist can enrich an organization’s internal data with detailed third-party data to create powerful, case-specific models of consumer, competitor, or market behaviors. Traditional industry expertise often fails to anticipate or account for the possibilities represented by these newly available third-party datasets.
We see that universal availability of industry literature, extremely powerful analytical techniques and hardware infrastructure, and exploding data availability combine to let a data scientist develop models having far more specificity and power than what traditional domain experts can achieve with their general-purpose domain knowledge. Here’s how the book Big Data puts it:
The biggest impact of big data will be that data-driven decisions are poised to augment or overrule human judgment. . . . The subject-area expert, the substantive specialist, will lose some of his or her luster compared with the [data scientist], who [is] unfettered by the old ways of doing things and let[s] the data speak. This new cadre will rely on [models] without prejudgments and prejudice, trust[ing] the aggregated data to reveal practical truths. We are seeing the waning of subject-matter experts’ influence in many areas. . . . The pioneers in big data often come from fields outside the domain where they make their mark. They are specialists in data analysis, artificial intelligence, mathematics, or statistics, and they apply those skills to specific industries. The winners of Kaggle competitions, the online platform for big-data projects, are typically new to the sector in which they produce successful results. . . . To be sure, subject-area experts won’t die out. But their supremacy will ebb. From now on, they must share the podium with the big-data geeks. . . . Expertise is like exactitude: appropriate for a small-data world where one never has enough information, or the right information, and thus has to rely on intuition and experience. . . . But when you are stuffed silly with data, you can tap that instead, and to greater effect. Thus those who can analyze big data may see past the superstitions and conventional thinking not because they’re smarter, but because they have the data. (141-143)
What matters in a data scientist, then, is not industry knowledge, but the ability to do data science. We remark elsewhere that in our experience, the practice of data science is broadly interdisciplinary. This is true both for the individual data scientist, and for the data science consulting team. A good data scientist
A good data science team does all of the above, and includes a diversity of analytical, business, and organizational expertise appropriate to a project’s customer and domain.
So, when you’re shopping for a data science consultant, look for the above qualities, rather than domain expertise. Regardless of the fields of application where a good data scientist has made a mark, her or his project work will demonstrate the above desiderata, translating them consistently into substantial returns on investment. Whether you’re selling books online, saving lives in the air, or discovering profitable linkages between protein molecules, once you translate the business problem into data, the rest of the exercise is mostly computation. And the data science design patterns are the same, regardless of where you apply them.
[i] J.H. Hammond, “Quick Response in the Apparel Industry,” Harvard Business School Note N9-690-038 (1990).
[ii] Jennifer Neville and David Jensen, “Relational Dependency Networks,” Introduction to Statistical Relational Learning, Lise Getoor and Ben Taskar, eds. (MIT Press, 2007), pp. 258-260.
[iii] See for example
(visited April 5, 2014).
[iv] Another prominent area of application where quantitative methods have for decades surpassed the performance of traditional subject-matter experts is securities trading. James Owen Weatherall, The Physics of Wall Street: a Brief History of Predicting the Unpredictable (Mariner Books, 2014).
[v] John Elder et al, Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications (Elsevier, 2012).
[vii] John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis (Cambridge, 2009).
[viii] Daphne Koller and Nir Friedman, Probabilistic Graphical Models: Principles and Techniques (MIT, 2009).
[ix] B.D. Ripley, Pattern Recognition and Neural Networks (Cambridge, 1997).
[x] James C. Spall, Introduction to Stochastic Search and Optimization (Wiley-Interscience, 2003).
[xi] Myles Hollander et al, Nonparametric Statistical Methods (Wiley, 2014).
[xii] William M. Bolstad, Understanding Computational Bayesian Statistics (Wiley, 2010).
[xiii] Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms (MIT, 2012).
[xiv] Zhi-Hua Zhou, Ensemble Methods: Foundations and Algorithms (Chapman & Hall/CRC, 2012).
[xv] Gerda Claeskens and Nils Lid Hjort, Model Selection and Model Averaging (Cambridge, 2010).
[xvi] Viktor Mayer-Schönberger and Kenneth Cukier, Big Data (Eamon Dolan, 2014), pp. 51-52.
[xvii] Big Data, pp. 59, 128; “Telematics Usage in the Fleet Industry Survey Results” (Donlen, 2012) (visited April 5, 2014).
[xviii] Big Data, pp. 92-93, 136.
[xix] Big Data, pp. 56-57.