posted by Mosaic Data Science
Data scientists are a scarce commodity, and are likely to remain so for years to come.[i] At the same time, data science can create a substantial competitive advantage for early adopters who make the best use of their scarce data-science resources.
The benefits of data-driven decision making [DDD] have been demonstrated conclusively. Economist Erik Brynjolfsson and his colleagues from MIT and Penn’s Wharton School conducted a study of how DDD affects firm performance (Brynjolfsson, Hitt, & Kim, 2011). They developed a measure of DDD that rates firms as to how strongly they use data to make decisions across the company. They show that statistically, the more data-driven a firm is, the more productive it is. . . . One standard deviation higher on the DDD scale is associated with a 4%-6% increase in productivity. DDD also is correlated with higher return on assets, return on equity, asset utilization, and market value, and the relationship seems to be causal.[ii]
This paper explains how management can make the most of data-science resources and opportunities, to achieve data-driven competitive advantage.
Data-science projects are investments. Like any other investments, some are more rewarding than others. And like other investments, they generally come in portfolios. You shouldn’t pick data-science projects independently, because they often have common requirements for data, skills, and infrastructure that the organization currently lacks. Careful gap analysis may reveal opportunities to enable several projects by filling a common set of gaps, thereby making the projects as a group more attractive than they would appear individually (where the cost of each shared gap is counted once per project). Thus choosing an optimal portfolio of data-science projects is itself an optimization problem, a kind of data-science problem. In part because an organization’s goals (objective functions) and resources (constraints) change frequently, this particular optimization problem is a good candidate for a greedy multi-period solution, where you attack the most valuable projects first, and re-evaluate each time a project nears completion.
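The greedy, re-evaluating selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a recommended planning tool: the project names, values, costs, gap names, and the functions gap_cost, marginal_net_value, and greedy_portfolio are all hypothetical. The key idea is that a shared gap is paid for only once, by the first project that needs it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    name: str
    value: float      # expected benefit if delivered (hypothetical units)
    own_cost: float   # cost specific to this project
    gaps: frozenset   # shared data/skill/infrastructure gaps it depends on


def gap_cost(project, gap_costs, filled):
    """Cost of the gaps this project needs that nobody has filled yet."""
    return sum(gap_costs[g] for g in project.gaps - filled)


def marginal_net_value(project, gap_costs, filled):
    """Net value of adding the project, given the gaps already filled."""
    return project.value - project.own_cost - gap_cost(project, gap_costs, filled)


def greedy_portfolio(projects, gap_costs, budget):
    """Pick the best affordable project, then re-evaluate the rest."""
    chosen, filled, remaining = [], set(), list(projects)
    while remaining:
        affordable = [p for p in remaining
                      if p.own_cost + gap_cost(p, gap_costs, filled) <= budget]
        if not affordable:
            break
        best = max(affordable,
                   key=lambda p: marginal_net_value(p, gap_costs, filled))
        if marginal_net_value(best, gap_costs, filled) <= 0:
            break  # nothing left that adds value
        budget -= best.own_cost + gap_cost(best, gap_costs, filled)
        filled |= best.gaps          # later projects get these gaps for free
        chosen.append(best.name)
        remaining.remove(best)
    return chosen


# Hypothetical example data.
GAP_COSTS = {"data_lake": 50, "nlp_skills": 30}
PROJECTS = [
    Project("churn", 80, 20, frozenset({"data_lake"})),
    Project("pricing", 70, 15, frozenset({"data_lake"})),
    Project("sentiment", 60, 10, frozenset({"data_lake", "nlp_skills"})),
]
```

Evaluated alone, the hypothetical "sentiment" project has a net value of -30, because it is charged for both gaps. Once "churn" has paid for the shared data gap, "sentiment" becomes worth doing. A purely greedy rule has a known weakness: it can miss a group of projects that are each unattractive alone but attractive together, so a real analysis may need to evaluate bundles as well.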
Opportunity cost is a second reason to treat data science as a portfolio-optimization problem. When data scientists approach data-science projects one at a time, it’s all too easy for them to indulge their analytical perfectionism by devoting disproportionate resources to achieving the best possible project outcome. This may maximize one project’s value at the cost of depriving other projects of adequate resources, so that the total value of the project portfolio is not maximized, even if the portfolio is well chosen. Analytics managers should be explicit with their teams that the goal is not to perfect any single project, but to maximize the value of the whole project portfolio; and management should structure their teams’ reward systems accordingly.
Remarkably, data scientists often struggle to recognize that every data-science project is an optimization problem.[iii] In business the objective function is usually expected value, but sometimes expected value is and should be traded away to avoid the possibility of catastrophic losses. We have already noted one way such losses can occur: devoting so many resources to one project that another valuable project is delayed or never implemented. Resource exhaustion can also occur within a single project. In such cases a theoretically optimal solution requires more resources than the organization is willing or able to provide. If this fact is not discovered until the project is well underway, a good outcome may be out of reach. Framing a project in optimization terms underscores these issues and helps data scientists and their managers agree when a project is complete (or should be abandoned).
Let’s consider two examples. Data-science problems often involve classification, and data scientists routinely describe such problems as classification problems. Classifying customers by degree of risk of churn is an example. The purpose of this classification is not to avoid all churn (in fact we’d probably like to rid ourselves of unprofitable customers!), but to extend our relationships with profitable customers for as long as possible, or (more directly) to maximize the expected value of each customer relationship. When the problem is framed in optimization terms, it’s much easier to recognize that we may want to adjust the classification model so that the decision we premise on the classification achieves our business goals as much as possible.
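To make the churn example concrete, here is a minimal Python sketch of a decision rule driven by expected value rather than by churn risk alone. Everything in it is an assumption made for illustration: the function names, the idea of a fixed-cost retention offer, the save_rate parameter (the chance that the offer retains a customer who would otherwise leave), and all the numbers.

```python
def expected_gain_from_offer(p_churn, customer_value, offer_cost, save_rate):
    """Expected net gain from making a retention offer to one customer.

    p_churn        -- churn probability from the classification model
    customer_value -- future value of the relationship if it continues
    offer_cost     -- cost of the offer, paid whether or not it was needed
    save_rate      -- assumed chance the offer retains a would-be churner
    """
    return p_churn * save_rate * customer_value - offer_cost


def should_offer(p_churn, customer_value, offer_cost, save_rate):
    """Make the offer only when it is expected to pay for itself."""
    return expected_gain_from_offer(
        p_churn, customer_value, offer_cost, save_rate) > 0


# Hypothetical customers: (churn probability, relationship value).
moderate_risk_high_value = (0.3, 1000)
high_risk_low_value = (0.6, 100)
very_high_risk_unprofitable = (0.9, -50)
```

With an offer cost of 40 and a save rate of 0.25, this rule makes the offer to the moderate-risk, high-value customer and skips both higher-risk customers, including the unprofitable one. A rule that simply targets the highest churn scores would do the opposite.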
Data-science problems can also involve prediction. For example, rail carriers are deploying sensors that detect abnormally hot axle bearings (so-called “hot boxes”)[iv], and using the data to predict axle failures. The practical uses of these predictions include
Each of these uses has its own costs and benefits. The overall business goal is to make all decisions that rest on the prediction model in a way that maximizes net expected benefit, perhaps subject to avoiding catastrophic losses associated with axle failures in transit.
It’s important to realize that the goal in the second example is not to achieve the highest possible prediction accuracy. There is no such thing. Rather, there are several possible measures of accuracy, and none of them is inherently preeminent, except as a matter of tradition (which favors the ordinary least squares accuracy criterion). Gauss, who is widely credited with developing the method of least squares, admitted that he chose this criterion simply because he found it convenient mathematically. Otherwise, he said his choice was “totally arbitrary.”[v] Where (as here) underestimation has disproportionate adverse consequences, it’s possible to use an asymmetrical penalty function in the prediction algorithm, so that the algorithm penalizes underestimation according to its relative cost. Regardless of how the concept of accuracy is formalized in the math, however, the goal is not to maximize accuracy but to achieve the best possible practical and economic outcomes, within the context of the resource constraints imposed by the larger portfolio-optimization problem.
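A small Python sketch shows how an asymmetrical penalty changes what counts as the “best” prediction. It uses one simple choice of penalty, a linear cost per unit of error with different rates for under- and overestimation; other asymmetrical forms are possible. The readings, the 5-to-1 cost ratio, and the function names are hypothetical, and the prediction is reduced to a single constant so that the effect is easy to see.

```python
def asymmetric_loss(actual, predicted, under_cost, over_cost):
    """Linear penalty that charges different rates for each error direction."""
    error = actual - predicted
    if error > 0:                  # prediction was too low
        return under_cost * error
    return over_cost * -error      # prediction was too high (or exact)


def best_constant_prediction(observations, under_cost, over_cost):
    """Constant prediction with the lowest total penalty on the sample.

    For a piecewise-linear penalty the optimum lies at one of the observed
    values, so it is enough to try each of them.
    """
    def total_penalty(candidate):
        return sum(asymmetric_loss(a, candidate, under_cost, over_cost)
                   for a in observations)
    return min(sorted(set(observations)), key=total_penalty)


# Hypothetical readings with a long upper tail.
READINGS = [60, 62, 65, 70, 71, 75, 80, 90, 120]

symmetric_choice = best_constant_prediction(READINGS, under_cost=1, over_cost=1)
cautious_choice = best_constant_prediction(READINGS, under_cost=5, over_cost=1)
```

With equal costs the best constant is the sample median (71). When underestimation is assumed to cost five times as much as overestimation, the best constant moves up to 90. Neither answer is more “accurate” in the abstract; each is optimal for its own penalty function.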
We are accustomed to the idea that a scientific law is valid without regard for time or place. In practice, this idea can become an unchecked assumption, rather than a carefully validated conclusion. Inexperienced or overworked data scientists may discover and exploit a useful relationship without adequately validating it, perhaps using a model form that does not penalize model complexity and therefore does not guard against overfitting. The result can be that the model works well initially, leading management to devote considerable resources to implementing the model. Later the fully implemented model fails to generalize, delivering far less value than expected. (The Google Flu Trends model is a good example.)
This possibility creates several imperatives. Make sure your data-science team
The interesting question is what “appropriately” means in each case.
The danger that a model fails to generalize is particularly acute when the model uses social-science input variables. Such variables are risky in two ways. First, they can be poorly operationalized, so that over time what they measure drifts. Second, the relationship between the input variables and the target variable can change. For example, consumer goods such as clothing can enter the market as superior or normal goods, but become inferior goods as they go out of style. As a result, the relationship between income level and quantity purchased can reverse direction over time. A pricing model built on the assumption that a product is a normal good can maximize revenue until the product becomes an inferior good, at which time the model would fail.
To make sure your social-science models deliver value for as long as you expect, you (or your data scientists) need to know your data. Make sure input variables within your control are properly and consistently operationalized, so that what they measure is what you expect, and does not change over time. When you don’t control how input variables are operationalized (for example, if you acquire input variables from a third party), and the third party does not demonstrate consistent, scientifically sound operationalization, you should monitor variable validity (see below).
More fundamentally, your data-science team should study how much peer-reviewed social science exists linking your input variables to your target variables, and how conclusive this science is. When little real science has been done, or when the science is not conclusive, there is a real chance that the input variables’ relationship with the target variables will change. Again, in such cases you should not take model durability for granted, and should instead make sure your production model guards against possible changes in the relationship between input variables and target variables.
In the process of developing a data-science model, data scientists measure input data and model quality in various ways. For example, they fit input-variable data to parametric statistical distributions, measure classification models’ error rates, and fit prediction models’ residuals to parametric distributions. When the prototype is implemented in a production model, these measures become assumptions underlying the production model. The production model should periodically re-compute the same measures to verify that the model still fits its inputs, when there is a significant chance that the assumptions may fail over time. Where a model’s form will not change, but its parameters may change, it can be appropriate to have the model re-compute its parameters periodically, to prevent the model’s going stale.
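As a rough illustration of such periodic checks, the Python sketch below compares recent production data with the measures recorded when the prototype was built. It is a deliberately simple sketch under stated assumptions: it checks only the mean of one input variable and the overall error rate, the thresholds z_limit and error_tolerance are arbitrary, and all names and data are hypothetical. A real monitor would re-compute whichever measures the data scientists actually relied on during development.

```python
import math
import statistics


def mean_shift_z(baseline, recent):
    """z-score of the recent sample mean against the baseline sample."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    standard_error = sigma / math.sqrt(len(recent))
    return (statistics.mean(recent) - mu) / standard_error


def check_assumptions(baseline_inputs, recent_inputs,
                      baseline_error_rate, recent_errors,
                      z_limit=3.0, error_tolerance=0.05):
    """Return warnings for assumptions that no longer appear to hold.

    recent_errors is a list of 0/1 flags, one per recent prediction,
    where 1 marks a misclassification.
    """
    warnings = []
    if abs(mean_shift_z(baseline_inputs, recent_inputs)) > z_limit:
        warnings.append("input drift")
    recent_error_rate = sum(recent_errors) / len(recent_errors)
    if recent_error_rate > baseline_error_rate + error_tolerance:
        warnings.append("error rate")
    return warnings


# Hypothetical data recorded at development time and in production.
BASELINE_INPUTS = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
STABLE_INPUTS = [10, 9, 11, 10, 10]
DRIFTED_INPUTS = [13, 14, 12, 13, 13]
STABLE_ERRORS = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
DEGRADED_ERRORS = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
```

Calling check_assumptions with the stable data returns an empty list, while the drifted inputs and degraded errors trigger both warnings. What happens next (an alert, a parameter refit, or a return to model development) is a design decision for the team.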
Data scientists often report that they spend most of their time gathering and preparing data rather than analyzing it and building models. When data scientists are scarce, the tasks that only they can perform (analyzing data and building models) are likely to be the bottlenecks in your data-science development and delivery processes. IT personnel with extract-transform-load (ETL) skills can take over much of the data gathering and preparation, freeing data scientists for that bottleneck work. It’s important, however, to have data scientists specify certain steps in the ETL process, attending particularly to how null, invalid, and outlier values are handled. The technical requirements of data science in these areas differ from those in traditional business intelligence (BI). For example, in BI outliers may be treated as invalid values because an ETL engineer does not understand the role of outliers in statistical analysis and has made unfounded assumptions about the range of valid values.
As we observed above, accuracy is always relative to one’s choice of penalty function. There are two basic ways to account for the cost of inaccuracy when choosing a fitness criterion.
For example, in a customer-churn analysis, we can use a classification algorithm that minimizes its overall frequency of misclassification (weighing all errors equally while building the classification model). We would then measure the frequency of each kind of misclassification under the resulting model, and compute the expected value of a marketing campaign based on the model. Instead, we could use a classification algorithm that weighs each error according to its cost, and minimizes cost in constructing the classifier. It is still possible to compute error rates for the second option. Some of the error rates may be higher than in the first option, even though the second option may produce better economic outcomes.
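The contrast between the two options can be illustrated with a short Python sketch. As a simplification, it does not train two different classifiers. Instead it takes one set of hypothetical churn scores and chooses the decision threshold twice: once to minimize the number of errors, and once to minimize their cost. The data, the assumed 10-to-1 cost ratio between a missed churner and a wasted offer, and the function names are all invented for the example.

```python
def confusion(data, threshold):
    """Count false positives and false negatives at a given threshold.

    data is a list of (score, churned) pairs; a customer is flagged as a
    likely churner when score >= threshold.
    """
    false_pos = sum(1 for score, churned in data
                    if score >= threshold and not churned)
    false_neg = sum(1 for score, churned in data
                    if score < threshold and churned)
    return false_pos, false_neg


def best_threshold(data, fp_cost, fn_cost):
    """Threshold with the lowest total misclassification cost."""
    # Try every observed score, plus "flag nobody".
    candidates = sorted({score for score, _ in data}) + [float("inf")]

    def total_cost(threshold):
        false_pos, false_neg = confusion(data, threshold)
        return false_pos * fp_cost + false_neg * fn_cost

    return min(candidates, key=total_cost)


# Hypothetical (score, churned) pairs from a validation set.
SCORED = [(0.1, 0), (0.2, 0), (0.3, 1), (0.35, 0), (0.4, 0),
          (0.45, 1), (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]

equal_weight_threshold = best_threshold(SCORED, fp_cost=1, fn_cost=1)
cost_weighted_threshold = best_threshold(SCORED, fp_cost=1, fn_cost=10)
```

On this data the equal-weight threshold (0.45) makes one error out of ten, but that error is a missed churner, for a cost of 10. The cost-weighted threshold (0.3) makes two errors, both wasted offers, for a cost of 2. The error rate doubles while the cost falls by a factor of five, which is the pattern described above.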
A second sense in which management should account for the cost of inaccuracy involves projecting the additional benefit that would accrue from having a more powerful model. Additional model power can be had by adding useful input variables to the model, or by improving model form.
You are probably familiar with the activities and costs of additional input variables: acquisition, cleansing, metadata generation, and storage. Additional information can improve model power. The interesting question is when the tradeoff is favorable. Techniques exist for computing the expected value of additional information. See Mosaic Data Science’s white paper “The Value of Information in the Age of Big Data” for an example value-of-information computation in a case where the decision maker can spend a certain amount of money to eliminate her or his uncertainty about an input variable’s value. The key point is to realize that you can often be objective about whether acquiring additional input data is worth the expense.
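For readers who have not seen a value-of-information computation, here is a minimal Python sketch of the simplest case: the expected value of perfect information about one uncertain state. It is not taken from the white paper mentioned above. The decision (launch a campaign or not), the payoffs, the probabilities, and the function names are hypothetical.

```python
def expected_value(action_payoffs, probabilities):
    """Expected payoff of one action over the uncertain states."""
    return sum(probabilities[state] * payoff
               for state, payoff in action_payoffs.items())


def value_of_perfect_information(payoffs, probabilities):
    """How much better we expect to do if the state is known in advance.

    payoffs maps each action to a {state: payoff} dictionary.
    """
    # Without the information: commit to the single best action.
    best_without = max(expected_value(action_payoffs, probabilities)
                       for action_payoffs in payoffs.values())
    # With the information: pick the best action separately in each state.
    best_with = sum(
        prob * max(action_payoffs[state] for action_payoffs in payoffs.values())
        for state, prob in probabilities.items())
    return best_with - best_without


# Hypothetical decision: payoffs in thousands of dollars.
PAYOFFS = {
    "launch": {"high_demand": 100, "low_demand": -50},
    "hold": {"high_demand": 0, "low_demand": 0},
}
PROBABILITIES = {"high_demand": 0.4, "low_demand": 0.6}

information_value = value_of_perfect_information(PAYOFFS, PROBABILITIES)
```

In this example launching is worth 10 in expectation, so without further information we would launch. Knowing the demand level in advance would raise the expected payoff to 40, so eliminating the uncertainty is worth up to 30. If the data that resolves the uncertainty costs more than that, it is not worth acquiring. Real input data rarely removes uncertainty completely, so this figure is an upper bound on what the data could be worth.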
Delay in delivering a production model usually has at least one of three effects. First, it delays realizing the model’s expected benefits. Second, it gives the data-science team time to improve model power. Third, it gives the data-science team time to deliver other models. One must balance the first (a cost) against the second and third (benefits) to decide whether delayed delivery is appropriate. Remember that it is always possible to improve a model’s power after delivering the first version of the model. This makes an agile approach to data-science model development and delivery extremely useful. Don’t let the perfect be the enemy of the good.
One of the sometimes hidden consequences of delay is loss of credibility with stakeholders. This can have far-reaching consequences for the amount of support the organization provides to the data-science function. It can be especially useful for an inexperienced data-science function to “start small” and gain credibility (as well as experience) quickly, to ensure future data-science projects are well supported.
Data science has a long, highly technical learning curve. An inexperienced team with limited background has two ways to shorten its learning curve. The first is training. For example, Coursera offers a series of online classes designed to train data scientists in specific computational techniques and tools. Mosaic also offers training, notably in data science design patterns. Design patterns are architectural patterns that expert data scientists have learned from experience to employ routinely in assembling solutions to large-scale data-science problems. Often design patterns are poorly documented, and are not taught at all in traditional technique-oriented courses. This is unfortunate, because learning design patterns is a legitimate shortcut towards learning to think like an expert.
The second way to shorten a team’s learning curve is to supplement the team’s expertise temporarily with data-science experts. Expert coaching can help a team learn hands-on, lowering the risk of early failures that could compromise the team’s credibility, and accelerating delivery. Given that many data-science projects deliver an order-of-magnitude return on investment when properly executed, hiring outside coaching can pay for itself many times over. Mosaic’s data scientists have at least a decade of industry experience, much or all of it in mixed teams. We can deliver training and coaching activities that will move your team up the learning curve quickly, gaining credibility and delivering substantial results in the process.
[i] Paul Barth and Randy Bean, “There’s no Panacea for the Big Data Talent Gap,” Harvard Business Review HBR Blog Network (November 29, 2012), http://blogs.hbr.org/2012/11/the-big-data-talent-gap-no-pan/.
[ii] Foster Provost and Tom Fawcett, Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking (O’Reilly 2013), pp. 5-6.
[iii] We are not the first to remark on this curiosity. See e.g. Provost and Fawcett, ibid., p. 88. The authors assert that optimization is “fundamental” to data science, and report that data scientists overlook the necessity of optimization “surprisingly . . . often.”
[v] H.T. Nguyen and G.S. Rogers, Fundamentals of Mathematical Statistics, Volume II: Statistical Inference (Springer-Verlag 1989), pp. 296-297.