# Data Science Design Pattern #5:

Combining Source Variables

*posted by Mosaic Data Science*

**Variable selection** is perhaps the most challenging activity in the data science lifecycle. The phrase is something of a misnomer, unless we recognize that mathematically speaking we’re selecting variables from the set of all possible variables—not just the raw source variables currently available from a given data source.[i] Among these possible variables are many combinations of source variables. When a combination of source variables turns out to be an important variable in its own right, we sometimes say that the source variables **interact**, or that one variable **mediates** another. We’ll coin the phrase **synthetic variable** to mean an independent variable that is a function of several source variables, regardless of the nature of the function.

There are several reasons to synthesize variables.

**Known or plausible relationship.** Often a synthetic variable has a natural interpretation. The product or ratio of two variables may express naturally the idea that one variable scales up or down the other’s effect. For example, body-mass index varies with one’s waist-to-height ratio. We discover such relationships through **interviews**, **literature review,** **data-source review** (including third-party datasets), and **data visualization**.[ii]

**Dimensionality reduction.** Especially when dealing with big data, we’d like to minimize the number of independent variables in a model, so we use a **dimensionality reduction** technique that combines many source variables into a few more powerful synthetic variables.[iii] Dimensionality reduction is common in the **bag of words**[iv] approach to text mining.

**Relationship discovery.** In some cases we don’t know beforehand which functional forms best fit the data. We want to cast a wide net, to maximize goodness of fit. For small sets of source variables in polynomial models of at most third order[v] (which may contain synthetic variables that are cross-product terms of the form x_{i}^{m}x_{j}^{n}), we can use standard goodness-of-fit tests in an **exhaustive search** of the possible model forms, and test the importance of cross-product terms within the best fitting models. When we have many source variables (perhaps after enriching our data with third-party datasets), or when we want to explore a broader set of possible functional forms, we may use exhaustive search, heuristic search, stochastic search, or MapReduce to synthesize variables while constructing and evaluating models.

This post underscores some important features of the variable-selection process generally (including variable synthesis):

**A model’s power derives from all of its input variables together.**One measures the*incremental*contribution (importance) of a single variable to the model, given that certain other variables are already in the model. One does not generally attempt to measure a variable’s importance in the abstract.[vi]

**A variable’s contribution also depends on the type of model.**The same variable may contribute more or less in an ordinary linear regression than in a piecewise-linear regression, for example.

**There is a great deal of interplay between variable selection and model selection.**While there are model-independent measures of variable importance, the safest course is to measure variable contribution within the context of a given model type.

**In business data science, one must ultimately measure variable importance in economic terms.**Seeing the optimization problem implicit in a business data science problem is critical to get the right measure of variable importance. Variable A may have less predictive power than variable B by some measure, but still have more influence on the business’ optimization criterion.

## Description

Here is the general pattern one employs in variable selection:

- Discover (or synthesize) a variable to test.
- Add the variable to the current collection of variables.
- Test the collection’s modeling power.
- If you’ve reached a stopping condition (the power is high enough, or you’re out of resources), stop.
- Otherwise, return to step 1.

This procedure is often embedded in a similar procedure that tests different types of models, measuring goodness of fit or degree of optimality for each tested model. When both procedures are complex enough to involve automated search, the searches are often combined in a single algorithm.[vii]

There are many best practices that help you perform each variable-selection step intelligently, so you complete it with a minimum of effort and time. The remainder of this post is devoted to illustrating some of these practices, emphasizing variable synthesis throughout.

## How it Works

**Bare knuckles discovery of variable interactions.** Clever algorithms and unlimited computing power have not yet eliminated the advantages of human judgment in variable selection. For one thing, *even automated variable-selection techniques are only useful when the right variables are in the source data to start with*. Several approaches remain important:

**Interviews**with subject-matter experts are an important source of conjectures about interactions and mediating variables. It can be useful to familiarize oneself with and apply the knowledge-extraction techniques used by artificial intelligence pioneers.[viii] For example, it’s often necessary to give an expert specific cases to work through, to elicit their knowledge of e.g. which variables matter at all, and which interact.

- Many of the business problems one encounters in data science have been the subject of years of management science research. A cursory
**literature review**is often enough for one to surmise quickly which synthetic variables researchers have identified. For example, if you want to model human longevity, you could survey over half a century of scientific research.[ix]

- Sometimes during a
**dataset review**, merely seeing several variables occur in the same dataset will suggest that the variables may interact. For example, while surveying the metadata in the Centers for Disease Control and Prevention’s Health Indicators Database[x], you might see high cholesterol and tobacco in the same list, and recall that both contribute to hypertension individually, so that they may interact to contribute much more together than either alone.[xi]

- A variety of
**data visualization**techniques such as three-dimensional scatterplots and geospatial maps can reveal or suggest important synthetic variables. For example, Figure 1 below is an NOAA geospatial graphic of all hurricanes passing within 65 nautical miles of Cape Hatteras, N.C. since 1990.[xii] The graphic immediately suggests that most hurricanes in the North Atlantic Basin turn right, and that they only turn to a certain heading (about northeast) before straightening out. This observation suggests the approach to variable synthesis described in #5 below.

**Figure 1: Hurricanes Approaching Cape Hatteras, N.C. Since 1990**

- Sometimes one’s intuition leans towards a given
**class of models**, and that class often features a specific type of synthetic variable. For example, one may have a discrete-time process {X_{i}}, and one may wish to predict X_{i}. In such cases it can be useful to synthesize S_{n}= ∑_{i=1,n}X_{i}, and then to compute p(X_{n+1}|S_{n}, X_{n}). Even when the original process {X_{i}} lacks the (one step) Markov property, the new process {(S_{i}, X_{i})} may well have it; and p(X_{n+1}|S_{n}, X_{n}) may predict X_{n+1}much more accurately than p(X_{n+1}|X_{n}). In the hurricane example above, we start with a sequence {X_{i}} of tracks of the form X_{i}= (lat_{i}, long_{i}, dir_{i}). We can enrich the 3-tuple with two synthetic variables: the change in bearing δ_{i}= dir_{i}– dir_{i-1}, and then S_{i}= ∑_{k=1,i}δ_{k}. We can see in the hurricane tracks in #4 above that p(dir_{i+1}|dir_{i}, δ_{i}) should let us predict hurricane bearing much better than p(dir_{i+1}|dir_{i}). One can similarly construct a count of the number of consecutive tracks at which a hurricane (or an airplane!) has turned in the same direction, to help predict whether it will continue to turn in that direction at the next track.[xiii]

**Computational discovery of variable interactions.** Searching for a good combination of source and synthetic variables (starting with a finite set of source variables and a finite number of ways to synthetize variables) is a **combinatorial optimization problem**.[xiv] There are several families of computational techniques for synthesizing variables during combinatorial search:

- The increasingly common accrual of
**high-dimensional datasets**(those having many variables) creates a variety of analytical challenges. For example, whereas traditional datasets have far more records than variables, for some high-dimensional (wide) datasets the opposite is true. Techniques such as spike-and-slab variable selection have been developed that can select important source variables from such datasets gracefully.[xv] Another way to manage high-dimensional data is to synthesize from it a small number of synthetic variables that capture most of the variation in the source data. This is**dimensionality reduction**.[xvi]**Principal components analysis**(PCA) is an important, frequently used dimensionality-reduction technique. It produces a set of uncorrelated variables called**principal components**. Each succeeding component accounts for the largest possible amount of remaining variation in the original data. As a result, the first few components often account for most of the source data’s variability, letting you use a model having far fewer variables than the originals.**Singular value decomposition**and**factor analysis**are related methods.

- When the set of possible synthetic variables is sufficiently small, or one’s computing resources are sufficiently large, one can evaluate all possible combinations of source and synthetic variables. Combinatorial explosion severely limits the utility of
**naïve exhaustive search**, especially when dealing with high-dimensional data. Tree algorithms such as**branch and bound**[xvii] can exhaust the possibilities far more efficiently than brute-force exhaustive search.

- A
**heuristic search algorithm**explores a subspace of the search space by following simple rules of thumb that (one hopes) lead to a near-optimal result.**Greedy algorithms**(which are stepwise optimal, rather than globally optimal) are an example. One greedy approach to variable selection is to measure individual source variable importance in isolation, and then include in the model all variables having sufficiently high individual importance.[xviii] One can apply the same technique to evaluating possible synthetic variables, if the computation does not take too long. This approach assumes that combinations of source variables having high individual importance capture all important interactions, which is a strong assumption that one should make with care. An alternative heuristic is**constructive search**, where one pieces together a complete solution iteratively.

- A
**stochastic (local) search**algorithm explores the space of possible combinations by generating random variables that determine how the algorithm explores the solution space. Simulated annealing and evolutionary algorithms are two important classes.[xix]In**simulated annealing**the algorithm starts with an arbitrary solution. At each iteration the algorithm transitions from the current solution to a neighboring solution with a probability that varies with three quantities: the quality of the current solution, the quality of the neighboring solution, and time. The probability decreases with time, all other things being equal, so that the algorithm initially explores many possible solutions and eventually settles on a good one.**Evolutionary programming**mimics various features of natural selection, including fitness, mutation, crossover, and inheritance. The algorithm maintains a population of candidate solutions at each iteration. The population evolves when the algorithm applies inheritance, mutation, and crossover stochastically, with more fit candidates selected randomly to breed the next generation of solutions. Frequently solutions have binary representations, to simplify the mechanics of the recombination operations. Other encodings are possible, and finding a suitable encoding is perhaps the most important part of building an evolutionary algorithm.[xx]

**MapReduce**is embarrassing parallelism[xxi] implemented on a computing cluster or grid. Typically a large dataset is divided across several servers. The master node “maps” the same analytical operation to each server, which performs the operation on the portion of the data stored there. Then, the master node “reduces” the output of all of the other servers’ analytical operations to a single overall solution. Apache Hadoop[xxii] implements MapReduce. One can use MapReduce to make most of the above automated techniques execute more quickly. For example, stochastic search has now been implemented in the MapReduce framework.[xxiii] Thus MapReduce is not an analytical technique in its own right, but a framework for distributing certain kinds of analytical loads.

## When to Use It

One synthesizes variables either to reduce dimensionality or to increase model power. Dimensionality reduction is appropriate when there is substantial correlation among some of the source variables. Otherwise, synthetic variables are useful when two or three variables significantly interact in ways a model cannot capture merely by including the variables separately.

## Example

A well-known early example of automated heuristic search that discovered a model containing e.g. cross-product terms is the series of Bacon algorithms developed by Pat Langley in a production-rule environment based on LISP.[xxiv] The Bacon algorithms used 16 heuristics, and rediscovered Ohm’s law, Newton’s Law (of gravitation), one of Kepler’s Laws (of planetary motion), Boyle’s Ideal Gas Law, Snell’s Law (in optics), and Black’s Law (of specific heat). All of these laws contain synthetic variables.

[i] Thus the careful data scientist routinely considers how they might profitably **enrich** the datasets available to them within their employer’s organization with third-party datasets. Our list of third-party data sources is a great place to start your search for third-party data.

[ii] A classic early treatment of visualization is John W. Tukey, *Exploratory Data Analysis* (Pearson, 1977). See also William S. Cleveland, *Visualizing Data* (AT&T, 1993) and Winston Chang, *R Graphics Cookbook* (O’Reilly, 2013). NIST’s online handbook of engineering statistics has a chapter on exploratory data analysis that includes visualization techniques: http://www.itl.nist.gov/div898/handbook/eda/eda.htm (visited March 19, 2014).

[iii] The rise of big data has motivated a great deal of recent research about dimensionality reduction. Here is a sample: John A. Lee and Michel Verleysen, *Nonlinear Dimensionality Reduction* (Springer, 2007); Michael Kirby, *Geometric Data Analysis: An Empirical Approach to Dimensionality Reduction and the Study of Patterns* (Wiley, 2001); Oliver Kramer, *Dimensionality Reduction with Unsupervised Nearest Neighbors* (Springer, 2013); Haiping Lu, *Multilinear Subspace Learning: Dimensionality Reduction of Multidimensional Data* (Chapman and Hall/CRC, 2013); B. Sarojini and N. Ramaraj, *Dimensionality Reduction Techniques in Medical Informatics: An Empirical Study on Feature Ranking Approaches* (Lambert, 2012); Liang Sun, Shuiwang Ji, and Jieping Ye, *Multi-Label Dimensionality Reduction* (Chapman and Hall/CRC, 2014); Jianzhong Wang, *Geometric Structure of High-Dimensional Data and Dimensionality Reduction* (Springer, 2012).

[iv] We’ll return to bag of words text mining in a separate post.

[v] Polynomial models of higher than third order are rarely used. John Neter, William Wasserman, and Michael H. Kutner, *Applied Linear Statistical Models*, 3^{rd} Ed. (Irwin 1990), p. 318.

[vi] There are model-independent measures of relative variable importance that one can apply to collections of variables, notably Akaike’s Information Criterion and its analogs. See Gerda Claeskens and Nils Lid Hjort, *Model Selection and Model Averaging* (Cambridge University Press, 2010). We will review variable-importance measures in a separate post.

[vii] A separate post will treat assessing classification and prediction model fit, and optimization model performance.

[viii] See for example Robert R. Hoffman, “The Problem of Extracting the Knowledge of Experts from the Perspective of Experimental Psychology,” *AI Magazine* Vol. 8 No. 2 (AAAI, 1987), pp. 53-67, available at http://aaaipress.org/ojs/index.php/aimagazine/article/viewFile/583/519 (visited March 19, 2014); and Judith Reitman and Henry H. Rueter, “Extracting Expertise from Experts: Methods for Knowledge Acquisition.” *Expert Systems* 4(3) (1987),

pp. 152-168.

[ix] Such as Jacob Yerushalmy, “Factors in Human Longevity,” *American Journal of Public Health Nations Health* Vol. 53 No. 2 (Feb. 1962), pp. 148-162, available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1253894/ (visited March 20, 2014).

[x] http://www.healthindicators.gov/Indicators/ (visited March 20, 2014).

[xi] Naturally the research literature could lead you to the same conclusion. J. Hata et/al, “Combined Effects of Smoking and Hypercholesterolemia on the Risk of Stroke and Coronary Heart Disease in Japanese: the Hisayama Study,” *Cerebrovascular Diseases* Vol. 31 No. 5 (2011), pp. 477-484, available at http://www.ncbi.nlm.nih.gov/pubmed/21358199 (visited March 20, 2014).

[xii] http://www.noaanews.noaa.gov/stories2010/20100930_hurricanetrack.html (visited March 20, 2014).

[xiii] When the count merely determines whether something is turning at all, one has implemented a variant of the Wald-Wolfowitz runs test. http://en.wikipedia.org/wiki/Wald%E2%80%93Wolfowitz_runs_test (visited March 20, 2014).

[xiv] http://www.cs.ubc.ca/labs/beta/Courses/CPSC532D-05/Slides/ch1-slides-2p.pdf (visited March 20, 2014).

[xv] Hemant Ishwaran and J. Sunii Rao, “Spike and Slab Variable Selection: Frequentist and Bayesian Strategies,” *The Annals of Statistics* Vo. 33 No. 2(2005), pp. 730-773, available at http://arxiv.org/pdf/math/0505633.pdf. See also http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Ishwaran~et~al.pdf and http://cran.r-project.org/web/packages/spikeslab/spikeslab.pdf (visited March 20, 2014). We’ll return to variable-selection algorithms in a future post.

[xvi] The academic community has published many references on the general subject. Here are a few: http://www.math.uwaterloo.ca/~aghodsib/courses/f06stat890/readings/tutorial_stat890.pdf, http://www.public.asu.edu/~huanliu/papers/dm07.pdf,

http://www.cc.gatech.edu/~isbell/tutorials/dimred-survey.pdf,

http://www.cs.binghamton.edu/~lyu/SDM07/DR-SDM07.pdf (visited March 2, 2014).

[xvii] http://en.wikipedia.org/wiki/Branch_and_bound, http://www.stanford.edu/class/ee364b/lectures/bb_slides.pdf (visited March 2, 2014).

[xviii] This is essentially what Google did in its flu model. Jeremy Ginsberg et/al, “Detecting Influenza Epidemics Using Search Engine Query Data,” *Nature* Vol. 457 (Feb. 2009), available at http://static.googleusercontent.com/media/research.google.com/en/us/archive/papers/detecting-influenza-epidemics.pdf (visited March 17, 2014).

[xix] See http://en.wikipedia.org/wiki/Stochastic_optimization#Randomized_search_methods for a longer list of randomized-search strategies (visited March 20, 2014). See also James C. Spall, *Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control* (Wiley, 2005).

[xx] http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/05ga/05ga.html (visited March 20, 2014).

[xxi] http://en.wikipedia.org/wiki/Embarrassingly_parallel (visited March 20, 2014).

[xxii] http://hadoop.apache.org (visited March 20, 2014).

[xxiii] Xin Du et/al, “High Performance Parallel Evolutionary Algorithm Model Based on MapReduce Framework,” *International Journal of Computer Applications in Technology* Vol. 46 No. 3 (March, 2013), pp. 290-295, available at http://dl.acm.org/citation.cfm?id=2463891; Pedro Fazenda et/al, “A Library to Run Evolutionary Algorithms in the Cloud Using MapReduce,” *Applications of Evolutionary Computation, Lecture Notes in Computer Science* Vol. 7248 (Springer 2012), pp. 416-425, available at http://link.springer.com/export-citation/chapter/10.1007/978-3-642-29178-4_42 (visited March 20, 2014).

[xxiv] Pat Langley et/al, “The Bacon Programs,” *Scientific Discovery: Computational Explorations of the Creative Processes* (MIT Press, 1987), pp. 65-194. Computational model discovery remains an area of active research. See Saso Dzeroski and Ljupco Todorovski, eds., *Computational Discovery of Scientific Knowledge: Introduction, Techniques, and Applications in Environmental and Life Sciences* (Springer, 2007).

## Leave a comment