# Data Science Design Patterns

A data science design pattern is much like a software design pattern or enterprise-architecture design pattern: a reusable computational pattern that applies to a set of data science problems sharing a common structure, and that represents a best practice for handling such problems. This page lists our data science design pattern blog posts, most recent first.

Data science design patterns generally mix several computational techniques. Our posts survey the techniques and give you pointers into the literature, so you can study a design pattern thoroughly before applying it. In some cases whole books have been written about a single design pattern. For example, there are several books on kernel smoothing (pattern #2). As a result, our posts sometimes cover a lot of ground (including a lot of jargon) in a small space. Likewise, the examples we present are often somewhat contrived, to present technical ideas in their most basic form.

We encourage you to read the endnotes and follow their hyperlinks (when available). The notes often reference works containing more complex, graphically illustrated examples. Likewise, the real-world examples we recommend at the end of each post give you a feel for the kinds of problems a pattern applies to, and for the technicalities that arise in applying it. Take the time to review these examples carefully.

Please send your comments, questions, suggestions, and requests to designPatterns@mosaicdatascience.com. We're eager to make these design patterns as useful as we can, so we'll work hard to account for your feedback.

Thanks for visiting Mosaic Data Science!

Data Science Design Pattern #5: Combining Source Variables

posted by Mosaic Data Science

Variable selection is perhaps the most challenging activity in the data science lifecycle. The phrase is something of a misnomer, unless we recognize that mathematically speaking we're selecting variables from the set of all possible variables, not just the raw source variables currently available from a given data source.[i] Among these possible variables are many combinations of source variables. When a combination of source variables turns out to be an important variable in its own right, we sometimes say that the source variables interact, or that one variable mediates another. We'll coin the phrase synthetic variable to mean an independent variable that is a function of several source variables, regardless of the nature of the function.
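As a minimal sketch of the idea, the snippet below builds a synthetic variable by hand from two source variables. The column names (temperature, humidity), the sample values, and the choice of a product as the combining function are all illustrative assumptions, not taken from the post.

```python
# Two records with two raw source variables each (made-up data).
rows = [
    {"temperature": 30.0, "humidity": 0.40},
    {"temperature": 22.0, "humidity": 0.85},
]

for row in rows:
    # The product of two source variables is one common synthetic variable;
    # if it predicts the response well, we say the source variables interact.
    row["temp_x_humidity"] = row["temperature"] * row["humidity"]
```

A model can then treat `temp_x_humidity` as a candidate independent variable in its own right, alongside the raw sources.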

Data Science Design Pattern #4: Transformations of Individual Variables

posted by Mosaic Data Science

It's very common while exploring data to experiment with transformations of individual variables. Some transformations rescale while preserving order; others change both scale and order. In this post we describe some common ways to transform individual variables, and explore how doing so may benefit an analysis. (We'll tackle transformations of multiple variables in another post.)
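The distinction between the two kinds of transformation can be sketched in a few lines; the data below are made up for illustration.

```python
import math

# Order-preserving rescaling: the log transform compresses large values
# but keeps the ranking of observations intact (values must be positive).
values = [0.5, 2.0, 8.0, 32.0]
logged = [math.log(v) for v in values]

# A transform that changes order as well as scale: squaring a signed
# variable reorders observations whose signs differ.
signed = [-3.0, -1.0, 2.0]
squared = [v * v for v in signed]  # the most negative value becomes largest
```

Here `logged` preserves the ordering of `values`, while `squared` does not preserve the ordering of `signed`.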

Data Science Design Pattern #3: Handling Null Values

posted by Mosaic Data Science

Most data science algorithms do not tolerate nulls (missing values). So, one must do something to eliminate them, before or while analyzing a data set.
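Two of the simplest elimination strategies can be sketched as follows; the data and the use of `None` as the null marker are illustrative assumptions.

```python
# A small series with missing values (made-up data).
data = [4.0, None, 6.0, None, 5.0]

# Strategy 1: drop records whose value is missing.
observed = [x for x in data if x is not None]

# Strategy 2: impute each null with the mean of the observed values,
# preserving the length of the series.
mean = sum(observed) / len(observed)
imputed = [mean if x is None else x for x in data]
```

Dropping shrinks the data set; imputing keeps every record but introduces estimated values, a trade-off the post's pattern addresses.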

Data Science Design Pattern #2: Variable-Width Kernel Smoothing

posted by Mosaic Data Science

A fundamental problem in applied statistics is estimating a probability mass function (PMF) or probability density function (PDF) from a set of independent, identically distributed observations. When one is reasonably confident that a PMF or PDF belongs to a family of distributions having closed form, one can estimate the form's parameters using frequentist techniques such as maximum likelihood estimation, or Bayesian techniques such as acceptance-rejection sampling.
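For the closed-form case, here is a minimal sketch of maximum likelihood estimation for one such family, the Gaussian; the observations are made up for illustration, and the Gaussian choice is an assumption, not the post's subject (the post concerns the harder case where no closed form is assumed).

```python
import math

# Observations assumed to be i.i.d. draws from a Gaussian (made-up data).
obs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Maximum likelihood estimates for a Gaussian: the sample mean, and the
# (biased) mean squared deviation for the variance.
mu = sum(obs) / len(obs)
var = sum((x - mu) ** 2 for x in obs) / len(obs)

def pdf(x):
    """Evaluate the fitted Gaussian density at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```

When no closed-form family fits, one instead turns to nonparametric estimators such as the kernel smoothers this post describes.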

posted by Mosaic Data Science

This is the first in a series of technical blog posts describing design patterns useful in constructing data science models, including decision-support and decision-automation systems. We hope that this blog will become a clearinghouse within the data science community for these design patterns, thereby extending the design-pattern tradition in software development and enterprise architecture to data science.