# Latest Blogs

Root Cause Analysis of Telemetry Failures
posted by Mosaic Data Science

When executives at a large management consulting firm noticed that their Microsoft Office applications sometimes took upwards of 10 seconds to load, the firm’s IT department knew it had a problem. IT professionals suspected that one or more custom add-ins (e.g., macros with brand-consistent templates) might be to blame. Mosaic Data Science, a leading data science consulting firm, was brought in to investigate two questions: were particular versions of add-ins leading to longer-than-normal load times; and were particular computers experiencing long load times more often than others?

Musings on Deep Learning
posted by Mosaic Data Science

For those folks not breathlessly tracking the latest developments in RNNs, CNNs, LSTMs, and TGIFs (just kidding), I’ll start with a quick overview of the topic. Deep learning models are a subclass of artificial neural network (ANN) models. ANNs are mathematical models, meaning simply that quantitative data goes in one side and a quantitative result comes out the other. As the name implies, the structure of ANNs takes inspiration from the structure of biological brains. The “neurons” in a neural network are simple mathematical functions (linear, step, sigmoid, etc.), called “activation functions.” But when you connect many of these simple functions, using the outputs of one set of functions as the inputs to the next set (the “network”), you can begin to represent some very complex functions.
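As a minimal illustration of that composition idea, here is a sketch of a tiny two-layer network of step-function neurons computing XOR, something no single step neuron can do. The weights and thresholds are hand-picked for the example, not learned:

```python
def step(z):
    """Step activation: the neuron fires (1) when its weighted input is positive."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden layer: two neurons with hand-picked weights and thresholds.
    h1 = step(x1 + x2 - 0.5)   # fires when at least one input is on
    h2 = step(x1 + x2 - 1.5)   # fires only when both inputs are on
    # Output layer combines the hidden activations: "at least one, but not both."
    return step(h1 - h2 - 0.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))
```

Each neuron is trivial on its own; the expressive power comes entirely from wiring outputs into inputs, which is the whole point of the “network” in a neural network.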

Aircraft Trajectory Optimizer
posted by Mosaic Data Science

In this blog post we examine the use of predictive analysis and optimization for aircraft trajectory. The SSN Route Optimizer is a trajectory optimizer that utilizes the Clearable Routes Network (CRN) to find optimized trajectories between an origin point (which may be the aircraft’s current position en route) and a destination. The optimal path is a true 4D path that considers aircraft performance, aircraft category (S/L/H), engine type (P/T/J), and climbs/descents/cruise segments. The CRN contains a network representation of the 3-D structure of flight clearances and is created by the cleverly named CRN Generator. The Route Optimizer also relies on the aircraft performance model and a network-based search algorithm such as the A-Star search algorithm. Each is described below, followed by a brief discussion of how everything is pulled together.
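For readers unfamiliar with A-Star (A*), here is a generic textbook sketch in Python on a toy 2-D grid; the actual Route Optimizer’s 4-D clearance network, aircraft performance costs, and heuristics are not shown here:

```python
import heapq

def a_star(start, goal, neighbors, cost, heuristic):
    """Generic A* search: returns (path, total_cost) for the cheapest route."""
    frontier = [(heuristic(start, goal), 0, start, [start])]
    best_g = {}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if node in best_g and best_g[node] <= g:
            continue  # already reached this node more cheaply
        best_g[node] = g
        for nxt in neighbors(node):
            g2 = g + cost(node, nxt)
            heapq.heappush(frontier,
                           (g2 + heuristic(nxt, goal), g2, nxt, path + [nxt]))
    return None, float("inf")

# Toy 3x3 grid with one blocked cell standing in for an unclearable region.
blocked = {(1, 1)}

def grid_neighbors(p):
    x, y = p
    cands = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [c for c in cands
            if 0 <= c[0] <= 2 and 0 <= c[1] <= 2 and c not in blocked]

manhattan = lambda p, q: abs(p[0] - q[0]) + abs(p[1] - q[1])
path, total = a_star((0, 0), (2, 2), grid_neighbors,
                     cost=lambda a, b: 1, heuristic=manhattan)
```

Because the Manhattan heuristic never overestimates the remaining cost on a unit-cost grid, A* is guaranteed to return an optimal path while exploring far fewer nodes than uninformed search.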

Aircraft Performance Model Data Mining
posted by Mosaic Data Science

Being a premier analytics consulting firm, we frequently encounter data mining projects. In this blog we wanted to share a recent experience of data mining that helped guide the optimization of aircraft takeoff. We have needed to be able to predict how long a flight will take to fly its trajectory. Quite often, it has been adequate and possible to use the outputs of one of our predictive analysis tools for this purpose. It predicts both the arrival time (ETA) as well as some intermediate times that we have used in a variety of other places. But what should we do when we can’t use our predictive analysis tool? For instance, what about when we’re planning a route rather than following an existing route that the system knows about? What do we do when the system isn’t good enough? A recent project faced some of these challenges. Despite having limited data, the resulting model turns out to be quite accurate.

Tips for Customer Propensity Modeling
posted by Mosaic Data Science

A data scientist should use the insights from this exploratory analysis to drive feature engineering and model development activities. It is advantageous to follow an agile approach to model development. The data integration and modeling workflow should be implemented in an analytics platform to allow for multiple modeling approaches to be efficiently compared against each other. The data science consultant should focus on machine learning models for classification such as logistic regression, random forests, naïve Bayes classifiers, or support vector machines (SVMs). Value should be placed on model simplicity – model complexity should only be increased if there is sufficient performance improvement. The relative importance of model interpretability (the ability for human subject matter experts to understand the internal model logic) needs to be accounted for and should balance the objectives of model performance and interpretability.
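The "compare multiple modeling approaches efficiently" workflow can be sketched in a few lines. This is an illustrative example using scikit-learn and synthetic data (the dataset and model settings are invented for the sketch, not drawn from a real engagement):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a customer dataset (label = responded or not).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidate models, simplest first; prefer the simple one unless the
# complex one buys a meaningful performance improvement.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
```

Cross-validated scores put the candidates on equal footing, and keeping the interpretable model (logistic regression) in the comparison makes the performance-versus-interpretability trade-off explicit rather than implicit.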

Intro to Docker
posted by Mosaic Data Science

With that basic outline in place, but with very little additional understanding, he wondered if Docker might help him. Our consultant needed to set up a Predictive Maintenance demo for an upcoming sales meeting. It requires various R libraries that are called via an R-to-Java bridge. It then relays the results through a web application. Previously he had only ever been able to successfully link this all up on Linux and each time that he had needed to dust it off, it had required some configuration work to make sure that all the right libraries are interacting properly. It hadn’t really had a permanent home, so it has been tough to get demos up and running quickly. Our team discussed setting up a new linux server for this, but it occurred to our consultant that this might be a case where Docker might be useful.

Human Decision Making in Machine Learning Deployment for Resume Matching
posted by Mosaic Data Science, Part 2 of 2

The application of machine learning to text-based problem domains can use the text itself as a basis for explanation. Because the text is already understandable to a human observer, the groupings of text tokens and phrases can also be readily explained and understood. *Note that this is not intended to imply that all groupings or associations of words and phrases found through machine learning will be obvious and could have been found through trivial exploration. The point is that the groupings and associations derived through machine learning algorithms are more likely to be understandable because of their linguistic nature and will provide a basis for explanation of unique, unexpected, and/or hidden relationships between the resume and the job requirements.

Human Decision Making in Machine Learning Deployment for Air Traffic Flow Management
posted by Mosaic Data Science, part 1 of 2

Although some machine learning models can provide limited insight into and explanation of their outputs, most machine learning model output is highly obfuscated and opaque. In the realm of many decision support tools for military and other safety- or life-critical applications, it is necessary and appropriate for humans to be involved in decisions using the recommendations and guidance of computer automation and information systems. However, that opacity can lead users of the technology to doubt the reliability of the information or recommendations it provides. This lack of understanding of the technology can result in distrust and, eventually, in the technology failing to gain acceptance and use in its intended operational domain.

Filling Predictive Modeling Gaps with Anomaly Detection
posted by Mosaic Data Science

Anomaly detection can be deployed alongside supervised machine learning models to fill an important gap in both of these use cases. Anomaly detection automates the process of determining whether the data that is currently being observed differs in a statistically meaningful and potentially operationally meaningful sense from typical data observed historically. This goes beyond simple thresholding of data. Anomaly detection models can look across multiple sensor streams to identify multi-dimensional patterns over time that are not typically seen.
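As a minimal sketch of the multi-dimensional idea, the Mahalanobis distance flags observations that break the historical correlation between sensor streams even when each stream is individually within range. The sensor values and covariance here are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
# Historical "normal" readings from two positively correlated sensor streams.
normal = rng.multivariate_normal([50.0, 100.0], [[4.0, 3.0], [3.0, 4.0]],
                                 size=1000)

mean = normal.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal, rowvar=False))

def mahalanobis(x):
    """Multivariate distance from the historical baseline."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

# Each sensor alone looks plausible, but the second reading breaks the
# usual correlation between the two streams.
typical = np.array([51.0, 101.0])
odd_combo = np.array([46.0, 104.0])
```

A per-sensor threshold would pass both readings; the multivariate distance assigns the second one a much larger score because the *combination* is atypical, which is exactly the gap anomaly detection fills.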

Data is Everywhere!
posted by Mosaic Data Science

In the course of several recent data science projects, I’ve been examining data providers external to Mosaic. It’s certainly not the most exciting topic, but questions often seem to arise that are structured something like “If only we knew [X], then we could do [something awesome].” Trying to make progress on these projects has led me to chase down some data. Here are a few notes on various lessons and providers that may be useful for others.

Data Science in Manufacturing
posted by Mosaic Data Science

Manufacturing holds multiple predictive analytics and data science opportunities. With the rise of the Internet of Things (IoT) and data collection technologies becoming more accessible, manufacturing companies have a wealth of data to mine. Companies can use predictive analysis and optimization algorithms on these data sets to apply data-driven guidance and decision making to improve efficiency and quality, and to reduce costs.

Debating the Issues with NLP
posted by Mosaic Data Science

Since August of 2015, the presidential hopefuls from both major political parties have been joining in the primary debates to jockey for the two coveted positions in the general presidential election later this fall. The debates have been spirited and full of rich information about each of the candidates. Back in February, the folks at About Techblog did an analysis of the candidates’ language use in the debates up to that time (see Analyzing the Language of the Presidential Debates). We thought it would be interesting to parse through all of the data, including the primary debates that have occurred since About Techblog did their analysis, using our own NLP techniques.

posted by Mosaic Data Science

While experts may debate exactly what makes a human being human, there are a couple of unique traits that everybody agrees upon. One of those traits is Language: the capacity to communicate one’s thoughts, ideas, and feelings to others through a highly complex system of vocal, visual, and orthographic signals. No other species on earth can do that in the same way or with the same level of complexity.

Word Frequency Models: A Natural Language Processing Technique

posted by Mosaic Data Science

In a recently completed project with a Mosaic client, we were able to use some natural language processing (NLP) techniques to great effect. We used a word frequency model (also called bag of words) to parse resumes and then returned a set of the most likely job roles each resume was suited for. The client’s metrics measured our outputs to be about ten times more accurate than what they had been using. These models are pretty easy to use and can be applied to many different types of NLP problems.
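A bag-of-words model really is this simple at its core: count the words in each document, then compare the count vectors. The role vocabularies, resume text, and cosine-similarity matcher below are toy stand-ins, not the client system:

```python
from collections import Counter
import math

def bag_of_words(text):
    """Word frequency model: order is discarded, only counts remain."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Toy vocabularies for two job roles (invented for the example).
roles = {
    "data scientist": bag_of_words("python statistics machine learning models data"),
    "web developer": bag_of_words("javascript html css web frontend design"),
}
resume = bag_of_words("experienced in python data models and statistics")
best = max(roles, key=lambda r: cosine(resume, roles[r]))
```

Despite ignoring word order entirely, the count vectors carry enough signal to rank the roles; production systems typically add stop-word removal, stemming, and TF-IDF weighting on top of this skeleton.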

Revolutionary information flow finding in the common squid

posted by Mosaic Data Science

In June of 2000, with much fanfare, the Human Genome Project completed the initial draft of the human genome. President Bill Clinton, with British PM Tony Blair and Francis Collins, then director of the National Human Genome Research Institute, announced that the newly decoded human genome would “revolutionize the diagnosis, prevention and treatment of most, if not all, human diseases.” Collins forecast a grand vision of “personalized medicine” by 2010. The molecular biology revolution was producing an exponentially growing volume of data and expectations were high. But ten years later, in an article entitled “Revolution Postponed,” Scientific American conceded that the Human Genome Project had “failed so far to produce the medical miracles that scientists had promised.” Much excellent research work had been accomplished, but in the age of Big Data, the Human Genome Project is an example of how complex problems are not always solved merely with more data. Big Data sometimes needs Big Analysis. Consider the recent finding that the common squid, Doryteuthis pealeii, massively reprograms its own genetic data in real time.

Ontology 101, Part 3: How to Create an Ontology
posted by Mosaic Data Science

In Part 2 of this three-part series, we discussed the motivation behind and a high-level overview of our TMI ontology. If you have yet to read either Part 1 or Part 2 of this series, please do so before continuing. In the final part of this series, we look at the steps that we took to create the TMI ontology. It is important to note that even though the examples for each step link back to the TMI ontology, the method that we utilized can be used for any domain. For the purposes of clarity, all references to specific classes, properties, and individuals contained within an ontology will be written in italics, like this.

Ontology 101, Part 2: A Practical Application of an Ontology
posted by Mosaic Data Science

In Part 1 of this three-part series, we discussed what an ontology is and what the key components are. If you have yet to read that article, please do so before continuing. In Part 2, we look at how an ontology can be applied to a domain, specifically our Traffic Management Initiative (TMI) ontology developed under the TMI Attribute Standardization (TAS) project. This article will first give a brief overview of why an ontology is needed for TMI data and then give a high-level overview of the ontology that we have created. For the purposes of clarity, all references to specific classes, properties, and individuals contained within an ontology will be written in italics, like this.

Ontology 101, Part 1: What is an Ontology
posted by Mosaic Data Science

Through the use of an ontology in the development process, each team member (i.e., business analysts, data architects, and developers) plays a crucial role in maintaining a consistent story and plan across all aspects of the application. Understanding that the word “ontology” is new to some people, I thought it would be useful to explore the world of ontologies by giving a more formal introduction.

The Taylor Series and Beyond
posted by Mosaic Data Science

In the modern science of data analytics, sometimes oldies are goodies. I once took an optimization class where the answer to every question posed by the professor was “the Taylor series,” referring to a popular numerical method that will be 300 years old next year. Brook Taylor’s 1715 formulation, which can be traced back even further to James Gregory in the seventeenth century, is the foundation of a great many of today’s numerical methods, of which one of the most powerful is nonlinear batch least squares.
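To see why the Taylor series still earns its keep, here is a small numeric sketch approximating e^x by its partial sums around zero (the test point and term counts are arbitrary choices for the illustration):

```python
import math

def taylor_exp(x, n_terms):
    """Partial sum of the Taylor series for e**x about 0: sum of x**k / k!."""
    return sum(x ** k / math.factorial(k) for k in range(n_terms))

# More terms, smaller error: the workhorse behavior behind many
# numerical methods, including linearizations used in least squares.
approx = taylor_exp(1.0, 10)
exact = math.e
```

Ten terms already agree with e to better than one part in a million, and truncating the same series at the first-order term is exactly the linearization step inside methods like nonlinear batch least squares.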

New Healthcare Study on Survivability Yields Some Surprises

posted by Mosaic Data Science

Fitness trainers have long debated the virtues of volume versus intensity. Should I do 50 pushups or a dozen bench presses? Now a new data analysis study of 58,000 heart stress tests suggests that when it comes to survivability, high-stress exercise may be more important than high repetitions. That may come as a surprise to those who like to take long walks.

The first lesson healthcare researchers learn is to expect the unexpected. Any new data set usually has a few surprises and so it is important to impose a minimum of structure while analyzing the data. It is important to let the data speak for itself. In this study, demographic, clinical, exercise, and mortality data were collected for 58,020 participants from the Detroit, Michigan area. Participants were almost evenly split between male and female and the median age was 53 years. The data run from 1991-2009 with a 10-year median span for each participant.

A Light Regulatory Touch Helps the NGS Data Revolution
posted by Mosaic Data Science

About twenty years ago the post-genomic era began to emerge in computational biology disciplines. Rather than information flowing from DNA to RNA to protein sequences, a new central dogma, much broader in scope, began to take shape. Genomes led to gene products, which implied structures and functions, which led to pathways and physiology. In the post-genomic era computational biology would move from single genes and single functions to systems of genes, structures, functions, pathways and behaviors. And when this new approach was applied to the new genomics data the result would be, as Francis Collins put it, personalized medicine. That day is now soon approaching with…

The Next Revolution: noninvasive and global visualization of protein metabolism
posted by Mosaic Data Science

Fifty years ago the data revolution in molecular biology was beginning as Max Perutz had shown how to map protein tertiary structure using X-ray crystallography and Pehr Edman was learning to read the primary-structure amino acid sequence using his stepwise degradation method. Since then, ever-improving methods have led to a data explosion, requiring new and better methods for analyzing and modeling the plethora of data in both research and in healthcare. Bioinformatics, computational biology, healthcare data analysis, and healthcare predictive modeling are working to keep pace with the enormous wealth of information, and now…

Data Architecture 101, Part 5: Indexes
posted by Mosaic Data Science

Indexes have two main purposes in relational databases. First, they can improve query performance. Second, they can implement data-integrity constraints. (For example, you can create a unique index to enforce a uniqueness constraint.) This article focuses on the former purpose, in the BI/analytics (not OLTP) context. Throughout, we use Oracle indexes as examples. Oracle’s indexing capabilities generally lead the market, so if you understand how to use indexes in an Oracle database, it’s easy to transfer that knowledge to other (less capable) RDBMS platforms. For example, SQL Server clustered tables approximate Oracle index-organized tables.
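The uniqueness-constraint use of an index is easy to demonstrate. The sketch below uses SQLite (via Python’s sqlite3 module) rather than Oracle purely for portability; the table and index names are invented for the example, but the same `CREATE UNIQUE INDEX` statement works across major RDBMS platforms:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
# A unique index both speeds lookups on email and enforces uniqueness.
conn.execute("CREATE UNIQUE INDEX ix_customers_email ON customers (email)")

conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
try:
    conn.execute("INSERT INTO customers VALUES (2, 'a@example.com')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```

The second insert is rejected by the index itself, so the constraint holds no matter which application writes to the table, and queries filtering on `email` can use the same index for fast lookups.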

How to Make the Most of Your Data-Science Dollar

posted by Mosaic Data Science

Data scientists are a scarce commodity, and are likely to remain so for years to come.[i] At the same time, data science can create a substantial competitive advantage for early adopters who make the best use of their scarce data-science resources.

Data Debt
posted by Mosaic Data Science

In 2011 Chris Sterling published the very instructive book Managing Software Debt: Building for Inevitable Change. The book generalizes the concept of technical debt to account for a variety of similar classes of software-development process debt. Besides technical debt, Mr. Sterling describes quality debt, configuration-management debt, design debt, and platform-experience debt.

Data Architecture 101, Part 4: Ontology-Driven Development is Lean
posted by Mosaic Data Science

In software-development nirvana, the business analysts, database technologists, and application developers all speak the same language. Everyone agrees about what each user story means. Everyone knows what’s in each database table and column, just by looking at them. The source code practically explains itself. Nobody creates database tables that never get used. Nobody writes orphaned code.

Sound too good to be true? Not really. It’s not even that hard. To do it, you just need to add two documents and a few straightforward steps to your agile/scrum development process. Here’s how.

Data Architecture 101, Part 3: Dimensions
posted by Mosaic Data Science

Data marts, data warehouses, and some operational datastores use dimension tables. A dimension table categorizes a fact table that joins to the dimension. At query time one filters the facts by values in the dimension table, and uses those values to label the query results. For example, four dimensions in Figure 2 of our second data-architecture post “Overview of Relational Architectures” categorize a sale line-item fact.
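The filter-and-label pattern looks like this in practice. A minimal star-schema sketch in SQLite (table and column names invented for the example, not taken from the referenced figure):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A tiny star schema: one dimension table categorizing a fact table.
conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,"
             "                          category TEXT)")
conn.execute("CREATE TABLE fact_sales (product_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "shoes"), (2, "hats")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [(1, 40.0), (1, 60.0), (2, 15.0)])

# Filter the facts by a dimension value, and label the result with it.
rows = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product d ON f.product_id = d.product_id
    WHERE d.category = 'shoes'
    GROUP BY d.category
""").fetchall()
```

The dimension value appears twice in the query, once in the WHERE clause to filter the facts and once in the SELECT list to label the aggregate, which is exactly the dual role dimension tables play.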

Data Architecture 101, Part 2: Overview of Relational Architectures
posted by Mosaic Data Science

In our first post we reviewed the rudiments of relational data architecture. This post uses those concepts to survey the main types of relational architectures. These divide fundamentally into two types, the second having four sub-types:
• online transaction processing (OLTP)
• online analytical processing (OLAP), with four sub-types:
  • OLAP cube
  • data mart
  • (enterprise) data warehouse
  • operational datastore (ODS)

posted by Mosaic Data Science

Executives considering how to apply data science to their organizations often ask Mosaic about “relevant industry experience.” Historically this has been a legitimate question to aim at a management consultant. Each industry has had its own set of best practices. A consultant’s responsibility has generally been to provide expertise about these practices and guide the customer in applying them profitably. For example, two decades ago a fashion retailer might reasonably ask a business consultant about her or his expertise with the Quick Response method, then a best practice for fashion retail.[i] Posing the same sort of question now to a data scientist assumes that industry experience continues to play the same role in data science that it has historically in management consulting.

Data Science Design Pattern #5: Combining Source Variables

posted by Mosaic Data Science

Variable selection is perhaps the most challenging activity in the data science lifecycle. The phrase is something of a misnomer, unless we recognize that mathematically speaking we’re selecting variables from the set of all possible variables—not just the raw source variables currently available from a given data source.[i] Among these possible variables are many combinations of source variables. When a combination of source variables turns out to be an important variable in its own right, we sometimes say that the source variables interact, or that one variable mediates another. We’ll coin the phrase synthetic variable to mean an independent variable that is a function of several source variables, regardless of the nature of the function.
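A toy illustration of a synthetic variable (all column names and values invented for the example): neither source variable alone captures the effect, but their product, treated as a new independent variable, does.

```python
import pandas as pd

# Toy marketing data: a discount only matters if the customer saw it.
df = pd.DataFrame({
    "discount_pct":  [0, 10, 0, 10],
    "email_opened":  [0,  0, 1,  1],
})

# Synthetic variable: the interaction of two source variables, created
# during feature engineering and fed to the model like any other input.
df["discount_seen"] = df["discount_pct"] * df["email_opened"]
```

A model given only the two source columns has to discover the interaction on its own; handing it `discount_seen` directly encodes the domain insight that the discount mediates response only when the email is opened.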

Data Science Design Pattern #4: Transformations of Individual Variables

posted by Mosaic Data Science

It’s very common while exploring data to experiment with transformations of individual variables. Some transformations rescale while preserving order; others change both scale and order. In this post we describe some common ways to transform individual variables, and explore how doing so may benefit an analysis. (We’ll tackle transformations of multiple variables in another post.)
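Here is a quick sketch of the two flavors, using made-up income figures: a log transform rescales while preserving order, while a deviation-from-the-mean transform changes both scale and order.

```python
import math

incomes = [20_000, 45_000, 60_000, 1_000_000]

# Monotone transform: compresses the skewed tail, order is preserved.
log_incomes = [math.log10(x) for x in incomes]

# Non-monotone transform: absolute deviation from the mean changes
# both the scale and the ordering of the observations.
mean = sum(incomes) / len(incomes)
abs_dev = [abs(x - mean) for x in incomes]
```

The log version keeps the same ranking with a tamer spread (useful when a model assumes roughly symmetric inputs), whereas the deviation version reorders the middle observations, so it answers a different question entirely.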

The Executive Role in a Data-Driven Organization
posted by Mosaic Data Science

Executives know that one must effect a variety of organizational changes in a timely fashion, to support a technology change. Otherwise, the organization may resist or reject the change. These changes may involve the formal and informal reward systems, organization structure, resource allocations, and cultural norms.

Data Science Design Pattern #3: Handling Null Values
posted by Mosaic Data Science

Most data science algorithms do not tolerate nulls (missing values). So, one must do something to eliminate them, before or while analyzing a data set.
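The two most common remedies, dropping and imputing, fit in a few lines of pandas (the toy columns and values here are invented for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [34, np.nan, 29],
                   "income": [55_000, 72_000, np.nan]})

# Option 1: drop any row containing a null (simple, but discards data).
dropped = df.dropna()

# Option 2: impute nulls with a per-column statistic (here, the median).
imputed = df.fillna(df.median())
```

Dropping is safe only when missingness is rare and random; imputation keeps every row but quietly injects an assumption about the missing values, so the choice deserves more thought than it usually gets.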

Data Architecture 101, Part 1: Rudiments
posted by Mosaic Data Science

This post is the first in a series on relational database architecture and tuning. It’s a mature subject, but we continue to encounter programmers and data scientists who have limited exposure to the material. This blog aims to become a “nutshell” treatment of the subject, so those of you who work with data in a relational database management system (RDBMS) can quickly learn how to make the best possible use of a database.

Data Science Design Pattern #2: Variable-Width Kernel Smoothing
posted by Mosaic Data Science

A fundamental problem in applied statistics is estimating a probability mass function (PMF) or probability density function (PDF) from a set of independent, identically distributed observations. When one is reasonably confident that a PMF or PDF belongs to a family of distributions having closed form, one can estimate the form’s parameters using frequentist techniques such as maximum likelihood estimation, or Bayesian techniques such as acceptance-rejection sampling.
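When no closed-form family fits, kernel smoothing offers a nonparametric alternative. Below is a minimal fixed-width Gaussian kernel density estimator for illustration; the variable-width version this post’s title refers to would additionally let the bandwidth adapt to local data density. The data and bandwidth are made up for the example:

```python
import math

def gaussian_kde(data, bandwidth):
    """Fixed-width Gaussian kernel density estimate: one bump per
    observation, summed and normalized into a density."""
    n = len(data)
    def pdf(x):
        return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2)
                   for d in data) / (n * bandwidth * math.sqrt(2 * math.pi))
    return pdf

# Two clusters of observations with a gap between them.
data = [1.0, 1.2, 0.9, 5.0, 5.1]
pdf = gaussian_kde(data, bandwidth=0.5)
```

The estimate is high near both clusters and low in the gap, with no distributional family assumed; the bandwidth plays the role the parameters played in the parametric approaches above.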

posted by Mosaic Data Science

This is the first in a series of technical blog posts describing design patterns useful in constructing data science models, including decision-support and decision-automation systems. We hope that this blog will become a clearinghouse within the data science community for these design patterns, thereby extending the design-pattern tradition in software development and enterprise architecture to data science.

posted by Mosaic Data Science

Given the current shortage of data scientists in the U.S. labor market, some argue that employers should simply train internal IT staff to program in a language such as Python or R having strong data-analysis capabilities, and then have these programmers do the company’s data science. Or they may hire analysts with statistical training, but little or no background in optimization. (We discuss this risk in our white paper “Standing up a Data Science Group.”)

This post illustrates an important risk in this homegrown approach to data science. The programmers or statisticians may, in some sense, perform a correct statistical analysis. They may nevertheless fail to arrive at a good solution to an important optimization problem. And it is almost always the optimization problem that the business really cares about. Treating an optimization problem as a purely statistical problem can cost a business millions in lost revenue or cost reductions, in the name of minimizing data science labor expense.

posted by Mosaic Data Science

Welcome to Mosaic Data Science, and thanks for reading our blog! We’ll frequently opine here about various technical and managerial data science topics, so visit often.

The phrase ‘big data’ has become enormously popular in the business press. Like many business buzz phrases, it has lost much of its original meaning. More often these days when a business writer says “big data” they mean data science, or data science applied to a large data set. Some traditional-BI vendors try to capitalize on the buzz by identifying new features of their offerings as supporting “big data,” even though they work in the traditional relational-database paradigm, which big data by definition does not fit.

The phrase does have a clear (and useful) original definition. Big data is data that is too big to be stored economically in a relational database. Just what that means depends on whose budget we’re talking about, and what year. Regardless, many new data-storage technologies have been invented out of the need to store data that’s too expensive to manage with a relational database. There’s just too much of it.