Diagnosing Windows 10 Crashes

A Mosaic Data Science Case Study


Background

A leading management consulting firm recently began an enterprise-wide upgrade of existing and new laptops to the Windows 10 operating system (OS). The firm deployed a fleet of 3,000 laptops to employees around the world. An alarming number of consultants began reporting the infamous ‘Blue Screen of Death’ (BSOD). This screen appears immediately after a system-wide failure and provides a cryptic description of the error. BSODs may indicate significant problems that could affect a computer’s performance and long-term reliability.

Figure 1. BSOD screen in Windows 7 (left) and Windows 10 (right). Images from Wikipedia, 2017.

Management consultants need access to their laptops in order to generate revenue for the firm. Consultants lose valuable billable hours when they cannot rely on their work computers, which costs the firm both revenue and profit and leads to customer dissatisfaction.

Figure 2. Number of BSOD events per week

The firm’s IT department asked Mosaic, a leading data mining company, to determine potential causes of the computer crashes and to generate a set of metrics that could be integrated into the firm’s business intelligence systems to detect warning signs earlier in the future.

Analysis

Mosaic started this effort with a root cause analysis to understand why these crashes were occurring. Was the problem Windows 10 itself? Was a particular configuration of hardware and software drivers to blame? Or were employees simply asking more of their computers than ever before?

The Mosaic data science consultants aggregated information from the company’s internal databases and external sources (such as computer manufacturers’ websites) to compile two datasets: a complete list of the hardware specifications for all Windows 10 computers currently in use at the company, and the event log for each BSOD event that occurred on one of these computers.

For each computer, Mosaic counted the number of BSOD events that occurred. Mosaic then separated the computers into groups by hardware component models to compare the frequency of BSODs in computers with different component types. Because the Windows 10 laptops were rolled out to employees over several months, the total number of BSODs for each group of computers was divided by the number of months the machines in that group had been in use. This provided the average number of BSODs per month of computer use for each group and controlled for potential differences in the age of the computers.
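
The normalization described above can be sketched in a few lines of Python. This is a minimal illustration rather than Mosaic’s actual code: the field names (`video_card`, `months_in_use`, `bsod_count`), the group labels, and the sample values are all hypothetical, and the sketch assumes that "months in use" is summed across every machine in a group.

```python
from collections import defaultdict

def bsods_per_machine_month(machines, group_key):
    """Return {group: total BSODs / total months in use} for each group."""
    bsods = defaultdict(int)
    months = defaultdict(float)
    for m in machines:
        group = m[group_key]
        bsods[group] += m["bsod_count"]
        months[group] += m["months_in_use"]
    # Skip groups with no recorded exposure to avoid dividing by zero.
    return {g: bsods[g] / months[g] for g in bsods if months[g] > 0}

# Hypothetical sample records, invented for illustration only.
fleet = [
    {"video_card": "card_x", "months_in_use": 6, "bsod_count": 9},
    {"video_card": "card_x", "months_in_use": 2, "bsod_count": 3},
    {"video_card": "card_y", "months_in_use": 5, "bsod_count": 1},
    {"video_card": "card_y", "months_in_use": 5, "bsod_count": 1},
]

rates = bsods_per_machine_month(fleet, "video_card")
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.2f} BSODs per machine-month")
```

Dividing by exposure time rather than by machine count keeps a group of recently deployed laptops from looking more stable simply because it has had less time to crash.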

Mosaic analyzed the BSOD incidence for all hardware and firmware configurations present in the Windows 10 fleet in order to isolate the specific components that may have contributed to a higher incidence of blue screen events. This was particularly challenging because not all variations of each component appeared in large numbers of computers, and some components appeared with only one combination of other hardware models. That made it difficult to disentangle the effect of, for instance, a single video adapter model from that of a particular hard drive model.
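
One simple way to spot this kind of entanglement is to list, for each model of one component, which models of another component it ships with. The sketch below is an assumption about how such a check might look, not a description of Mosaic’s tooling; the field names and model labels are hypothetical.

```python
from collections import defaultdict

def partners(machines, component, other):
    """Map each model of `component` to the set of `other` models it ships with."""
    seen = defaultdict(set)
    for m in machines:
        seen[m[component]].add(m[other])
    return dict(seen)

def confounded_models(machines, component, other):
    """Models of `component` that always appear with the same `other` model."""
    pairs = partners(machines, component, other)
    return sorted(model for model, others in pairs.items() if len(others) == 1)

# Hypothetical fleet records, invented for illustration only.
fleet = [
    {"video_card": "card_x", "hard_drive": "drive_p"},
    {"video_card": "card_x", "hard_drive": "drive_p"},
    {"video_card": "card_y", "hard_drive": "drive_p"},
    {"video_card": "card_y", "hard_drive": "drive_q"},
]

stuck = confounded_models(fleet, "video_card", "hard_drive")
print(stuck)
```

In this invented data, `card_x` only ever appears alongside `drive_p`, so crash rates alone cannot separate the effect of one from the other.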

Looking across the most common computer versions, Mosaic noticed one component that seemed to coincide with higher BSOD rates: a particular internal video adapter (also known as a graphics card or video card). Figure 3 shows the average number of BSOD events per month for the eight computer versions used by at least 150 employees at the company. Three of the four versions with the highest crash rates use the same video card; this component is labelled “Video Card A” in the figure.

Figure 3. BSOD events per month in most common computer versions. The number of computers of each version is listed as “N” below the version number on the x-axis.

 

However, this alone was not conclusive evidence that the video card caused the crashes. For example, it is possible that Video Card A tended to be installed together with Hard Drive B, and that Hard Drive B was the true cause of the BSOD events in these machines. To assess whether the video card caused, rather than merely correlated with, machine instability, Mosaic’s analytics consulting team also looked at the reference codes in the BSOD event logs for all crashes occurring in these eight computer versions. Grouping the versions with Video Card A separately from those with other video cards, Mosaic’s data science consultants calculated the average number of BSOD events of each type per month. Figure 4 shows the average number of events per machine-month for the four most frequent event codes.
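
The event-code breakdown can be sketched as follows. This is an illustrative assumption about the calculation, not Mosaic’s code: the group labels, machine-month totals, and event counts are invented, and the two code names are standard Windows bug check names used here only as example labels.

```python
from collections import Counter, defaultdict

def event_rates(events, machine_months):
    """Return {group: {event_code: events per machine-month}}."""
    counts = defaultdict(Counter)
    for group, code in events:
        counts[group][code] += 1
    return {
        group: {code: n / machine_months[group] for code, n in by_code.items()}
        for group, by_code in counts.items()
    }

# Hypothetical exposure totals and event log, invented for illustration only.
machine_months = {"video_card_a": 10.0, "other": 20.0}
events = (
    [("video_card_a", "VIDEO_SCHEDULER_INTERNAL_ERROR")] * 4
    + [("video_card_a", "DRIVER_POWER_STATE_FAILURE")] * 2
    + [("other", "VIDEO_SCHEDULER_INTERNAL_ERROR")] * 2
    + [("other", "DRIVER_POWER_STATE_FAILURE")] * 1
)

rates = event_rates(events, machine_months)
for group, by_code in rates.items():
    for code, rate in sorted(by_code.items(), key=lambda kv: -kv[1]):
        print(f"{group:>12}  {code:<32} {rate:.2f}")
```

Comparing rates per event code, rather than overall crash rates, shows whether the extra crashes in one group are concentrated in error types that plausibly involve the suspected component.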

Figure 4. BSOD events per month for most common BSOD types.

 

Figure 4 shows that among machines with Video Card A, the most frequent BSOD type was an internal error in the video scheduler, which indicates a video card-related crash. The second most frequent type was a driver power state failure, which in some cases may also have been due to failures in the driver associated with the computer’s internal video card. The most frequent BSOD type in computers without Video Card A was also the video scheduler internal error, but it occurred at a much lower rate than in the Video Card A laptops.

Mosaic presented these two pieces of evidence to the company’s IT department to support Video Card A as a likely contributor to the increased frequency of BSODs in the new fleet of Windows 10 machines. The company planned to contact the computer manufacturer and request additional stability testing on these models. Mosaic also worked with the firm’s business intelligence software developers to support the integration of the data visualizations used in this consulting engagement into the firm’s existing IT monitoring dashboards. This will allow executives to monitor BSOD and application crash rates for different groups of laptops going forward. Following successful dashboard integration, the firm will be better positioned to act on instability early, such as by replacing problematic components at the first signs of trouble. This will enable the firm’s management to ensure the reliability of the technology provided to its top revenue-generating assets.

Results

In the hypercompetitive world of management consulting, thirty minutes of downtime may lead a client to cancel a lucrative relationship and hire a competitor. Management consulting companies also need to attract and retain top talent by offering the latest technology. With the power of predictive analytics, this management consulting firm can keep its consultants billing while upgrading to the latest technology with minimal downtime.