Failure patterns in operating systems: An exploratory and observational study

Bibliographic Details
Published in: The Journal of systems and software, Vol. 137, pp. 512-530
Main Authors: dos Santos, Caio Augusto Rodrigues; Matias, Rivalino
Format: Journal Article
Language: English
Published: Elsevier Inc 01-03-2018
Description
Summary:
• A protocol to discover operating system failure patterns is proposed.
• Discovered 45 genuine failure patterns.
• Failures in operating system services were the most prevalent.
• Found empirical evidence of failure correlation (cross- and autocorrelation).

Sophisticated critical computer applications need to run on top of operating system (OS) software. Given the intrinsic dependency of user applications on the OS, OS failures can severely impact even the most reliable applications. Thus, it is essential to understand how OS failures occur in order to improve software reliability. In this paper, we present an exploratory and observational study on OS failure patterns. We analyze 7007 real OS failures collected from 566 computers used in different workplaces. We start with a general characterization of the failure dataset examined in this study, where interesting findings are presented, e.g., the most frequent failure types per period of the day and per workplace. Next, we investigate the existence of failure patterns. For this purpose, we introduce an OS failure pattern discovery protocol that identifies failure patterns that are consistent across different computers used in the same workplace as well as in different workplaces. In total, we discovered 45 failure patterns with 153,511 occurrences. Based on these patterns, we found that the most prevalent failures were related to software updates of the OS components. The main causes of these failures involved infrastructural and environmental factors such as disk-space unavailability and concurrent execution of OS services. Empirical evidence of time-correlated failures of these OS components is also discussed in this paper. Other findings include the OS components that contributed most to the discovered failure patterns and the most prevalent combinations of failure events and their temporal order. This study aims to contribute to a better understanding of the mechanisms behind OS failures.
ISSN: 0164-1212, 1873-1228
DOI: 10.1016/j.jss.2017.03.058