Failure patterns in operating systems: An exploratory and observational study
Published in: | The Journal of systems and software Vol. 137; pp. 512 - 530 |
---|---|
Main Authors: | , |
Format: | Journal Article |
Language: | English |
Published: | Elsevier Inc, 01-03-2018 |
Summary: | • A protocol to discover operating system failure patterns is proposed.
• 45 genuine failure patterns were discovered.
• Failures in operating system services were the most prevalent.
• Empirical evidence of failure correlation (cross- and autocorrelation) was found.
Sophisticated, critical computer applications run on top of operating system (OS) software. Given the intrinsic dependency of user applications on the OS, OS failures can severely impact even the most reliable applications. Thus, it is essential to understand how OS failures occur in order to improve software reliability. In this paper, we present an exploratory and observational study of OS failure patterns. We analyze 7007 real OS failures collected from 566 computers used in different workplaces. We start with a general characterization of the failure dataset, which yields findings such as the most frequent failure types by period of the day and by workplace. Next, we investigate the existence of failure patterns. For this purpose, we introduce an OS failure pattern discovery protocol that identifies failure patterns that occur consistently across different computers, both within the same workplace and across different workplaces. In total, we discovered 45 failure patterns with 153,511 occurrences. Based on these patterns, we found that the most prevalent failures were related to software updates of OS components. The main causes of these failures involved infrastructural and environmental factors such as disk-space unavailability and concurrent execution of OS services. Empirical evidence of time-correlated failures of these OS components is also discussed. Other findings include the OS components that contributed most to the discovered failure patterns and the most prevalent combinations of failure events and their temporal order. This study aims to contribute to a better understanding of the mechanisms behind OS failures. |
---|---|
ISSN: | 0164-1212, 1873-1228 |
DOI: | 10.1016/j.jss.2017.03.058 |