Timely failure detection in a large distributed real-time system
Published in: | Proceedings of WORDS '94: The First Workshop on Object-Oriented Real-Time Dependable Systems, pp. 118-123 |
---|---|
Main Authors: | , |
Format: | Conference Proceeding |
Language: | English |
Published: | IEEE, 1994 |
Summary: | The paper describes the experience of designing and implementing failure detection and reporting in a large distributed real-time system used for air traffic control (ATC). We believe that systematic analysis is needed to guide the failure detection design and to track the large number of failures that it must handle. Analyses, such as determining how quickly failures must be detected, should be performed carefully to avoid later redesigns. A comprehensive analysis also provides a basis for subsequently testing the design, where fault injection and extended testing are needed to evaluate and debug it. Failure detectors should detect specific failures so that appropriate reports and recovery actions can be initiated after detection. |
---|---|
ISBN: | 9780818670831 0818670835 |
DOI: | 10.1109/WORDS.1994.518680 |