Improving availability of multicore real-time systems suffering both permanent and transient faults

Junlong Zhou, Nanjing University of Science and Technology
Xiaobo Sharon Hu, University of Notre Dame
Yue Ma, University of Notre Dame
Jin Sun, Nanjing University of Science and Technology
Tongquan Wei, East China Normal University
Shiyan Hu, Michigan Technological University

Document Type

Article

Publication Date

8-14-2019

Department

Department of Electrical and Computer Engineering

Abstract

CMOS scaling has greatly increased concerns for both lifetime reliability due to permanent faults and soft-error reliability due to transient faults. Most existing works focus on only one of the two reliability concerns, but techniques used to increase one type of reliability often adversely impact the other. A few efforts do consider both types of reliability together, but they use two different metrics to quantify them. However, for many systems, the user's concern is to maximize system availability by improving the mean time to failure (MTTF), regardless of whether the failure is caused by permanent or transient faults. Addressing this concern requires a uniform metric that captures the effect of both types of faults. This paper introduces a novel analytical expression for calculating the MTTF due to transient faults. Using this new formula and an existing method to evaluate system MTTF, we tackle the problem of maximizing availability for multicore real-time systems with consideration of permanent and transient faults. A framework is proposed to solve the system availability maximization problem. Experimental results on a hardware board and simulation results on synthetic tasks show that our scheme significantly improves system MTTF (and hence availability) compared with existing techniques.
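The abstract links availability to MTTF. As a hedged illustration only, the standard steady-state relation A = MTTF / (MTTF + MTTR), where MTTR is the mean time to repair (a quantity not named in the abstract), shows why raising MTTF raises availability when repair time is held fixed. The sketch also combines permanent- and transient-fault failure rates under an assumed constant-rate (exponential) model. The function names, the MTTR parameter, and the exponential assumption are ours; this is not the paper's analytical expression for transient-fault MTTF, nor the existing system-MTTF method it relies on.

```python
"""Illustrative availability arithmetic.

ASSUMPTIONS: constant (exponential) failure rates and a fixed MTTR.
This does not reproduce the paper's transient-fault MTTF expression.
"""


def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability: long-run fraction of time the system is up."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf / (mttf + mttr)


def combined_mttf(rate_permanent: float, rate_transient: float) -> float:
    """MTTF when permanent and transient faults are independent exponential
    failure processes (hypothetical model). The first failure from either
    cause is exponential with the summed rate, so
    MTTF = 1 / (lambda_p + lambda_t)."""
    if rate_permanent < 0 or rate_transient < 0:
        raise ValueError("failure rates must be non-negative")
    total = rate_permanent + rate_transient
    if total == 0:
        raise ValueError("at least one failure rate must be positive")
    return 1.0 / total


if __name__ == "__main__":
    # Hypothetical rates in failures per hour, chosen only for illustration.
    mttf = combined_mttf(1e-5, 4e-5)  # roughly 20,000 hours
    print(f"MTTF ~ {mttf:.1f} h, availability ~ {availability(mttf, 20.0):.6f}")
```

Under this toy model, lowering either failure rate lengthens MTTF, and with MTTR fixed that directly increases availability, which is the sense in which the abstract treats MTTF as the quantity to maximize.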

Publication Title

IEEE Transactions on Computers

Recommended Citation

Zhou, J., Hu, X. S., Ma, Y., Sun, J., Wei, T., & Hu, S. (2019). Improving availability of multicore real-time systems suffering both permanent and transient faults. IEEE Transactions on Computers, 68(12), 1785-1801. http://doi.org/10.1109/TC.2019.2935042
Retrieved from: https://digitalcommons.mtu.edu/michigantech-p/908

Michigan Tech Publications