A Survey on Multithreading Alternatives for Soft Error Fault Tolerance


ÖZ I., ARSLAN S.

ACM COMPUTING SURVEYS, vol.52, no.2, 2019 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 52 Issue: 2
  • Publication Date: 2019
  • DOI: 10.1145/3302255
  • Journal Name: ACM COMPUTING SURVEYS
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
  • Keywords: Soft error, thread-level redundancy, redundant multithreading, RELIABILITY, SYSTEMS, REDUNDANCY, EXECUTION, DESIGN, CORES
  • Marmara University Affiliated: Yes

Abstract

Smaller transistor sizes and reduction in voltage levels in modern microprocessors induce higher soft error rates. This trend makes reliability a primary design constraint for computer systems. Redundant multithreading (RMT) makes use of parallelism in modern systems by employing thread-level time redundancy for fault detection and recovery. RMT can detect faults by running identical copies of the program as separate threads in parallel execution units with identical inputs and comparing their outputs. In this article, we present a survey of RMT implementations at different architectural levels with several design considerations. We explain the implementations in seminal papers and their extensions and discuss the design choices employed by the techniques. We review both hardware and software approaches by presenting their main characteristics, and we analyze the studies with different design choices regarding their strengths and weaknesses. We also present a classification to help potential users find a suitable method for their requirements and to guide researchers planning to work in this area by providing insights into future trends.
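
The detection idea described in the abstract (identical copies, identical inputs, compared outputs) can be illustrated with a minimal Python sketch. This is not an implementation from the surveyed papers: the names `run_redundant`, `checksum`, and `flaky` are hypothetical, and the fault is simulated rather than caused by a real soft error. Python threads also do not map one-to-one onto parallel execution units, so the sketch shows only the compare-outputs logic.

```python
from concurrent.futures import ThreadPoolExecutor
import itertools


def run_redundant(task, inputs, copies=2):
    """Run `task` on identical inputs in `copies` threads and compare outputs.

    Returns (first_result, fault_detected). Hypothetical helper for illustration.
    """
    with ThreadPoolExecutor(max_workers=copies) as pool:
        futures = [pool.submit(task, *inputs) for _ in range(copies)]
        results = [f.result() for f in futures]
    # A mismatch between any two redundant copies signals a fault.
    fault_detected = any(r != results[0] for r in results[1:])
    return results[0], fault_detected


def checksum(data):
    """Deterministic workload: every fault-free copy returns the same value."""
    return sum(data) % 65521


_calls = itertools.count()


def flaky(x):
    """Simulated soft error (assumption): exactly one invocation flips the low bit."""
    return x ^ 1 if next(_calls) == 1 else x


value, fault = run_redundant(checksum, ([1, 2, 3, 4],))
_, injected_fault = run_redundant(flaky, (8,))
```

With two copies a mismatch can only be detected; recovery would need an additional mechanism, such as a third copy with majority voting or re-execution from a checkpoint.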