A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor
Single-event upsets from particle strikes have become a key challenge in microprocessor design. Techniques to deal with these transient faults exist, but come at a cost. Designers clearly require accurate estimates of processor error rates to make appropriate cost/reliability tradeoffs. This paper describes a method for generating these estimates. A key aspect of this analysis is that some single-bit faults (such as those occurring in the branch predictor) do not produce an error in a program's output. We define a structure's architectural vulnerability factor (AVF) as the probability that a fault in that particular structure will result in an error. A structure's error rate is the product of its raw error rate, as determined by process and circuit technology, and the AVF. Unfortunately, computing AVFs of complex structures, such as the instruction queue, can be quite involved. We identify numerous cases, such as prefetches, dynamically dead code, and wrong-path instructions, in which a fault will not affect correct execution. We instrument a detailed IA64 processor simulator to map bit-level microarchitectural state to these cases, generating per-structure AVF estimates. This analysis shows AVFs of 28% and 9% for the instruction queue and execution units, respectively, averaged across dynamic sections of the entire CPU2000 benchmark suite.
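The relation stated in the abstract above (a structure's error rate is its raw error rate multiplied by its AVF) can be illustrated with a minimal sketch. The AVF values are the averages quoted in the abstract; the raw error rates, the FIT unit, and the names `structure_error_rate` and `structures` are hypothetical placeholders chosen for illustration and do not come from the paper.

```python
# Illustrative sketch only. The raw rates below are made-up placeholders;
# only the AVFs (28% and 9%) are taken from the abstract.

def structure_error_rate(raw_rate: float, avf: float) -> float:
    """Return the effective error rate: raw error rate scaled by the AVF."""
    if not 0.0 <= avf <= 1.0:
        raise ValueError("AVF is a probability and must lie in [0, 1]")
    return raw_rate * avf

# Hypothetical raw error rates (in FIT, failures per 10^9 device-hours).
structures = {
    "instruction_queue": {"raw_rate": 100.0, "avf": 0.28},
    "execution_units": {"raw_rate": 100.0, "avf": 0.09},
}

effective = {
    name: structure_error_rate(s["raw_rate"], s["avf"])
    for name, s in structures.items()
}

for name, rate in effective.items():
    print(f"{name}: {rate:.1f} FIT")
```

With equal hypothetical raw rates, the structure with the lower AVF contributes proportionally less to the overall error rate, which is why per-structure AVF estimates matter when deciding where to spend protection.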
Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor
Christopher Weaver, Joel Emer, Shubhendu S. Mukherjee et al. | ACM SIGARCH Computer Architecture News | 2004
Transient faults due to neutron and alpha particle strikes pose a significant obstacle to increasing processor transistor counts in future technologies. Although fault rates of individual transistors may not rise significantly, incorporating more transistors into a device makes that device more likely to encounter a fault. Hence, maintaining processor error rates at acceptable levels will require increasing design effort. This paper proposes two simple approaches to reduce error rates and evaluates their application to a microprocessor instruction queue. The first technique reduces the time instructions sit in vulnerable storage structures by selectively squashing instructions when long delays are encountered. A fault is less likely to cause an error if the structure it affects does not contain valid instructions. We introduce a new metric, MITF (Mean Instructions To Failure), to capture the trade-off between performance and reliability introduced by this approach. The second technique addresses false detected errors. In the absence of a fault detection mechanism, such errors would not have affected the final outcome of a program. For example, a fault affecting the result of a dynamically dead instruction would not change the final program output, but could still be flagged by the hardware as an error. To avoid signaling such false errors, we modify a pipeline's error detection logic to mark affected instructions and data as possibly incorrect rather than immediately signaling an error. Then, we signal an error only if we determine later that the possibly incorrect value could have affected the program's output.
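The MITF trade-off described in this abstract can be sketched as follows. The abstract does not give a formula, so this sketch assumes that MITF is the number of instructions committed per failure, computed as IPC × clock frequency × MTTF, with MTTF taken as the reciprocal of (raw error rate × AVF). All numeric values and the function name `mitf` are hypothetical and chosen only to show how a small performance loss can be outweighed by a larger reliability gain.

```python
# Illustrative sketch only. The formula is an assumed reading of MITF
# (instructions committed per failure); every number below is hypothetical.

def mitf(ipc: float, frequency_hz: float,
         raw_errors_per_second: float, avf: float) -> float:
    """Mean instructions to failure = IPC * frequency * MTTF."""
    mttf_seconds = 1.0 / (raw_errors_per_second * avf)
    return ipc * frequency_hz * mttf_seconds

FREQUENCY_HZ = 2.0e9   # hypothetical clock frequency
RAW_RATE = 1.0e-9      # hypothetical raw error rate, errors per second

# Baseline: instructions wait in the queue during long stalls.
mitf_base = mitf(ipc=1.20, frequency_hz=FREQUENCY_HZ,
                 raw_errors_per_second=RAW_RATE, avf=0.28)

# With squashing: slightly lower IPC, but the queue holds valid
# instructions for less time, so its AVF is lower.
mitf_squash = mitf(ipc=1.17, frequency_hz=FREQUENCY_HZ,
                   raw_errors_per_second=RAW_RATE, avf=0.20)

print(f"baseline MITF:  {mitf_base:.3e} instructions")
print(f"squashing MITF: {mitf_squash:.3e} instructions")
print(f"improvement:    {mitf_squash / mitf_base:.2f}x")
```

Under these assumed numbers the squashing configuration commits more instructions per failure even though it runs slightly slower, which is the kind of comparison a combined performance/reliability metric is meant to make possible.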
A fault tolerant approach to microprocessor design
We propose a fault-tolerant approach to reliable microprocessor design. Our approach, based on the use of an online checker component in the processor pipeline, provides significant resistance to core processor design errors and operational faults such as supply voltage noise and energetic particle strikes. We show through cycle-accurate simulation and timing analysis of a physical checker design that our approach preserves system performance while keeping area overheads and power demands low. Furthermore, analyses suggest that the checker is a fairly simple state machine that can be formally verified, scaled in performance, and reused. Further simulation analyses show virtually no performance impacts when our simple checker design is coupled with a high-performance microprocessor model. Timing analyses indicate that a fully synthesized unpipelined 4-wide checker component in 0.25 µm technology is capable of checking Alpha instructions at 288 MHz. Physical analyses also confirm that costs are quite modest; our prototype checker requires less than 6% the area and 1.5% the power of an Alpha 21264 processor in the same technology. Additional improvements to the checker component are described which allow for improved detection of design, fabrication and operational faults.
Measuring architectural vulnerability factors
The continuous exponential growth in transistors per chip as described by Moore's law has spurred tremendous progress in the functionality and performance of semiconductor devices, particularly microprocessors. At the same time, each succeeding technology generation has introduced new obstacles to maintaining this growth rate. Transient faults caused by single-event upsets have emerged as a key challenge likely to gain significantly more importance in the next few design generations. Techniques for dealing with these faults exist, but they come at a cost. Designers need accurate soft-error estimates early in the design cycle to weigh the benefits of error protection techniques against their costs. This article presents a method for generating these estimates.