# **Fault Tolerant Computing** Prof. David August/Prof. David Walker # **Transistors** Los Alamos National Lab ASC Q 2048-node supercomputer crashes regularly [Michalak 2005] In 2000, Sun server systems deployed to America Online, eBay, and others crash [Baumann 2002] 22 SEARCI **Technology News** Sun tries to cope with server flaw Worthwhile? b 🥍 + 0 ZDNet Tags: Sun Microsystems Inc For the past year, Sun Microsystems Inc. has struggled to solve a mysterious fault that can cause its high-end servers to crash unexpectedly, an embarrassing problem for a computer maker that routinely refers to its servers as "rock solid" reliable. The Palo Alto, Calif., company said the problem, which was eventually traced to a memory flaw, is rare and probably affects fewer than 1% of all computers it has sold. Since the problem was first identified in its servers. the large computers often used to manage databases and handle e-commerce tasks, Sun (sunw) engineers have put together a variety of hardware and software fixes that pear to reduce the risk of spontaneous crashe Forbe's Magazine, November 13, 2000: "It's ridiculous. I've got a \$300,000 server that doesn't work. The thing should be bulletproof," says [Bell South's] president. "it was found that a single soft fail ... was causing an entire interleaved system farm (hundreds of computers) to crash." [Cypress Semiconductor reports in 2004, SER: History, Trends, and Challenges 2004] "a single soft error brought a to halt every month." [Cypress Semiconductor reports in 2004, Mukherjee 2007] billion-dollar automotive factory ### **History Provides More Clues** 1978: Discovery of soft errors from alpha-particles – "May and Woods Incident - Intel 2107 series 16Kb DRAMS. - Trace radioactivity in ceramic packaging - Downstream from uranium mine in CO! 1981: IBM experiences soft error problems - 16 Kbit DRAM memory chips - Radioactive Kr85 contaminating the packaging - Chemical used to clean the containers holding acid - Approximately 2% of the modules contaminated 1992: Radioactive thorium in bat guano! - Bat cave near Louisiana mine producing raw material for phosphoric acid used to etch aluminum wires in chip - Shut down semiconductor factory for 8 weeks ### History Provides a Clue 1954-1957: Discovery of soft fails in digital electronics – Nuclear Bomb Testing 1975: Soft errors in satellites from **solar** *particles* – 4 errors in 17 satellite-years - Hughes Aircraft ### Today's Problem Foretold by Profits 1978: IBM predicts soft errors from cosmic rays "An alpha-particle can cause a sudden burst of 1M electrons in a semiconductor over a path length of a few microns. This was the dimension of the new 16Kb FET memory cells." [Zeigler et al. 1998] ### **Soft Errors** - For a soft error to occur at a node in a circuit, the collected charge Q at that node must be more than $Q_{critical}$ . - Q<sub>critical</sub> is proportional to the node capacitance and the supply voltage. - As CMOS device sizes decrease, the charge stored at each node decreases (due to lower nodal capacitance and lower supply voltages). ### **Soft Errors** - As silicon process shrinks, the node area decreases, lowering the probability of a particle hitting. This lowers per bit error rate. - There are more bits per chip, so the system is now less reliable. Moore's Law as it relates to reliability: Error rate doubles every generation! ### Aircraft Control - Altitude of 30,000 feet 100x error rate - Use four 1M 130nm SRAM-based FPGA (0.074 upsets/day) - One system per aircraft - 4,000 flights over the north pole/day (increased neutron flux) - 37 aircraft will experience an FPGA configuration error everyday! ### **Logic Failures** ### Soft errors become hard truth for logic By Ron Wilson David Lammers , EE Times May 03, 2004 (9:39 AM EDT) URL: http://www.eetimes.com/article/showArticle.jhtml?articleId=19400052 Phoenix — Those nasty neutrons that have plagued memory chip designers for the past two decades are now giving logic designers a headache, too. But while error correction coding has reduced soft-error rates (SERs) in DRAMs and SRAMs, no such quick fix exists for logic, and all current solutions involve extra cost and a drag on performance. "Logic SER may become as significant as SRAM error rates," predicted Hans Stork, the chief technology officer at Texas Instruments Inc. (Dallas), in a keynote speech here last week at the International Reliability Physics Symposium. Soft errors in logic devices are a growing concern for mission-critical systems such as servers, automotive ICs and networking equipment. Logic chip vendors already are working with system customers on ways to guard against the effects of cosmic rays and alpha particles emitted from packaging. However, reliability engineers at last week's symposium said no easy solutions exist. In the case of ASIC designs, the implications are different and the countermeasures reach deeper into the system design. # Measuring ## **Important Definitions** - SDC: Silent Data Corruption - DUE: Detected & Unrecoverable Error - TRUE DUE: Affects Output - FALSE DUE: Doesn't Affect Output - SER: Soft Error Rate (SDC + DUE) ## Measuring Faults: Interval Based - MTTF = Mean Time to Failure - MTTR = Mean Time to Repair - MTBF = Mean Time Between Failures = MTTF + MTTR - Availability = MTTF / MTBF - MITF = Mean Instructions to Failure [Mukherjee] - MWTF = Mean Work to Failure [Reis] - Performability = MWTF ## Measuring Faults: Rate-Based - FIT = Failure in Time - 1 FIT = 1 failure in a billion hours - 1 year MTTF = $10^9 / (24 * 365)$ FIT = 114,155 FIT - SER FIT = SDC FIT + DUE FIT | Chip | Physically bombard with<br>neutrons from nuclear source<br>(Accelerated Testing) | |----------------------------|----------------------------------------------------------------------------------| | Circuit<br>Models +<br>RTL | Obtain raw error rate<br>Statistical fault injection | # Solutions ### Not All Errors Result in a Problem - Architectural Correct Execution (ACE) - Some bits are ACE - Some are unACE - Logical masking - Performance enhancing instructions - NOPs - Wrong-path instructions - Idle units - Dynamically dead instructions and registers ## Physical Solutions are Hard - Shielding? - No practical absorbent (e.g., approximately > 10 ft of concrete) - Radiation-hardened cells? - Improvement possible with significant penalty in performance, area, cost ## **Design Tradeoffs** - Chip layout area penalties - Latch areas increase from ~3x to >5x - Operating frequency penalties - Setup time increases by twice the sampling AT - Maximum frequency dependency: $$1/F_1 = 1/F_0 + 2\Delta T$$ $F_1$ and $F_0$ = maximum frequencies $\Delta T$ = Sampling delay time ## Fault Protection is Expensive - IBM historically adds 20-30% additional logic for mainframe processors for fault tolerance [Slegel 1999] - In 2003, Fujitsu released SPARC64 with 80% of 200,000 latches covered by transient fault protection [Ando 2003] 70 # The hamming distance between code words is 3. Allows: Error DETECTION for Hamming Distance = 1. Error CORRECTION for Hamming Distance = 1. # Memory Protection # Error-Correcting Codes Example: Hamming Codes $P_1P_2B_3P_4B_5B_6B_7$ e.g. If B3 flips $P_1\oplus B_3\oplus B_5\oplus B_7=0$ 1 $P_2\oplus B_3\oplus B_6\oplus B_7=0$ 1 = 3 $P_4\oplus B_5\oplus B_6\oplus B_7=0$ 0 with $2^{\rm K}>={\rm m+k+1.}$ m # data bit, k # check bit For 64 data bits, needs 7 check bits | Word<br>Length | | Area Increase<br>for ECC bits | Delay<br>(EXOR-Gate Tree Depth) | |----------------|---|-------------------------------|---------------------------------| | 16 | 5 | 31% | 4 | | 32 | 6 | 19% | 5 | | 64 | 7 | 11% | 6 | | 128 | 8 | 6% | 7 | | 256 | 9 | 3.5% | 8 | Overhead for ECC ### Solutions for Temporal Multiple Errors - Natural Effects - Whenever a processor reads a cache block, we can correct the single bit error - Check for errors when cache blocks are replaced from the cache - More Powerful ECC - Scrubbing - Periodically read memory and correct all single bit - Disallows accumulation of temporal double bit errors - Standard technique in main memories (DRAMs) # Impact of Scrubbing on Temporal Double Bit MTTF 16 Gigabyte Cache 1000000 FIT/bit • For 16 gigabytes of cache, scrubbing can help • IBM DUE MTTF goal of 10 years # **Parity Caches** Index W 4-way set ECC for tag (6 bits) Data (512 X 32 Bytes) Status bits ECC unit → Error flags ## **Efficient Cache Integrity Approaches** - Parity Caches [Kim et. al. ISCA 1999] - Shadow Checking [Kim et. al. ISCA 1999] - In Cache Replication [Zhang et.al. DSN 2003] **Shadow Checking** Processing unit Shadow2 cache Shadow cache Comparator or voter Second level cache (L2) or memory ### ICR: In-Cache Replication ### Simple Idea: Replicate actively used data within the cache by evicting data that may not be needed. ## ICR: Exploring the Design Space - How aggressively to predict dead blocks? - When to replicate? - Where to replicate? - How aggressively to replicate? - How many replicas do we need? - How to protect cache blocks? - How to place a replica in a set? ### Recall Sun Microsystem's Problems - In 2000, Sun server systems deployed to America Online, eBay, and others crash. - Cache memory is a large target for cosmic rays - UltraSPARC II ECC mechanism was defective - Proved damaging to Sun's image # Microarchitecture ## Redundant Multithreading (RMT) RMT = Multithreading + Fault Detection (& Recovery) | | Multithreading<br>(MT) | Redundant<br>Multithreading<br>(RMT) | |---------------------------------|-----------------------------------------|------------------------------------------------| | Multithreaded<br>Uniprocessor | Simultaneous<br>Multithreading<br>(SMT) | Simultaneous &<br>Redundant<br>Threading (SRT) | | Chip<br>Multiprocessor<br>(CMP) | Multiple Threads running on CMP | Chip-Level<br>Redundant<br>Threading (CRT) | ## **AR-SMT Delay Buffer** - Simple, fast, hardware-only state passing for comparing thread state - Ensures time redundancy: the A- and R-stream copies of an instruction execute at different times - Buffer length adjusted to cover transient fault lifetimes - Delay Buffer contains perfect "predictions" for R-stream! ### AR-SMT Fault Detection/Recovery - Fault detected when thread state does not match - Committed R-stream state is checkpoint for recovery - Introducing a second, redundant thread increases execution time by only 10% to 30% ### SRT: Input Replication - Allow both loads to probe cache: false faults with I/O or shared mem - Load Value Queue (LVQ) - pre-designated leading & trailing threads # Dual Instruction Execution (DIE) DIE [Ray et al., MICRO'01] Instructions Replicated at Decode Fetch Decode Execute/Writeback Commit Execute/Writeback Computation Checked at Commit ### SRT: Temporal Redundancy - Provides protection for random logic from transient faults - Execute multiple copies of same instruction over time - Minimal hardware overheads - Performance Overheads - 30% IPC drop on superscalar (Ray et al. '01) - 21% IPC drop on SMT (Vijaykumar et al. '02) - Reason for IPC drop? - Resource contention ### **DIE: Processor Resources** - Contended Resources - ALUs - Issue Window/ROB/PRF (RUU) entries - Decode/Issue/Commit Bandwidth - Un-contended Resources - Fetch Unit - Memory Ports # Software/Compiler ## **Software Techniques** - Can be applied to existing hardware - Can be applied to existing applications - Can be used today to increase reliability ## **Basic Philosophy** If a tree falls in the forest, but nobody is around to hear it, does it make a sound? If a fault affects some data, but does not change the output, does it make a error? Only store operations effect output, so validate data before stores. ### Control Flow Protection: Example Compute signature before branch Validate on block entrance br L\_T (r1' > 0) $r_sig = 1$ jmp L\_F $L_T: r_sig = 2$ L F: br L2 (r1 > 0)Block 1 Block 2 L1: br faultDet L2: br faultDet r\_sig != 2 r\_sig != 1 br faultDet, br faultDet, Block 9 r2 != r2' r2 != r2' L9: br faultDet st [r2] = attack r\_sig != 9 br faultDet r2 != r2' st[r2] = stay ### Software Modulated Fault Tolerance - Software flexibility allows tradeoff between performance and reliability - Tune redundancy based on function reliability - Compared to best hybrid - Same reliability - Better performance - 6.4% speedup Hybrid ### Review of Instruction Level Techniques ### Hardware solutions - Lockstepping [Stratus DMR] - RMT [Reinhardt & Mukherjee, ISCA Hardware cost No application software changes Fixed solution applied to all Visibility into all state More resources reduce performance degradation ### Software solutions - EDDI, CFCSS [Oh et al.] - Source-to-source [Rebaudengo No hardware cost Require software changes Flexibility to continually trade off reliability and costs in the field Visibility limited to architectural state Fixed resources Hybrid solutions take benefits from both Tradeoff hardware, performance, and reliability ### Redundant Multithreading Hardware-only approach Redundant code executes in separate engine 2 hardware context Hardware requirements: Multi-threaded machine Load Value Oueue · to ensure data loaded from memory is identical to both hardware contexts Pipeline Checking Store Buffer compare both versions of data before committing data to memory No software changes Fixed redundancy for application Checking Only half of the hardware contexts Queue Buffer available to Operating System ``` Schedule Thread 1 Thread 2 1d r1 = [r2] 1d r1' = [r2'] 1d r1' = [r2'] add r1 = r1 + 1 add r1' = r1' +1 add r1' = r1' +1 st[r2] = r1 st[r2'] = r1' st[r2'] = r1' Instruction duplication, register allocation, and scheduling can be moved into software Hybrid scheme: CRAFT - CompileR Assisted Fault Tolerance [Reis et al.] ``` ## **Fault Detection Requirements** Mechanism to ensure original and redundant reads from memory receive same values Mechanism to create redundant computation Mechanism to compare original and redundant results before writes to memory Mechanism to guarantee correct control flow **Control Flow Protection** - Original and redundant computation in one thread - Incorrect control flow will divert both versions - Redundant and original may compute the same, but incorrect value - Compiler adds instructions to compute redundant PC - Set before branch - Validate at destination - Not perfect, but effective br (b == 0) mov r1 = 1 mov r1' = 1 mov r1' = 0 Fault Detection Requirements Mechanism to ensure original and redundant reads from memory receive same values Mechanism to create redundant computation Mechanism to compare original and redundant results before writes to memory Mechanism to guarantee correct control flow ### Hybrid Reliability - Reliability evaluated using fault injection (3 structures) - Single bit flip per execution - 5000 injection executions per structure per benchmark per system - Use combination of microarchitectural and architectural simulation - Mean Work To Failure - Encompass longer execution time and increased reliability - Generalization of Mean Instructions To Failure [Weaver et al. ISCA '04] - · Instruction not constant unit of work in hybrid systems - Proportional to: 1 / (Architectural Vulnerability \* Execution time<sub>unit of work</sub>) 42 ## **Hybrid Fault Tolerance Summary** "Design and Evaluation of Hybrid Fault-Detection Systems" [Reis ISCA-32 05] ``` 1d r1 = [r2] 1d r1 = [r2] 1d r1' = [r2'] 1d r1' = [r2'] add r1 = r1 +1 add r1 = r1 + 1 add r1' = r1'+1 add r1' = r1' + 1 br faultDet, r1 != r1' br faultDet, r2 != r2' st[r2] = r1 st[r2] = r1 st [r2'] = r1' Better performance than SWIFT: Better reliability than SWIFT: 3% speedup 75% reduction in abnormal execution 25% reduction in incorrect output ``` ### Acknowledgements - Vijay Narayanan - Shubu Mukherjee - George Reis - Mark Bohr