|Tutorial 1: Dependable Processor Design|
Dependable Processor Design
Monday, May 28, 2012, 14:00-18:30
Peter Harrod, Processor Division, ARM
In this session, I'll try to bring together all the theory that you have learnt about fault tolerance and show how it can be applied in a real practical example. This example will be based around an existing dependable processor.
Embedded processors are used in many applications that require a defined level of reliability, safety and/or availability. There are many approaches to providing the required level of fault tolerance - at the circuit, logic, microarchitecture, chip and system level - each of which incurs a certain cost.
In many volume applications, for example in the automotive market, you need to achieve a balance between reaching the required level of reliability, safety and/or availability and the additional cost involved. So when designing a dependable processor, you need to consider not only the kinds of faults that might occur but also the effect that these faults might have on system operation - and think carefully about how to protect against these faults without adding too much to the system cost.
In this session, I'll start by saying what I mean by dependability in the context of an embedded processor and then take a look at safety standards and what these imply about processor design. I'll then discuss the kinds of faults that might affect a processor's operation - hard versus transient faults, latent faults and wear-out mechanisms - and how these faults might be detected (and possibly corrected).
I'll then look at how the requirements for reliability, safety and availability can be translated into real systems and discuss some dependable architectures. Using the example of an embedded processor that was designed to satisfy this market, I'll look in detail at how features such as ECC on the memories, error caches, dual-core lock step and processor diversity address these requirements. I'll also look at how external monitoring hardware can be used in conjunction with an embedded processor to achieve the required dependability.
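The dual-core lock-step idea mentioned above can be illustrated with a minimal sketch. This is not the architecture of any particular ARM processor: the operation, the state, and the fault-injection hook are all hypothetical, and real lock-step is implemented in hardware, typically with the redundant core delayed by a few cycles to avoid common-mode transients.

```python
def alu_step(acc, operand):
    """Hypothetical one-cycle operation: accumulate with 32-bit wrap-around."""
    return (acc + operand) & 0xFFFFFFFF

def lockstep_run(inputs, inject_fault_at=None):
    """Run two redundant copies of the computation and compare every cycle.

    Returns (result, fault_detected). `inject_fault_at` flips one bit in the
    shadow core's state at the given cycle, modelling a transient fault.
    """
    main, shadow = 0, 0
    for cycle, op in enumerate(inputs):
        main = alu_step(main, op)
        shadow = alu_step(shadow, op)
        if cycle == inject_fault_at:
            shadow ^= 0x4              # single-bit upset in the shadow core
        if main != shadow:             # the comparator raises the error signal
            return main, True
    return main, False

result, fault = lockstep_run([1, 2, 3])                     # (6, False)
result, fault = lockstep_run([1, 2, 3], inject_fault_at=1)  # fault is True
```

The key property the sketch shows is that the comparator catches the mismatch in the same cycle the fault appears, so the system can move to a safe state before a corrupted result propagates.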
Finally, I'll discuss some future challenges in dependable processor design, including the effect of process scaling. I'll briefly describe some experimental test structures that could be used to detect and mitigate failure mechanisms, such as wear-out.
Peter Harrod (IEEE Member '80-Senior Member '99) graduated with a BSc(Eng) from the University of the Witwatersrand in 1976 and with MSc and PhD degrees from the University of Manchester Institute of Science and Technology in 1978 and 1982 respectively.
From 1982-1985, he was a Research Engineer in the Very High Performance IC Laboratory at the GEC Hirst Research Centre, where he was involved in pattern processing for E-beam lithography and in the implementation of CMOS-SOS VLSI ICs. From 1985-1988, he was a Senior Staff Engineer in the High-End Microprocessor Group at Motorola Inc in Austin, Texas, where he did logic and circuit design for the MC68030 and MC68040 microprocessors.
In 1988, he joined Acorn Computers Ltd in Cambridge, UK, where he was a Senior Design Engineer in the Advanced R&D Department and was involved in the design of a floating point chip and carried out one of the first implementations of IEEE 1149.1 boundary scan.
ARM was spun out from Acorn Computers in 1990 and he was one of the founding team. Since then he has worked on a wide variety of CPU, SOC and debug and trace units. He is now a Manager in the CPU design group at ARM, where he continues to work on the design and verification of embedded CPUs. He has a particular interest in the areas of design for test and debug and in the design of dependable processors.
He is a Fellow of the IET and has served on several IEEE standards and conference program committees.
|Tutorial 2: Hardware- and Software-Fault Tolerance|
Hardware- and Software-Fault Tolerance, Design and Assessment of Dependable Computer Systems
Monday, May 28, 2012, 14:00-18:30
Jean Arlat, LAAS-CNRS
This lecture covers the main design and assessment issues to consider when developing dependable computer systems. It is organized into four main parts.
After a short introduction aimed at motivating the relevance of the topic covered, the first part briefly introduces the general concepts and related terminology for dependable computing including the notions of fault, error and failure and the main approaches towards dependability: fault tolerance, fault removal and fault forecasting.
The second part addresses the fault tolerance techniques (encompassing error detection, error recovery and fault masking) that can be used to cope with accidental faults (physical disturbances, software bugs, etc.) and, to some extent, malicious faults (e.g., attacks, intrusions). In particular, several forms of redundancy (spatial, temporal, data, etc.), as well as the important notion of diversified design, will be described and illustrated by means of examples.
The third part covers the methods and techniques - both analytical (stochastic processes) and empirical (controlled experiments) - that can be used to objectively assess the coverage of the fault tolerance mechanisms and thus infer the level of dependability achieved. The actual impact of fault-tolerant architectures on dependability, leading to the essential notion of coverage (with respect to fault tolerance), is precisely identified and exemplified. A special focus is put on controlled experiments based on fault injection techniques (hardware-, simulation- and software-based fault injection).
The fourth and last part describes the most recent trends in controlled experiments aimed at developing benchmarks for robustness testing purposes and for fairly comparing the dependability features of several computer systems and components.
Finally, a few concluding remarks will depict some emerging challenges and future trends in the domain of dependable computing.
Jean Arlat was born in Toulouse (FR) in 1953. He received the Engineer diploma from the Toulouse National Institute of Applied Sciences (INSAT) in 1976 and the Docteur-Ingénieur and Docteur ès Sciences degrees from the Toulouse National Polytechnic Institute (INPT) in 1979 and 1990, respectively. He has been with LAAS-CNRS since 1976, where he currently holds a position of Directeur de Recherche at CNRS, the French National Organization for Scientific Research, within the Dependable Computing and Fault Tolerance Group that he led from 2003 to 2008. In January 2011, he was appointed deputy director of the laboratory, and he is currently the director. From 2007 to 2010, he coordinated the research area on Critical Information Systems, one of the four scientific domains characterizing LAAS research activities.
His research interests include the architecting of safe and secure embedded computerized systems and the dependability assessment of computer systems, using both analytical modeling and experimental approaches (especially fault injection). He has authored or co-authored more than 120 papers in international and national journals and conference proceedings, 3 books and 21 book sections.
He has contributed to several European projects and networks as well as various contracts with industry. In that respect, from 1997 to 2000, he led the Laboratory for Dependability Engineering (LIS), set up between LAAS and five leading companies: Airbus, Astrium, Electricité de France, AREVA TA and Thales. Subsequently, from 2001 to 2004, he coordinated the activities of the Network for Dependability Engineering (RIS) that extended the cooperation started within LIS.
From 1999 to 2005, he chaired the IFIP Working Group 10.4 on Dependable Computing & Fault Tolerance and received the IFIP Silver Core Award in 2007. In France, he is currently a member of the Board of Directors of the Association of Carnot Institutes that gathers selected research labs featuring significant partnership with industry.