A generic methodology to compute design sensitivity to SEU in SRAM-Based FPGA

*Citation for published version (APA):*

*Document license:*
Unspecified

*DOI:*
10.1109/DSD.2018.00050

*Document status and date:*
Published: 12/10/2018

*Document Version:*
Accepted manuscript including changes made at the peer-review stage

*Please check the document version of this publication:*
- A submitted manuscript is the version of the article upon submission and before peer-review. There can be important differences between the submitted version and the official published version of record. People interested in the research are advised to contact the author for the final version of the publication, or visit the DOI to the publisher’s website.
- The final author version and the galley proof are versions of the publication after peer review.
- The final published version features the final layout of the paper including the volume, issue and page numbers.

*Link to publication*

*General rights*
Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

- Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
- You may not further distribute the material or use it for any profit-making activity or commercial gain
- You may freely distribute the URL identifying the publication in the public portal.

If the publication is distributed under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license above, please follow below link for the End User Agreement:

www.tue.nl/taverne

*Take down policy*
If you believe that this document breaches copyright please contact us at:
openaccess@tue.nl
providing details and we will investigate your claim.

Download date: 27. Mar. 2019
A Generic Methodology to Compute Design Sensitivity to SEU in SRAM-based FPGA

Mahsa Mousavi and Hamid Reza Pourshaghaghi
Technical University of Eindhoven
Email: m.mousavi, h.r.pourshaghaghi@tue.nl

Mohammad Tahghighi
Hong Kong University of Science and Technology
Email: mtahghighi@ust.hk

Roel Jordans and Henk Corporaal
Technical University of Eindhoven
Email: r.jordans, h.corporaal@tue.nl

Abstract—Recently, SRAM-based FPGAs are widely used in aeronautic and space systems. As the adverse effects of radiations in space are much higher than in the Earth, developing fault tolerant techniques play crucial roles for the use of electronics in space. However, fault tolerance techniques might introduce additional penalties in area, power, performance and design time. In order to compromise between the overhead introduced by these techniques and system fault tolerance, a generic methodology for calculating design sensitivity to Single-Event Upset (SEU) is proposed in this paper. Separate schema and test-bench for evaluating effects of SEU in various types of FPGA memory are proposed in which both the raw device error rate and the vulnerability characteristic of the specific application mapped on the device are taken into account. Experimental results show that using our model in order to selectively add Triple Modular Redundancy (TMR) improve the design robustness only 18% less than full TMR while roughly introduces 69% less redundancy compared to full TMR for Fast Fourier Transform (FFT).

Index Terms—Fault tolerance, FPGA, Single-Event Upset (SEU), Raw device error rate, Design vulnerability, Simulation-based injection

I. INTRODUCTION

The interest to use Commercial Off-The-Shelf (COTS) Static-RAM Field Programmable Gate Arrays (FPGAs) in aeronautics and space missions have been significantly increased because of the features offered by them. During past years, the Europian Space Agency (ESA) has strongly stressed the need for new generation of high-performance general purpose digital signal processing platforms for future space missions based on COTS [1],[2]. Operation capacity, performance and relatively low power consumption combined with reconfigurable properties make FPGAs ideal processing devices to be used in satellites and space systems. However, COTS SRAM-based FPGAs are not manufactured for operation in space [3] and therefore the real challenge is how to develop a radiation-resistant solution to ensure the error-free function of COTS in deep-space.

Charged particles striking the silicon substrate in space may cause an error in FPGAs and make them unreliable processing platform. Single-event upsets (SEUs) are a major cause of error in contemporary FPGAs, among other radiation-induced faults [4]. An SEU is a change of memory cell state caused by one single ionizing particle e.g. ions, electrons, photons, etc, striking a sensitive node in a microelectronic device. Because this type of the fault does not reflect a permanent error of a device, it is termed as soft or transient error. Device error rate due to single event upsets is determined by the device circuit and environment characteristics. With shrinking the size of transistors, the probability of SEU in a memory cell leads to an error will increase as smaller deposit charge is capable to induce error in the transistors. On the other hand, as the size of memory cell decreases, the number of high energetic particles strike one cell decreases. The combined effect is that the error rate in one memory cell usually remains roughly constant or it even decreases. However, the error rate for an entire device exponentially increases for new generations as the total number of bits in a device exponentially increases according to the Moore’s Law. Thus, fault detection isolation and recovery (FDIR) techniques are highly required for new generation devices.

A variety of fault tolerance techniques are already developed and exist, from special radiation-hardened circuit designs (e.g., [5]) to reliability-oriented place and route algorithm (RoRA) and to architectural redundancy (e.g., [6]-[8]), also scrubbing of CM (e.g., [9]-[12]). However, many of these approaches introduce significant penalties for performance, power consumption, logic utilization, and/or design time. Thus, the benefits of applying fault-tolerant techniques versus their extra cost must be carefully weighed by FPGA designers. A good estimation of design sensitivity to soft errors provides circuit designers with some key criteria to balance between fault tolerance features added to the design and overhead costs. The currently existing techniques and previously presented fault models for estimating FPGA design sensitivity to SEU still lack a comprehensive, scalable and clear technique.

This paper, therefore, presents a generic model for estimating FPGA design sensitivity based on the fact that not all the faults affect or change final output results of a particular design. The main contributions of this paper, among others, include:

- proposing a generic model for estimating FPGA design sensitivity
- incorporating both raw device probability and design architecture in our model
- contemplating of SEU effects on all different types of FPGA memories i.e. configuration memory (CM), BRAMs, and flip-flops in our model

A variety of fault tolerance techniques are already developed and exist, from special radiation-hardened circuit designs (e.g., [5]) to reliability-oriented place and route algorithm (RoRA) and to architectural redundancy (e.g., [6]-[8]), also scrubbing of CM (e.g., [9]-[12]). However, many of these approaches introduce significant penalties for performance, power consumption, logic utilization, and/or design time. Thus, the benefits of applying fault-tolerant techniques versus their extra cost must be carefully weighed by FPGA designers. A good estimation of design sensitivity to soft errors provides circuit designers with some key criteria to balance between fault tolerance features added to the design and overhead costs. The currently existing techniques and previously presented fault models for estimating FPGA design sensitivity to SEU still lack a comprehensive, scalable and clear technique.

This paper, therefore, presents a generic model for estimating FPGA design sensitivity based on the fact that not all the faults affect or change final output results of a particular design. The main contributions of this paper, among others, include:

- proposing a generic model for estimating FPGA design sensitivity
- incorporating both raw device probability and design architecture in our model
- contemplating of SEU effects on all different types of FPGA memories i.e. configuration memory (CM), BRAMs, and flip-flops in our model
proposing two different test benches for evaluating SEU effects on most important FPGA memory types
• experimentally evaluating the proposed model with a common signal processing application in digital systems i.e. Fast Fourier Transform (FFT)

The rest of paper is organized as follows: Section II gives a brief overview of the related work and fault-tolerant techniques that have been published. The proposed methodology to calculate design sensitivity in FPGA designs is described in section III. In section IV, test bench implementation for simulation-based fault injection is explained and educational examples are summarized in Section V. Finally, section VI conclude the paper.

II. RELATED WORK

A lot of efforts has been already spent looking into soft-error effects in SRAM-Based FPGAs: simulation-based, radiation-based, modeling and analysis-based approaches [13]-[24]. The first two approaches mainly exploit Fault-Injection strategies where the process of fault injection is achieved either using simulation tools or radiation equipment.

In radiation-based methods [13]-[14], a prototype model of a system under test is exposed to a flux of radiations generated either by radioactive sources or particle accelerators. The prototype FPGA under test then is continuously stimulated by a given set of input stimuli and the system outputs are compared with expected output values. Radiation-based methods are very expensive and capable of damaging the design under the test permanently. They are mainly used for device characterization, not for evaluation of the vulnerability of a particular mapped design.

The analysis-based methods [15],[16] make use of the analytical approach to evaluate SEU effects. These methods only require synthesis tool and software program and remove the necessity of any physical implementations. In these methods, first the probability of error in different design nodes is computed by modeling the error rate based on fault-behavior in different FPGA elements such as LUTs, multiplexers, programmable interconnect point (PIP) and etc. Next, the error propagation probability is computed using gate-level net-list. The vulnerability of a particular design is calculated afterward based on obtained error probability in nodes and probability of error propagation. The major drawback in these methods is in the accuracy of calculations, as only the model of some particular error behaviors is taken into account. In addition, these techniques are not easily applicable for various FPGA types as detailed information and implementation characteristic of FPGA devices are also required.

In simulation-based methods [17]-[24], SEU effects are estimated with flipping bits of CM while running a design on FPGA.

M. Wirthlin et al. [17] assessed the reliability of the FPGA circuit in the presence of configuration upsets. They developed an SEU simulator in order to artificially inject upsets into the CM and measure their impact on different FPGA designs. The simulator is based on the SLAAC-1V computing board consisting of three Virtex V1000 devices. Implementation results on two FPGAs are compared, one generates the golden results and the other produce results in presence of upsets.

L. Sterpone et al. [22] exploit from partial reconfiguration to develop a system for injecting faults into CM. A host computer controls and collects SEU simulated results and loads a list of SEU to be injected into the on-chip memory. Then hardened Power-PC microprocessor reads the SEU list through PLB bus and executes the partial reconfiguration fault injection.

M. Violante et al. [23] proposed a method for calculating the probability (is named circuit cross section in the paper) of system failure due to SEU in CM by combining the physical error rate which only depends on the FPGA device and the probability of errors alter the correct output of the circuit. The former is calculated by radiation-based injection and latter is calculated by simulation-based error injection.

Paul Graham et al. [19] investigate SEU effects in different FPGA configurable components. In a particular type of FPGA, they determine how much percentage of a total design failure is because of SEUs. The configurable components are categorized into five main type including mux select line, programmable interconnect point, buffer enables, LUT and control bit. It is been shown that most of the observed failures are due to routing structures.

In spite of radiation and analysis based methods, which consider all sensible design elements to SEU, the simulation-based papers only assess effects of SEU on CM. The reason is that in most of FPGAs CM contributes in a large number of total memories (i.e. 97%). However, this contribution of CM decreases in new generations of FPGAs to 80%. Table I compares the contribution of different memory elements for an older FPGA i.e. Virtex V1000 and a new one i.e. Kintex 7. Moreover, this contribution rate decreases even more when only actual memory usage for a design is taken into account. For example, from Xilinx report [25] summarized in table I, for a typical design only 36% of the memory bits belong to configuration bits.

One other major drawback of previous simulation-based techniques is that any obtained error rate calculation results cannot be referred for comparison purposes between SEU sensitivity of two different designs because SEU effect is only considered for CM and not for the entire design. Therefore, for any designs that consume more FFs or BRAMs, considerable parts of the design sensible to SEU are not accounted in SEU sensitivity. For example, the results in [17] show that the LFSR (linear feedback shift register) sensitivity to configuration SEU is less than that of the multiplier design. This is because LFSR mainly consists of flip-flops and uses less combination logic and routing in comparison with the multiplier. In order to

<table>
<thead>
<tr>
<th></th>
<th>CM</th>
<th>BRAM</th>
<th>FF</th>
</tr>
</thead>
<tbody>
<tr>
<td>Kintex7</td>
<td>81%</td>
<td>17%</td>
<td>0.5%</td>
</tr>
<tr>
<td>Virtex V1000</td>
<td>97%</td>
<td>2.2%</td>
<td>0.4%</td>
</tr>
<tr>
<td>Typical design</td>
<td>36%</td>
<td>62%</td>
<td>2%</td>
</tr>
</tbody>
</table>
of occurring simultaneous error in more than one memory element caused by a high energy particle is so small and negligible, DS could therefore be derived by summation of multiplication of partial $\epsilon_{des}$ and partial DVF for all types of design resources affected by the fault. Considering the partial calculation, (1) for design $L$ can be rewritten as:

$$DS_L = Pr(Err \text{ in } CM_L) Pr(failure|Err \text{ in } CM_L) + Pr(Err \text{ in } BRAM_L) Pr(failure|Err \text{ in } BRAM_L) + Pr(Err \text{ in } DiRAM_L) Pr(failure|Err \text{ in } DiRAM_L) + Pr(Err \text{ in } FF_L) Pr(failure|Err \text{ in } FF_L)$$

$$0 \leq DS_L \leq 1$$

(2)

where $Pr(A)$ is the probability of A happening and $Pr(A|B)$ is the probability of A happening given B has already occurred. For instance, in (2), $Pr(Err \text{ in } CM_L)$ represents partial $\epsilon_{des}$ and is interpreted as the probability of fault occurring in CM that belongs to design L. Likewise $Pr(failure|Err \text{ in } CM_L)$ represents partial DVF corresponding to CM and is defined as the probability of failure in the output of design, given that fault has been occurred in the CM of design L. In (2), partial $\epsilon_{des}$ as well as partial DVF need to be computed separately for various types of FPGA memory. Note that partial $\epsilon_{des}$ is depending on the cross-section of the memory which varies for each memory types. Similarly, DVF corresponding to different memory types need to calculated separately because faults are modeled differently and their impacts on design functionality are also different when they happen in each of dissimilar memory types. For example, SEUs in CM is modeled as permanent faults, considering there exists no mechanism for correcting them when they are faulty. However, the SEUs in FFs are modeled as transient faults since new healthy data might be written to those FFs in next clock cycles. Besides, faults in CM change the functionality of a circuit while faults in FFs only impact on the state of a design. In the following subsections, we discuss partial $\epsilon_{des}$ and DVF calculation methods for each of different memory types.

A. Raw Design Error Rate

When high energy particles such as galactic cosmic rays, cosmic solar particles and trapped protons in radiation belt hit silicon devices, electric charge is deposited which can produce a wide range of effects. These effects are expressed by Linear Energy Transfer (LET), in $MeV \cdot cm^2/mg$ referring to that amount of energy transferred along the particle path per one material volumetric mass density. LET depends on radiations amount and the material transferred. These effects can damage an electronic device temporarily or permanently. Permanent effects like single-event latch-up (SEL) or single-event gate rupture (SEGR) are beyond the scope of this paper. Permanent faults also require mitigation schemes like shielding, but may not be always unavoidable. Instead, temporary effects can be fixed and the functionality of electronic circuits can be recovered by applying fault-tolerance techniques. Examples of temporal faults are single-event transient (SET), single-event upset (SEU) and multiple-event upset (MEU).

III. METHODOLOGY

Figure 1 shows the process of how a failure is generated in a design. When high energetic particles hit the design, it may lead to SEU in memory elements of design, shown as error in fig. 1. In this situation, failure only happens if this error propagates through the given design and becomes visible at the final output of design. The key is that not all high energetic particles struck a device leads to SEU as well as not all SEUs lead to failure. Raw device error probability ($\epsilon_{des}$) indicates the probability of an SEU occurring in a device, which is perceived to be unconstrained from the circuit running on that device. The multiplication of $\epsilon_{des}$ and percentage of device resources utilized by a given design will indicate raw design error probability ($\epsilon_{des}$). The Design Vulnerability Factor (DVF) is defined as the probability of an happened error in a design resulting in a visible error in the final output which depends only on design architecture.

![Fig. 1: Process of failure generation](image)

The probability of failure for a design implemented in FPGA is defined as Design Sensitivity (DS), which is dependent on both device characterization and design architecture. Since not all SEUs lead to a failure in the final output of a given design, $\epsilon_{des}$ is not sufficient to estimate the sensitivity of the design to errors. Thus, considering only $\epsilon_{des}$ to determine design sensitivity will be an excessively high estimate and probably introduce the risk of increasing design costs by adding unnecessary fault-tolerant features. Thus DS is defined based on $\epsilon_{des}$ and DVF as follows:

$$DS = \epsilon_{des} \cdot DVF, \ 0 \leq DS \leq 1$$

(1)

Since a design composes of various resources, in order to calculate DS in (1), the multiplication of partial $\epsilon_{des}$ and partial DVF are obtained for each type of the resources in a design, separately. The probability that the resources with the same type in the design become erroneous is defined as partial $\epsilon_{des}$ and partial DVF is the probability that the error in that type of resources leads to visible failure in the output. Since SEU only affects the state of a memory, the partial multiplications are only calculated for various types of memory in FPGA which are the configuration memory (CM), block RAMs (BRAM), distributed RAMs (DiRAM) and flip-flops (FF). CM is the largest amounts of FPGA total memory bits and stores the functionality of a design. Block RAMs, distributed RAMs, and flip-flops contain the state of a design. Since the probability

compare these two design in a fair manner, flip-flop sensitivity to SEU is also required to be evaluated.
TABLE II: the cross-section ($cm^2$/bit) of various memory types for LET=20 $MeV \cdot cm^2/mg$ [28] and [27]

<table>
<thead>
<tr>
<th>Model</th>
<th>CM</th>
<th>BRAM</th>
<th>FF</th>
</tr>
</thead>
<tbody>
<tr>
<td>28nm Kintex7</td>
<td>4.12e-9</td>
<td>7.94e-9</td>
<td>5.0e-9</td>
</tr>
<tr>
<td>Zynq-7000</td>
<td>1.40e-9</td>
<td>2.14e-9</td>
<td>2e-9</td>
</tr>
</tbody>
</table>

A SET is a temporary voltage spike accrued in combinational logic of the FPGA. SET does not result in operation failure of circuit running on the device except when it is captured with flip-flops. An SEU is a single bit flip in the memory elements of the circuit including flip-flops, latches and SRAM cells which potentially affects the correct function of a circuit. An MBU is the state change of two or more adjacent memory bits induced by a single particle strike. From [26], MBU rate in FPGAs is negligible. Configurations of circuitry in FPGAs are stored in SRAM cells and that makes FPGAs more vulnerable to SEU than ASIC chips.

Memory cross-section ($\sigma$) is the probability of an upset occurring per particle fluence which is obtained from the ratio of the number of observed SEUs by the particle fluence (particle per $cm^2$) received by the component under test during the test period:

$$\sigma = \frac{\text{number of SEUs}}{\text{fluence}} \quad (3)$$

In order to calculate $\sigma$ whole FPGA device is irradiated to a specified fluence and after stopping the beam, the value of memory cells are compared with predefined values. For each specific memory type, the number of SEUs is divided by total fluence and then is normalized by the total number of bits to obtained the $\sigma$ per bit. $\sigma$ is related to particle LET as well as memory bit characterization such as memory type, implementation technology, and memory layout. That being defined, cross sections are different for each particular memory types i.e. CM, block RAM and flip-flop and they are also different for various FPGA models. Table II shows the cross-section of various memory types for Kintex-7 and Zynq-7000 Xilinx FPGA series used from [28] and [27] experimental results, respectively.

The $\epsilon_{dev}$ associated with memory type M is calculated as follows:

$$\epsilon_{dev,M} = \left( \sum_{total} b_M \right) \cdot \sigma_M \quad (4)$$

where $\sum_{total} b_M$ is the total number of bit in memory type M and $\sigma_M$ is the cross-section per bit of that memory type.

Similarly and for memory type M, $Pr(Err \ in \ M)$ or equally $\epsilon_{des,M}$ for a specific design L is calculated as follows:

$$\epsilon_{des,M} = \left( \sum_{L} b_M \right) \cdot \sigma_M \quad (5)$$

where $\sum_{L} b_M$ is the total number of bits in memory type M that contribute in design L. Table III shows design raw error of a typical design which uses $7Mb$ configuration bits, $0.5Mb$ flip-flops and $15Mb$ block RAMs. The resources utilization numbers are obtained from Xilinx report [25] for a typical design implemented in the 28-nm Kintex-7 FPGA.

TABLE III: $\epsilon_{des}$ for typical design implemented in the 28-nm Kintex-7 FPGA based on cross-section presented in table II

<table>
<thead>
<tr>
<th>Model</th>
<th>CM</th>
<th>BRAM</th>
<th>FF</th>
</tr>
</thead>
<tbody>
<tr>
<td>Used bits</td>
<td>7Mb</td>
<td>15Mb</td>
<td>0.5Mb</td>
</tr>
<tr>
<td>$\epsilon_{des}$</td>
<td>2.88e-2</td>
<td>1.19e-1</td>
<td>2.5e-3</td>
</tr>
</tbody>
</table>

B. CM-Wise Design Vulnerability Factor

As already mentioned, in SRAM-based FPGAs design circuitry is stored in CM which is a static random-access memory (SRAM). SRAM memories are vulnerable and very sensible to SEUs and an upset in CM bits can affect or change the functionality of a design e.g. a look-up table (LUT), or routing between FPGA nodes. Figure 2 illustrates one example of such failures is a test design. Figure 2a shows a simple circuit consisting of a 2-input LUT (yellow square) and a routing switch (white square) within an FPGA. The LUT configuration bits were set to the values which implement an AND gate and inputs are connected to net A and B using a routing switch. The small yellow squares in routing switch represent configuration bits. The configuration bits of paths (N2, E1),(S2, E2) are active and connect the nets A and B to the inputs of LUT. Two scenarios are described in fig. 2b and fig. 2c in which SEU changes the circuit and causes a failure in the output result. In fig. 2b an SEU changes one bit of LUT from zero to one, exchanging AND gate functionality with an XOR gate. In fig. 2b SEU in the routing switch configuration bit leads the path (N2, E1) to be disconnected. Consequently, the LUT input is disconnected from net A.

Essential bits are fractions of CM bits that determine the circuitry of a given design in FPGAs. If an essential bit is upset it follows that the design circuitry is changed, however the upset might not have an effect on the design functionality. Those subset of essential bits that result in a failure features if they change are defined as critical bits. In 2, the probability of failure happening in design while an error occurs in CM $Pr(\text{failure}|\text{Err in CM}_L)$ is defined as CM-wise Design Vulnerability Factor (CMDFV) and is calculated as follows:

$$CMDFV = \frac{\text{critical bits number}}{\text{essential bits number}} \quad (6)$$

the essential bits are identified by Xilinx essential bits technology. However, the list of critical bits is required to be generated by validating the correct design behavior while moving an upset through all the essential bits in the design. The SEUs are modeled with bit-flip in the CM. Since the content of the CM will not change after FPGA are programmed, an upset remains until a correction mechanism modify it or FPGA is reprogrammed again.

C. Flip Flop-Wise Design Vulnerability Factor

In FPGAs, design states are stored in flip-flops which can also be influenced by SEUs. Same as configuration memories, not all the errors in flip-flops leads to a failure. For example, if a flip-flop is upset before its faulty bit of information is validated, the correct design behavior while moving an upset through all the essential bits in the design. The SEUs are modeled with bit-flip in the CM. Since the content of the CM will not change after FPGA are programmed, an upset remains until a correction mechanism modify it or FPGA is reprogrammed again.

The vulnerability of flip-flops if it is affected by SEUs is calculated as follows:

$$Pr(\text{flip-flop failure}|\text{Err in FF}_L)$$

In fig. 2c in which SEU changes the circuit and causes a failure in the output result. In fig. 2b SEU in the routing switch configuration bit leads the path (N2, E1) to be disconnected. Consequently, the LUT input is disconnected from net A.
output of circuit, in this case, the flip-flop upset does not have any impacts on circuit functionality.

Bit-flip is used to model SEU in flip-flops. This is a similar model used for CM with only one difference that flip-flop state could also be changed during execution of the circuit. This implies that injecting faults into flip-flops last only for one clock cycle. Furthermore, the fault may produce different effects if injected in different clock cycles. If \( F \) is the number of flip-flops in a circuit and \( C \) is the number of clock cycles of the design latency, the complete set of single errors is therefore \( F \times C \). The subset of these errors that result in a failure in design output is defined as effective errors. In 2, the \( \text{Pr}(\text{failure} | \text{Err in } F_{FL}) \) is defined as flip-flop-wise Design Vulnerability Factor (FFDVF) and is calculated as follows:

\[
\text{FFDVF} = \frac{\text{effective errors number}}{(F \times C)}
\]  

The effective number of errors are computed by comparing output response of both 1) error-free and 2) erroneous circuit while we allow the circuit running in presence of errors. A faultless behavior of a design is evaluated by injecting errors in flip-flops. Errors are injected in all flip-flops, and for each of them at any desired clock cycles. Assuming fault is injected, if outputs of the circuit is different from expected outputs, then the error is counted as an effective error.

The method presented in this section is also used for fault modeling of BRAMs and distributed RAMs as they react to errors similar to flip-flop. BRAM and distributed RAM design vulnerability factors \( \text{Pr}(\text{R}_{\text{failure}}|\text{Err in } BRAM_{L}) \) and \( \text{Pr}(\text{R}_{\text{failure}}|\text{Err in } BRAM_{L}) \) are therefore called BDVF and DiDVF, respectively.

IV. IMPLEMENTATION

In this section, the schemes and test bench implementations for calculating partial vulnerability factor corresponding to each type of memory are presented.

A. CM Test Bench Implementation

We implemented a test bench system in Xilinx ZYNQ-7000 FPGA to quantify critical bits. Figure 3 depicts the block diagram of our test bench system. The test bench controller is executed on embedded ARM processor of ZYNQ to perform the evaluation process in the algorithm 1. The Design Under Test (DUT) represents the target design implemented in the programmable logic (PL). The test bench controller provides the inputs and valid signal for DUT or reads its generated output via AXI bus and DUT interface. The Xilinx Soft Error Mitigation IP core (SEM IP) implemented in PL is used to inject faults into the FPGA CM. The SEM IP Controller is an automatically configured, pre-verified solution to detect, correct and inject soft errors in CM of Xilinx FPGAs. The SEM IP can access and modify CM through ICAP interface and partial reconfiguration. The test bench controller communicates with SEM IP by sending commands through UART interface. Fault injection is applied to all essential bits of design one by one in test bench. A list of essential bits is generated by the Xilinx Vivado Design Suite tool for the entire circuit implemented in FPGA, including the SEM IP. In order to determine the essential bits of DUT, Vivado tool is forced to map the design to a specific physical location of FPGA die. Then the list of design essential bits is extracted from the range of CM addresses corresponding to the design physical location. The test bench controller flow executed for each of the essential bit of design is explained in algorithm 1. In this test bench controller, first, the DUT is executed to obtain the golden output for a provided input. After that and to evaluate whether the essential bit is critical or not, an upset is injected into it by means of sending a command that includes the physical addresses of the essential bit to SEM IP core. When SEM IP confirmed that the essential bit value is toggled, then the same inputs are provided by the test bench controller for DUT and the design execution is initiated again. After that,
**Algorithm 1:** Test bench controller flow for calculating critical bits number

1. **INPUTS:** essential bits location, input-sets
2. **OUTPUTS:** critical bits number
3. for $i = 0, i < \text{input} - \text{sets number}, i++$ do
4. current input vector = read first input vector from input-set(i);
5. current essential bit = read the location of the first essential bit in essential bits list;
6. for all essential bits do
7. current input vector = read current input vector from input-set(i);
8. current essential bit = read the location of next essential bit in essential bits list;
9. if output $\neq$ golden output then
10. Add the number of critical bits;
11. Flip back the value of current essential bit;
12. current input vector = read next input vector;
13. current essential bit = read the location of next essential bit;
14. end
end

the output of the DUT in presence of the fault is compared with the golden output. If they are different, the evaluated bit is accounted as a critical bit. Finally, the value of the infected bit is flipped back by sending a new injection command with the same physical address.

A critical bit is input-dependent, which means a bit may be accounted as a critical bit for some inputs while for others may not. Thus, an average is taken across all the critical bits number for different input sets. Each input set composed of a set of input vectors that are randomly generated or generated based on input vectors used in a real application.

**V. EDUCATIONAL EXAMPLEs**

Algorithm 2 depicts the schema for evaluating the design sensitivity in presence of errors in flip-flops. The evaluation is based on simulating the behavior of the circuit using HDL model of the design. Error injection is achieved by modifying the HDL model to include fault injection capability. Thus, all the flip-flops in the HDL model are replaced with modified versions enabling us to inject faults.

Figure 4 shows the standard implementation of flip-flops and the modified flip-flop which has an extra XOR gate. In the modified flip-flop the input is XORed with an inject signal. When the inject signal is high the value of flip-flop is flipped, otherwise it is associated only with input. The inject signal must be in the input list of the top module of the design so that test bench module can access it and control it when a fault required to be injected. Algorithm 2 presents the flow of calculating FFDVF. Since it is very time-consuming to manually modify codes in big designs, for the sake of fault-injection, the net-list of the design generated with an HDL synthesizer tool is automatically parsed and modified. A sample-list file is also generated which contains inputs, expected outputs, target flip-flop and the clock cycle of injection for each simulation run. The modified net-list and sample-list of the design are used by test bench that performs the evaluation process as presented in following steps:

1. Read inputs, expected output, flip-flop number N and clock cycle number C from sample-list.
2. Run the test bench until the moment error has to be injected.
3. Inject error by activating the inject signal bit associated with the flip-flop N for one clock cycle.
4. Continue simulation until the testbench is completely executed.
5. Compare the circuit output with the expected output.

For large circuits and for long test benches, the evaluation process can become extremely long and time-consuming. The usual approach to handle this complexity is to sample the errors set in order to obtain statistical results. Moreover, in order to make calculated effective errors number independent from the inputs, the number of effective errors is calculated for some various input sets and an average is taken out of them. Thus, some sample-lists are generated that each of them composes of the list of test vectors and each test vector consists of the inputs vector, corresponding golden outputs, target flip-flop to be injected into and injection clock cycle.

The proposed test bench is also used for calculating the effective errors number in BRAM and DiRAM.

**Figure 4:** Standard flip-flop and modified inject-able flip-flop
in table V. The injections for CM show number of essential bits which are injected in and failures show the number of critical bits. The calculated DVF for both designs are aligned with the Xilinx report [29] which estimates that no more than 0.11 fraction of configuration bits are critical, even when the whole FPGA is used. In addition and as expected this result is similar to results reported in [24] for ADDER implemented in Zynq-7000.

Configuration bits corresponding to routing elements are more sensible than the ones used in LUTs, as roughly 80% of failures are due to SEU in routing elements [19]. Given the results, DVF of streaming FFT application that uses more routing elements is higher than ADDER circuit. For the FF and DiRAM, the injections show the number of error injected and failures is equal to effective errors. From the table, the FF-wise DVF is more than CM-wise, a coherent outcome since FF stores the user data and that makes it more susceptible to error than CM bits. However, because the FFT architecture rounds the calculation in middle stages of its calculation, some of the errors are inevitably masked. The DS values in table VI are obtained by multiplying the $\epsilon_{des}$ with the DVF corresponding to each memory type and add them all. For instance the DS for FFT is calculated from sum of partial product of CM i.e 0.126e − 4, partial product of FF i.e 0.366e − 4 and partial product of DiRAM i.e. 0.071e − 4. The partial product results show that the contribution of flip-flops is 2.904 times more than CM in FFT DS, although $\epsilon_{des}$ from table IV related to flip-flops is roughly 2.4 times less than configuration bits. It is certified the importance of considering the DVF in our proposed model.

In addition, these results indicate that flip-flops used in FFT are more sensible to SEUs than CM which guide designers to use the budget of adding redundancy more for flip-flops than CM. For example, one of the common mitigation techniques is Triple Modular Redundancy (TMR) [30] in which three copies of a module are implemented and correct output is selected by majority voting. If TMR is applied for FFT, about 200% of plain design, redundancy is added to design which is significant and in some cases may violate system constraints like area. In order to balance between system constraints and required fault tolerance, TMR redundancy can be applied for most sensitive component which is flip-flops.

### ALGORITHM 2: Test-bench controller flow for calculating number of effective errors

1. **INPUT:** design HDL model, sample-lists
2. **OUTPUT:** critical bits number
3. Take design HDL model;
4. Generate synthesized net-list;
5. Generate fault inject-able net-list;
6. for $i = 0, i < sample \text{ – lists number}, i + +$
   - current input vector = read first inputs vector from sample-list($i$);
   - current golden result = read first golden outputs vector from sample-list($i$);
   - current flip-flop number = read first flip-flop number from sample-list($i$);
   - current injection cycle = read first injection cycle from sample-list($i$);
7. **for all sample errors do**
   - Run simulation until current injection cycle;
   - Inject fault into current flip-flop;
   - Resume simulation until it finishes;
   - if outputs $\neq$ golden outputs then
     - add the number of effective errors;
   - end
8. Reset Simulation;
9. current input vector = read next input vector;
10. current golden result = read next golden vector;
11. current flip-flop number = read next flip-flop number;
12. current injection cycle = read next injection cycle;
13. end

### TABLE V: Partial DVF for various designs

<table>
<thead>
<tr>
<th>Design</th>
<th>CM</th>
<th>FF</th>
<th>DiRAM</th>
</tr>
</thead>
<tbody>
<tr>
<td>FFT</td>
<td>0.0227</td>
<td>0.37</td>
<td>0.28</td>
</tr>
<tr>
<td>ADDER</td>
<td>0.0227</td>
<td>NA</td>
<td>NA</td>
</tr>
</tbody>
</table>

### TABLE VI: DS for various designs implemented in Zynq-7000 FPGA

<table>
<thead>
<tr>
<th>Design</th>
<th>DS</th>
</tr>
</thead>
<tbody>
<tr>
<td>FFT</td>
<td>0.563e-4</td>
</tr>
<tr>
<td>ADDER</td>
<td>0.162e-6</td>
</tr>
</tbody>
</table>

### TABLE VII: Evaluating the importance of considering DS in fault tolerance schema

<table>
<thead>
<tr>
<th>Design</th>
<th>Full TMR</th>
<th>DS-Wise TMR</th>
</tr>
</thead>
<tbody>
<tr>
<td>FFT</td>
<td>98%</td>
<td>80%</td>
</tr>
<tr>
<td>Design-1</td>
<td>95%</td>
<td>76%</td>
</tr>
</tbody>
</table>
The proposed DS can be also used for selectively adding redundancy to various component of a complex design. For instance, consider design-1 consists of an adder and FFT which both of them located on the critical path. Table VI indicates that FFT is roughly 500 times more sensitive than ADDER. Thus, the budget for adding redundancy is selectively used for FFT flip-flops to achieve fault tolerance requirement with less redundancy. Thus, as shown in table VII with adding 70% less redundancy compared to full TMR, the design sensitivity is reduced only 19% less than full TMR.

VI. CONCLUSION

Fault tolerance techniques for protecting digital system designs and specially FPGA-based systems against SEU mainly come with additional overheads. Given the extensive interest in using COTS devices in automotive and space missions, it is essential to use fault-tolerance techniques for processing components in a system vulnerable to SEU effects. In order to compute the SEU sensitivity of a design implemented in FPGA, a generic model was presented in this paper. This model empowers digital designers to tolerably compare the criticality of various parts of a system to make use of fault tolerance features. Toward this, all of the system parameters that assess the sensitivity of a design were taken into account in this paper. Those system parameters include device characterization, design architecture and environment specification as well as various elements of the design which are sensible to SEU, i.e. CM, BRAM, distributed RAM and flip-flops. Schemes to calculate the parameters of the proposed methods were also presented followed by fault modeling and fault injection techniques. Results were certified the importance of using proposed design sensitivity to select proper fault tolerance schema.

REFERENCES


