Embedded Office Article

How to realize a Soft Error Protection

In this article, I want to give an overview of the measures we can add to our safety-critical software to protect our system from sporadic malfunctions caused by soft errors.

By soft errors, I mean errors in data values that are caused by a physical effect. The affected memory cell is not damaged, so a system restart recovers from this error completely. We call the occurrence of such a soft error a Single Event Upset (SEU).

Soft errors in the industry?

During the flight of the Cassini spacecraft, NASA reported a rate of 280 soft errors per day[^1].

Well, is this relevant for us on the ground within an industrial plant?

The answer is: it depends. Soft errors are present everywhere with a specific rate. This soft error rate (SER) depends on many environmental and technology parameters; Tezzaron has a good collection of SERs[^2].

Let's take a look at a long-running industrial device. This kind of system is started once and stays operational for many years. For an estimate, we assume a system with 2 MB of SRAM, running for 20 years (about 175,200 hours). With a mid-range SER of 1e-12 errors per bit-hour, we can calculate:

1e-12 errors/(bit*h) * 2 MByte * 8 bit/byte * 175,200 h = 2.9 errors
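
The estimate above can be reproduced with a few lines of C. This is only a sketch for playing with the parameters. The function names are made up for illustration, and I assume that 2 MByte means 2 * 1024 * 1024 bytes, which matches the rounded result of 2.9 errors.

```c
/* Expected number of soft errors = SER * number of bits * operating hours. */
double expected_soft_errors(double ser_per_bit_hour,
                            double memory_bytes,
                            double hours)
{
    return ser_per_bit_hour * memory_bytes * 8.0 * hours;
}

/* The values used in the estimate above (hypothetical helper). */
double article_estimate(void)
{
    const double ser   = 1e-12;                  /* errors per bit-hour      */
    const double bytes = 2.0 * 1024.0 * 1024.0;  /* assumption: 2 MiB SRAM   */
    const double hours = 20.0 * 365.0 * 24.0;    /* 20 years = 175,200 hours */

    return expected_soft_errors(ser, bytes, hours);
}
```

Changing the memory size or the operating time shows quickly how the expected number of soft errors scales linearly with both.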

We realize: for long-running safety-critical systems, this is not only a physical theory. There is a real probability that we lose our safety functionality as a consequence of a soft error!

I think you will agree: this is not acceptable. We need to avoid or at least detect soft errors in long-running industrial devices.

Growing number of soft errors

A lot of research is ongoing to understand, detect, or correct these soft errors in hardware. According to the Tezzaron document, the sensitivity of components to soft errors will rise with the following trends:

  • Increased complexity raises the error rate
  • Higher-density (higher-capacity) chips are more likely to have errors
  • Lower-voltage devices are more likely to have errors
  • Higher speeds (lower latencies) contribute to higher error rates
  • Lower cell capacitance (less stored charge) causes higher error rates

Comparing this list with current trends in the embedded market, we should prepare our systems for a growing number of soft errors.

Hardware measures

I found some radiation-hardened hardware components for aerospace and satellite systems. These devices are highly SEU immune. Unfortunately, the target market of these components is not the high volume of industrial use. If we use these devices, the system cost will rise above acceptable limits.

Designed for IT servers, a memory technology called Chipkill is available. This technology is based on redundancy.

Looking at the industrial market, the hardware vendors have identified this challenge, too. In recent years, a rising number of microcontrollers have been released with built-in soft error mitigation technologies.

ECC protected memory

ECC memory devices use Error Correction Codes for storing the data. We classify these codes as SEC-DED, which means single error correction, double error detection. They are based on Hamming Codes, which are forward error correction (FEC) codes. For details, I recommend reading chapter 15 of the book Hacker's Delight. This chapter explains the theory behind error correction codes very well.
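
To make the SEC-DED principle more tangible, here is a minimal software sketch for 4 data bits: a Hamming(7,4) code extended by one overall parity bit. It only illustrates the idea and is not how any specific ECC memory is implemented. Real devices protect wider words (for example 32 or 64 bits) in hardware, and all names in the sketch are hypothetical.

```c
#include <stdint.h>

typedef enum { SECDED_OK, SECDED_CORRECTED, SECDED_UNCORRECTABLE } secded_status_t;

/* Parity of all 8 bits: 1 if the number of set bits is odd. */
static unsigned parity8(uint8_t v)
{
    v ^= (uint8_t)(v >> 4);
    v ^= (uint8_t)(v >> 2);
    v ^= (uint8_t)(v >> 1);
    return v & 1u;
}

/* XOR of the positions (1..7) of all set bits; 0 for a valid Hamming word. */
static unsigned syndrome(uint8_t cw)
{
    unsigned s = 0u;
    for (unsigned pos = 1u; pos <= 7u; pos++) {
        if (cw & (1u << pos)) {
            s ^= pos;
        }
    }
    return s;
}

/* Encode 4 data bits: data at positions 3,5,6,7; parity at 1,2,4; bit 0 = overall parity. */
uint8_t secded_encode(uint8_t nibble)
{
    uint8_t cw = 0u;
    if (nibble & 1u) cw |= (uint8_t)(1u << 3);
    if (nibble & 2u) cw |= (uint8_t)(1u << 5);
    if (nibble & 4u) cw |= (uint8_t)(1u << 6);
    if (nibble & 8u) cw |= (uint8_t)(1u << 7);

    unsigned s = syndrome(cw);            /* set parity bits so the syndrome becomes 0 */
    if (s & 1u) cw |= (uint8_t)(1u << 1);
    if (s & 2u) cw |= (uint8_t)(1u << 2);
    if (s & 4u) cw |= (uint8_t)(1u << 4);

    cw |= (uint8_t)parity8(cw);           /* make the overall parity even */
    return cw;
}

/* Decode a code word: corrects one flipped bit, detects two flipped bits. */
secded_status_t secded_decode(uint8_t cw, uint8_t *nibble)
{
    unsigned s = syndrome(cw);
    secded_status_t status = SECDED_OK;

    if (parity8(cw) != 0u) {
        /* Odd number of flips: treat as single error at position s (0 = parity bit). */
        cw ^= (uint8_t)(1u << s);
        status = SECDED_CORRECTED;
    } else if (s != 0u) {
        /* Even parity but non-zero syndrome: two bits flipped, not correctable. */
        return SECDED_UNCORRECTABLE;
    }

    *nibble = (uint8_t)(((cw >> 3) & 1u) | (((cw >> 5) & 1u) << 1) |
                        (((cw >> 6) & 1u) << 2) | (((cw >> 7) & 1u) << 3));
    return status;
}
```

Note that three or more flipped bits can be misinterpreted as a single error. This limit applies to every SEC-DED code, no matter whether it runs in hardware or software.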

When working with ECC protected memory, there are only a few operations required in software during startup.

After power on, we must initialize the memory to a known state. Otherwise, we get notifications of ECC errors, which are caused by the random content of the uninitialized memory cells. Afterward, the hardware recovers soft errors transparently during regular operation. Most ECC memories provide recovery notifications, which I use to monitor the memory.
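
The initialization itself can be as simple as writing every word of the protected memory once. The following sketch assumes a 32-bit ECC word width and takes the start address and size as parameters. In a real project these values come from the linker script, and the required access width and the right place in the startup code are device-specific, so please check the reference manual of your microcontroller. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Write every word once, so that data bits and ECC check bits become consistent. */
void ecc_ram_init(volatile uint32_t *start, size_t words)
{
    for (size_t i = 0u; i < words; i++) {
        start[i] = 0u;   /* full-width write (assumption: 32-bit ECC word) */
    }
}
```

On many devices this loop must run before the C runtime uses the stack or the data sections in that memory, which is why it often lives in the assembly startup code.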

With this excellent hardware support, we are well prepared for soft errors.

Software measures

Unfortunately, there are several system or hardware engineering reasons why we should prepare our software for systems without hardware measures. Examples: the best-fit microcontroller does not support ECC memory, or we need external memory without ECC.

In my projects, I always recommend performing a data storage analysis for soft error sensitivity. This analysis helps us to decide which measure we use for each data storage. We base our decision for selecting a technique to avoid, recover, or detect the soft errors on:

  • amount of used memory (the more used memory, the higher the probability of a soft error)
  • the lifetime of data (the longer the lifetime, the higher the likelihood of a soft error)
  • worst case system behavior on a soft error (the more critical the impact, the higher detection rates we must achieve)

During this analysis, we can keep in mind that soft errors in unused memory cells are acceptable. With that approach, we keep performance and availability as high as possible while protecting our system against soft errors. We can classify the data storage into groups:

  • Constant data (e.g., op-code, configuration tables)
  • Enumeration data (e.g., system states and modes)
  • Dynamic data (e.g., process values)
  • Temporary data (e.g., local variables, continuously refreshed memory)

The variables that we can classify as temporary data are most likely uncritical, so we can leave them without any further protection measures. Therefore, it is an excellent strategy to cyclically refresh data from independent data sources.

data = data + 1;    /* no refresh - read/modify/write depends on data */
data = value * 2;   /* refresh ok - update with independent value */

Protecting constant data

To protect constants in memory (like the application op-code or configuration tables), we calculate a hash value of the data and compare the result with an expected (and stored) value.

The strength of protection depends on the number of bits used in the hash value. The CRC32 algorithm is widely used. Unfortunately, it is not always the best choice, because the collision rate depends on the width of the hash value. A hash collision is the effect where two different constant data memory images result in the same hash value.

I found an excellently prepared overview of the collision probabilities by Jeff Preshing[^3]. We see that with a rising number of bytes in our constant data memory, the likelihood of a CRC32 collision rises, too. For real projects, this leads to constraints like "CRC32 can protect a maximum of 4096 bytes" (the number of bytes depends on the Safety Integrity Level).

[Figure: Collision Probabilities]

Therefore, we will take a look at FNV hash[^4] values with a scalable strength from 32 bits up to 1024 bits. The following pseudo-code shows the algorithm. It is a fast and efficient way of calculating a hash value:

hash = offset_basis
for each octet_of_data we want to protect:
    hash = hash * FNV_prime
    hash = hash xor octet_of_data
return hash

The offset_basis and the FNV_prime are fixed values, dependent on the bit width of the hash value.
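
As a concrete instance of the pseudo-code, the following sketch implements the 32-bit variant in C (multiply first, then xor, which is known as FNV-1). The offset basis and the prime are the published 32-bit FNV constants. The function names and the check function are my own illustration of how the stored hash value could be compared during a memory test.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FNV32_OFFSET_BASIS 2166136261u   /* 0x811C9DC5 */
#define FNV32_PRIME        16777619u     /* 0x01000193 */

/* FNV-1 hash with 32 bits, following the pseudo-code step by step. */
uint32_t fnv1_32(const uint8_t *data, size_t len)
{
    uint32_t hash = FNV32_OFFSET_BASIS;

    for (size_t i = 0u; i < len; i++) {
        hash *= FNV32_PRIME;   /* unsigned arithmetic wraps modulo 2^32 */
        hash ^= data[i];
    }
    return hash;
}

/* Compare the calculated hash with the expected (stored) value. */
bool const_data_is_intact(const uint8_t *data, size_t len, uint32_t expected)
{
    return fnv1_32(data, len) == expected;
}
```

In a project, the expected value is typically calculated at build time and stored next to the constant data. The check then runs during startup and cyclically in the background.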

Protecting enumeration data

For storing enumeration data, we avoid values like "1, 2, 3, etc.", because a single bit-flip can change a valid value into a different valid value. We have no way to detect such a bit-flip.

We can select specific values that ensure that a single bit-flip results in an invalid value. If at least two bit-flips are necessary to change a valid value into another valid value, we call the selection "values with a Hamming Distance (HD) of 2".

HD = number of different bits
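
A small C sketch shows how the Hamming Distance of two byte values can be calculated: we xor both values and count the set bits of the result. The function name is chosen for illustration only.

```c
#include <stdint.h>

/* Hamming Distance of two bytes = number of bit positions in which they differ. */
unsigned hamming_distance(uint8_t a, uint8_t b)
{
    uint8_t  diff  = (uint8_t)(a ^ b);   /* a set bit marks a differing position */
    unsigned count = 0u;

    while (diff != 0u) {
        diff &= (uint8_t)(diff - 1u);    /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

For example, the values 0x0f and 0x33 differ in four bit positions, so their Hamming Distance is 4.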

For real projects, we choose values with a Hamming Distance of 4. See the following hexadecimal byte values:

{ 00, 0f, 33, 3c,
  55, 5a, 66, 69,
  96, 99, a5, aa,
  c3, cc, f0, ff }

In theory, we can correct single bit-flips by searching for the value with the lowest Hamming Distance. This value is most likely the correct value.

Look at 0x3c. Assume bit 1 flips, and we get 0x3e. First, this is an invalid value. Second, we can check the Hamming Distance (in brackets below) to all valid data values:

{ 00(5), 0f(3), 33(3), 3c(1),
  55(5), 5a(3), 66(3), 69(5),
  96(3), 99(5), a5(5), aa(3),
  c3(7), cc(5), f0(5), ff(3) }

If only one bit flipped to produce 0x3e, the value with the lowest HD, 0x3c, is the correct value. In reality, we do not know how many bits have flipped. This risk is the reason why I call the result only "most likely" correct.

In safety-critical software, we are not satisfied with most likely correct values. For this reason, we usually raise a safety exception, which shuts down or restarts the device.
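
The detection part can be sketched in C as follows. The state names are hypothetical, and their values are taken from the table of byte values with a Hamming Distance of 4 shown earlier. The function only reports whether a stored value is valid. Raising the safety exception is left to the caller, because this reaction is project-specific.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical system states, encoded with a Hamming Distance of 4. */
typedef enum {
    STATE_INIT  = 0x0F,
    STATE_RUN   = 0x33,
    STATE_STOP  = 0x3C,
    STATE_FAULT = 0x55
} sys_state_t;

static const uint8_t valid_states[] = {
    STATE_INIT, STATE_RUN, STATE_STOP, STATE_FAULT
};

/* Returns true only if the stored byte is exactly one of the valid states. */
bool state_is_valid(uint8_t raw)
{
    for (size_t i = 0u; i < sizeof(valid_states) / sizeof(valid_states[0]); i++) {
        if (raw == valid_states[i]) {
            return true;
        }
    }
    return false;   /* any other value indicates a soft error */
}
```

With these values, up to three bit-flips in a stored state always lead to an invalid value, so the check detects them. The software checks the state before every use and enters the safe state when the check fails.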

Protecting dynamic data

For dynamic data, any value is a valid value. We need to add some redundancy to detect a change (e.g., a bit-flip) in the value. A simple way of doing this is to store a mirrored copy of the variable in a different memory area. The principle pseudo-code for writing a value may look like:

variable = value;
variable_mirror = NOT(value);

We can now check the dynamically changing variable against the introduced redundancy at any time:

if variable != NOT(variable_mirror)
    soft error detected
return variable
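
Translated to C, the two pseudo-code fragments could look like the following sketch for a 32-bit variable. The function names are hypothetical. The variable and its mirror are passed as separate pointers, so the caller can place them in different memory areas (for example with linker sections). The volatile qualifier is my assumption to keep the compiler from optimizing the redundant accesses away.

```c
#include <stdbool.h>
#include <stdint.h>

/* Store the value and its bitwise inverse as redundancy. */
void mirrored_write(volatile uint32_t *variable,
                    volatile uint32_t *mirror,
                    uint32_t value)
{
    *variable = value;
    *mirror   = ~value;
}

/* Returns false if the variable and its mirror no longer match (soft error detected). */
bool mirrored_read(const volatile uint32_t *variable,
                   const volatile uint32_t *mirror,
                   uint32_t *out)
{
    uint32_t value    = *variable;
    uint32_t inverted = *mirror;

    if (value != (uint32_t)~inverted) {
        return false;
    }
    *out = value;
    return true;
}
```

Storing the inverse instead of a plain copy has the advantage that a fault which forces both memory cells to the same pattern (for example all zeros) is detected as well.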

Note: This simple pseudo-code grows in complexity (and required run-time) when using multi-threaded environments, DMA transfers, data caches or multi-processor devices.


In this article, we discussed the need for soft error protection measures for long-running embedded devices.

An overview of widely used hardware and software measures covers:

  • ECC protection for memory devices
  • Hash values for constant data memory
  • Hamming distance for enumeration values
  • Redundancy for dynamic data memory

We see that we can classify all measures into one of the following classes: Hash Values, Hamming Codes, and Redundancy.

Finally, my experience with soft errors has changed over the years. At the beginning of my career, the consideration of soft errors was focused on space projects. Today, 20 years later, we need to consider soft errors in the software concept of every safety-critical industrial project.

[^1]: In-Flight Observations of Multiple-Bit Upset in DRAM; https://trs.jpl.nasa.gov/handle/2014/15831/

[^2]: Soft Errors in Electronic Memory; https://tezzaron.com/media/soft_errors_1_1_secure.pdf

[^3]: Hash Collision Probabilities; http://preshing.com/20110504/hash-collision-probabilities/

[^4]: FNV Hash Algorithm; http://www.isthe.com/chongo/tech/comp/fnv/

© Copyright 2019. Embedded Office GmbH & Co. KG. All rights reserved. (Version: 3c58d75)