Embedded Office Blog

Soft Error Protection

by Michael Hillmann

In this article I want to give an overview of measures we can add to our safety-critical software to protect a system from sporadic malfunctions caused by soft errors.

By soft errors I mean errors in data values caused by a physical effect. The affected memory cell is not damaged, so a system restart recovers from the error completely. The occurrence of such a soft error is called a Single Event Upset (SEU).

Soft errors in industry?

During the flight of the spacecraft Cassini, NASA reported a rate of 280 soft errors per day [1].

Well, is this relevant for us on ground within an industrial plant?

The answer is: it depends. Soft errors occur everywhere at a specific rate. This soft error rate (SER) depends on many environmental and technology parameters; Tezzaron has a good collection of SERs [2].

Let's take a look at a long-running industrial device. Such a system is started once and stays operational for many years. For an estimate, we assume a system with 2 MB of SRAM running for 20 years. With a mid-range SER of 1e-12 errors per bit-hour, we can calculate:

1e-12 errors/bit-h * 2 Mbyte * 8 bit/byte * 175,200 h ≈ 2.9 errors
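As a quick cross-check, the same estimate can be sketched in C (the function and its name are just an illustration, not from the article or any library):

```c
/* Expected number of soft errors over the device lifetime:
 * SER (errors per bit-hour) * number of bits * operating hours */
double expected_soft_errors(double ser_per_bit_hour, double bytes, double hours)
{
    return ser_per_bit_hour * bytes * 8.0 * hours;
}
```

With `expected_soft_errors(1e-12, 2.0 * 1024 * 1024, 20.0 * 8760.0)` we get roughly 2.9 expected errors, matching the formula above.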

We realize: for long-running safety-critical systems this is not just physical theory. There is a real probability that we lose our safety functionality as a consequence of a soft error!

I think you will agree: this is not acceptable. We need to avoid, or at least detect, soft errors in long-running industrial devices.

Growing number of soft errors

A lot of research is ongoing to understand, detect and correct these soft errors in hardware. According to the Tezzaron document, several technology trends increase the sensitivity of components:

  • Increased complexity raises the error rate
  • Higher-density (higher-capacity) chips are more likely to have errors
  • Lower-voltage devices are more likely to have errors
  • Higher speeds (lower latencies) contribute to higher error rates
  • Lower cell capacitance (less stored charge) causes higher error rates

Comparing this list with current trends in the embedded market, we should prepare our systems for a growing number of soft errors.

Hardware measures

There are special radiation-hardened hardware components for aerospace and satellite systems. These devices are highly immune to SEUs. Unfortunately, they are not designed for high-volume industrial use; with these devices, system cost would rise above acceptable limits.

For IT servers, a memory technology called Chipkill is available [3]. This technology is based on redundancy.

Looking at the industrial market, hardware vendors have identified this challenge, too. In recent years, a rising number of microcontrollers have been released with built-in soft error mitigation technologies.

ECC protected memory

ECC memory devices use Error Correction Codes for storing the data. The codes are classified as SEC-DED (single error correction, double error detection) and are based on Hamming codes. These codes are forward error correction (FEC) codes. The theory behind error correction codes is well explained in the document [4].

When working with ECC-protected memory, only a few software operations are required during start-up.

After power-on, we must initialize the memory to a known state. Otherwise we get ECC error notifications for what is in fact random content of the memory cells. Afterwards, the hardware recovers soft errors transparently during normal operation. Most ECC memories provide recovery notifications, which I use to monitor the memory.

With this excellent hardware support, we are well prepared for soft errors.

Software measures

Unfortunately, there are several system or hardware engineering reasons why we should prepare our software for systems without hardware measures. Examples: the best-fit microcontroller doesn't support ECC memory, or we need external memory without ECC.

In my projects, I always recommend performing a data storage analysis for soft error sensitivity. This helps us decide which measure to use for each individual data storage. The decision for a technique to avoid, recover or detect soft errors is based on:

  • amount of used memory (the more used memory, the higher the probability of a soft error)
  • lifetime of data (the longer the lifetime, the higher the probability of a soft error)
  • worst case system behavior on a soft error (the more critical the impact, the higher detection rates must be achieved)

During this analysis, we can keep in mind that soft errors in unused memory cells are acceptable. With that approach we keep performance and availability as high as possible while protecting the system against soft errors. We can classify the data storage into groups:

  • Constant data (e.g. op-code, configuration tables)
  • Enumeration data (e.g. system states and modes)
  • Dynamic data (e.g. process values)
  • Temporary data (e.g. local variables, continuously refreshed memory)

Variables we can classify as temporary data are most likely uncritical and can be left without any further protection measures. Therefore it is a good strategy to cyclically refresh data from independent data sources:

data = data + 1;    /* no refresh - read/modify/write depends on data */
data = value * 2;   /* refresh ok - update with independent value */

Protecting constant data

To protect constant data in memory (like the application op-code or configuration tables), we calculate a hash value of the data and check the result against an expected (and stored) value.

The strength of protection depends on the number of bits used in the hash value. Widely used is the CRC32 algorithm. Unfortunately this is not the best choice, because the collision rate depends on the width of the hash value. A hash collision is the effect that two different constant data memory images result in the same hash value.

A very nicely prepared overview of the collision probabilities is given by Jeff Preshing [5]. We see that as the number of bytes in our constant data memory rises, the probability of a CRC32 collision rises, too. For real projects, this leads to constraints like "CRC32 may protect a maximum of 4096 bytes" (the exact number of bytes depends on the Safety Integrity Level).

(Figure: Collision Probabilities)

Therefore we will take a look at FNV hash [6] values with a scalable strength from 32 bit up to 1024 bit. The following pseudo code shows the algorithm. It is a very fast and efficient way to calculate a hash value:

hash = offset_basis
for each octet_of_data to be hashed
    hash = hash * FNV_prime
    hash = hash xor octet_of_data
return hash

The offset_basis and the FNV_prime are fixed values, dependent on the bit width of the hash value.
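As a sketch (not code from the article itself), the 32-bit variant of this pseudo code could look like the following in C. The constants 2166136261 and 16777619 are the published 32-bit FNV offset basis and prime:

```c
#include <stdint.h>
#include <stddef.h>

#define FNV32_OFFSET_BASIS  2166136261u   /* 32-bit FNV offset_basis */
#define FNV32_PRIME         16777619u     /* 32-bit FNV_prime        */

/* FNV-1: multiply first, then xor each octet, as in the pseudo code above */
uint32_t fnv1_32(const uint8_t *data, size_t len)
{
    uint32_t hash = FNV32_OFFSET_BASIS;
    for (size_t i = 0; i < len; i++) {
        hash *= FNV32_PRIME;
        hash ^= data[i];
    }
    return hash;
}
```

At start-up (and optionally cyclically during runtime), we would run this function over the constant data region and compare the result with the stored expected hash; a mismatch raises the safety exception.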

Protecting enumeration data

For storing enumeration data we avoid consecutive values like "1, 2, 3, etc.", because a single bit-flip can change one valid value into a different valid value. We would have no way to detect this bit-flip.

Instead, we can select specific values which ensure that a single bit-flip always results in an invalid value. If at least two bit-flips are necessary to change one valid value into another valid value, we call the selection "values with a Hamming Distance (HD) of 2".

HD = number of different bits

For real projects, we choose values with a Hamming Distance of 4, for example the following hexadecimal byte values:

{ 00, 0f, 33, 3c,
  55, 5a, 66, 69,
  96, 99, a5, aa,
  c3, cc, f0, ff }

In theory we can correct single bit-flips by searching for the valid value with the lowest Hamming Distance to the corrupted one. This value is most likely the correct value.

Look at 0x3c. Assume bit 1 is flipped - we get 0x3e. First, this is an invalid value. Second, checking the Hamming Distance (in brackets below) to all valid data values, we get:

{ 00(5), 0f(3), 33(3), 3c(1),
  55(5), 5a(3), 66(3), 69(5),
  96(3), 99(5), a5(5), aa(3),
  c3(7), cc(5), f0(5), ff(3) }

If only 1 bit has flipped to produce 0x3e, then the value with the lowest HD - 0x3c - is the correct value. In reality, we don't know how many bits have flipped. This is the reason for my wording "most likely".

In safety critical software we are not satisfied with most likely correct values. For this reason we usually raise a safety exception, which shuts down or restarts the device.
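The detection part can be sketched in C like this (a minimal illustration; the function names are mine, not from the article):

```c
#include <stdint.h>
#include <stdbool.h>

/* The 16 valid enumeration codes with pairwise Hamming Distance of 4 */
static const uint8_t valid_codes[16] = {
    0x00, 0x0f, 0x33, 0x3c, 0x55, 0x5a, 0x66, 0x69,
    0x96, 0x99, 0xa5, 0xaa, 0xc3, 0xcc, 0xf0, 0xff
};

/* Hamming Distance = number of differing bits between two byte values */
int hamming_distance(uint8_t a, uint8_t b)
{
    uint8_t diff = a ^ b;
    int hd = 0;
    while (diff != 0) {
        hd  += diff & 1u;
        diff >>= 1;
    }
    return hd;
}

/* true if the stored value is one of the valid enumeration codes */
bool is_valid_code(uint8_t value)
{
    for (int i = 0; i < 16; i++) {
        if (valid_codes[i] == value) {
            return true;
        }
    }
    return false;
}
```

When `is_valid_code()` returns false (e.g. for the corrupted value 0x3e), a safety-critical system would raise the safety exception rather than silently correct the value.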

Protecting dynamic data

For dynamic data, any value is a valid value. We need to add some redundancy to detect a change (e.g. a bit flip) in the value. A simple way of doing this is to store a mirrored copy of the variable in a different memory area. The principle pseudo-code for writing a value may look like:

variable = value;
variable_mirror = NOT(value);

We can now check the dynamically changing variable against the introduced redundancy at any time:

if variable != NOT(variable_mirror)
    soft error detected
return variable

Note: This simple pseudo-code grows in complexity (and required run-time) when using multi-threaded environments, DMA transfers, data caches or multi processor devices.
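For the simple single-threaded case, the mirroring idea might look like this in C (the type and function names are illustrative only):

```c
#include <stdint.h>
#include <stdbool.h>

/* Dynamic value stored twice: plain, and bit-inverted in a second location.
 * In a real system, value and mirror should live in different memory areas. */
typedef struct {
    uint32_t value;
    uint32_t mirror;    /* always holds ~value */
} protected_u32;

void protected_write(protected_u32 *p, uint32_t value)
{
    p->value  = value;
    p->mirror = ~value;
}

/* Returns false if value and mirror disagree (soft error detected);
 * on success, stores the checked value in *out. */
bool protected_read(const protected_u32 *p, uint32_t *out)
{
    if (p->value != (uint32_t)~p->mirror) {
        return false;   /* soft error detected */
    }
    *out = p->value;
    return true;
}
```

A caller would treat a false return like the invalid enumeration case: raise the safety exception instead of using the value.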


Conclusion

In this article we discussed the need for soft error protection measures in long-running embedded devices.

The overview of widely used hardware and software measures covered:

  • ECC protection for memory devices
  • Hash values for constant data memory
  • Hamming distance for enumeration values
  • Redundancy for dynamic data memory

We see that all measures can be classified into one of the following classes: hash values, Hamming codes and redundancy.

Finally, my view of soft errors has changed over the years. At the beginning of my career, consideration of soft errors was focused on space projects. Today, 20 years later, we need to consider soft errors in the software concept of every safety-critical industrial project.

My expectation for the future is a rising number of projects where soft errors must be considered - even for non-safety-critical devices. Do you agree?

