Soft Error Protection

by Michael Hillmann

In this article I want to give an overview of measures we can add to our safety-critical software to protect a system from sporadic malfunctions caused by soft errors.

By soft errors I mean errors in data values caused by a physical effect. The affected memory cell is not damaged, so a system restart recovers the error completely. The occurrence of such a soft error is called a Single Event Upset (SEU).

Soft errors in industry?

During the flight of the Cassini spacecraft, NASA reported a rate of 280 soft errors per day¹.

Well, is this relevant for us on the ground, inside an industrial plant?

The answer is: it depends. Soft errors occur everywhere at a specific rate. This soft error rate (SER) depends on many environmental and technology parameters; Tezzaron has a good collection of SERs².

Let's take a look at a long-running industrial device. This kind of system is started once and stays operational for many years. For an estimate, we assume a system with 2 MB of SRAM running for 20 years. With a mid-range SER of 1e-12 errors per bit-hour, we can calculate:

1e-12 errors/(bit·h) × 2 MB × 8 bit/byte × 175,200 h ≈ 2.9 errors
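
If we want to play with these numbers, the same estimate can be reproduced with a few lines of C (the 2 MB of SRAM, the 20 years and the SER of 1e-12 errors per bit-hour are just the assumptions from above):

#include <stdio.h>

int main(void)
{
    const double ser   = 1e-12;                    /* mid-range SER in errors per bit-hour */
    const double bits  = 2.0 * 1024 * 1024 * 8;    /* 2 MB of SRAM, expressed in bits      */
    const double hours = 20.0 * 365 * 24;          /* 20 years of operation = 175,200 h    */

    printf("expected soft errors: %.1f\n", ser * bits * hours);   /* prints ~2.9 */
    return 0;
}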

We realize: for long-running safety-critical systems this is not just physical theory. There is a real probability that we lose our safety functionality as a consequence of a soft error!

I think you will agree: this is not acceptable. We need to avoid, or at least detect, soft errors in long-running industrial devices.

Growing number of soft errors

A lot of research is ongoing to understand, detect or correct these soft errors in hardware. Following the Tezzaron document, the soft error sensitivity of components rises with:

  • Increased complexity raises the error rate
  • Higher-density (higher-capacity) chips are more likely to have errors
  • Lower-voltage devices are more likely to have errors
  • Higher speeds (lower latencies) contribute to higher error rates
  • Lower cell capacitance (less stored charge) causes higher error rates

Comparing this list with current trends in the embedded market, we should prepare our systems for a growing number of soft errors.

Hardware measures

I found some special radiation-hardened hardware components for aerospace and satellite systems. These devices are highly immune to SEUs. Unfortunately, they are not designed for high-volume industrial use; if we used them, the system cost would rise above acceptable limits.

For IT servers, a memory technology called Chipkill is available. This technology is based on redundancy.

Looking at the industrial market, hardware vendors have identified this challenge, too. In recent years, a rising number of microcontrollers have been released with built-in soft error mitigation technologies.

ECC protected memory

ECC memory devices use Error Correction Codes to store the data. The codes are classified as SEC-DED (single error correction, double error detection) and are based on Hamming codes. These codes are forward error correction (FEC) codes. The theory behind error correction codes is well explained in the book "Hacker's Delight", chapter 15.

When working with ECC-protected memory, only a few operations are required in software during start-up.

After power-on, we must initialize the memory to a known state. Otherwise we get ECC error notifications that are in fact caused by the random content of the memory cells. Afterwards, the hardware recovers soft errors transparently during normal operation. Most ECC memories provide recovery notifications, which I use to monitor the memory.
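
How this looks in code depends entirely on the selected device. The following sketch only illustrates the idea; the memory region, its size and the interrupt hook are hypothetical placeholders for the values and names found in the device reference manual:

#include <stdint.h>
#include <string.h>

/* Hypothetical ECC RAM region -- real address, size and interrupt
 * names must be taken from the device reference manual.           */
#define ECC_RAM_BASE   ((void *)0x20000000u)
#define ECC_RAM_SIZE   (64u * 1024u)                  /* bytes */

static volatile uint32_t ecc_correction_count;        /* health monitoring counter */

/* Bring the ECC RAM to a known state after power-on, so that the first
 * reads do not report errors caused by random cell content.            */
void ecc_ram_init(void)
{
    memset(ECC_RAM_BASE, 0, ECC_RAM_SIZE);            /* writing also updates the ECC bits */
}

/* Hook for the device-specific "single bit error corrected" interrupt;
 * it only counts the recovery notifications for later monitoring.      */
void ecc_corrected_irq_handler(void)
{
    ecc_correction_count++;
}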

With this excellent hardware support, we are well prepared for soft errors.

Software measures

Unfortunately, there are several system or hardware engineering reasons why we should prepare our software for systems without hardware measures. Examples: the best-fit microcontroller does not support ECC memory, or we need external memory without ECC.

In my projects, I always recommend performing a data storage analysis for soft error sensitivity. This helps us decide which measure to use for each individual data store. Our selection of a technique to avoid, recover from, or detect soft errors is based on:

  • amount of used memory (the more used memory, the higher the probability of a soft error)
  • lifetime of data (the longer the lifetime, the higher the probability of a soft error)
  • worst-case system behavior on a soft error (the more critical the impact, the higher the detection rate that must be achieved)

During this analysis, we can keep in mind that soft errors in unused memory cells are acceptable. With that approach we keep performance and availability as high as possible while still protecting the system against soft errors. We can classify the data storage into the following groups:

  • Constant data (e.g. op-code, configuration tables)
  • Enumeration data (e.g. system states and modes)
  • Dynamic data (e.g. process values)
  • Temporary data (e.g. local variables, continuously refreshed memory)

Variables that fall into the Temporary data class are most likely uncritical and can be left without any further protection measures. Therefore it is a good strategy to cyclically refresh data from independent data sources wherever possible.

data = data + 1;    /* no refresh - read/modify/write depends on data */
data = value * 2;   /* refresh ok - update with independent value */

Protecting constant data

To protect constant data in memory (like the application op-code or configuration tables), we calculate a hash value of the data and compare the result with an expected (and stored) value.

The strength of the protection depends on the number of bits used in the hash value. The CRC32 algorithm is widely used. Unfortunately this is not the best choice, because the collision rate depends on the width of the hash value. A hash collision is the effect that two different constant data memory images may result in the same hash value.

A very nicely prepared overview of the collision probabilities is given by Jeff Preshing³. We see that with a rising number of bytes in our constant data memory, the probability of a CRC32 collision rises, too. For real projects, this leads to constraints like "CRC32 may protect a maximum of 4096 bytes" (the exact number of bytes depends on the Safety Integrity Level).

(Figure: Collision Probabilities)

Therefore we will take a look at FNV hash⁴ values, which offer a scalable strength from 32 bit up to 1024 bit. The following pseudo-code shows the algorithm; it is a very fast and efficient way of calculating a hash value:

hash = offset_basis
for each octet_of_data to be hashed
    hash = hash * FNV_prime
    hash = hash xor octet_of_data
return hash

The offset_basis and the FNV_prime are fixed values that depend on the bit width of the hash value.
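
As an illustration, a 32-bit FNV-1 implementation in C that follows the pseudo-code above may look like this (0x811C9DC5 and 0x01000193 are the published 32-bit offset basis and FNV prime):

#include <stddef.h>
#include <stdint.h>

#define FNV32_OFFSET_BASIS  0x811C9DC5u   /* 32-bit offset_basis */
#define FNV32_PRIME         0x01000193u   /* 32-bit FNV_prime    */

uint32_t fnv1_32(const uint8_t *data, size_t len)
{
    uint32_t hash = FNV32_OFFSET_BASIS;

    for (size_t i = 0u; i < len; i++) {
        hash *= FNV32_PRIME;              /* multiply by the FNV prime ...  */
        hash ^= data[i];                  /* ... then xor in the next octet */
    }
    return hash;
}

At start-up (and, if the analysis requires it, cyclically) we compare the calculated hash of the constant data against the stored expected value and raise a safety exception on a mismatch.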

Protecting enumeration data

For storing enumeration data we avoid values like "1, 2, 3, etc.", because a single bit-flip can change one valid value into a different valid value. We have no way to detect this bit-flip.

Instead, we can select specific values which ensure that a single bit-flip always results in an invalid value. If at least two bit-flips are necessary to change a valid value into another valid value, we call the selection "values with a Hamming Distance (HD) of 2".

HD = number of bit positions in which two values differ
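
In C, this definition translates directly into a small helper function (a sketch; production code may use a dedicated popcount instruction instead):

#include <stdint.h>

/* Hamming distance between two byte values: the number of bit
 * positions in which they differ.                              */
uint8_t hamming_distance(uint8_t a, uint8_t b)
{
    uint8_t diff = (uint8_t)(a ^ b);   /* set bits mark the differing positions */
    uint8_t hd   = 0u;

    while (diff != 0u) {
        hd   += diff & 1u;             /* count the lowest bit  */
        diff >>= 1u;                   /* and move to the next  */
    }
    return hd;
}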

For real projects, we choose values with a Hamming Distance of 4. See the following hexadecimal byte values:

{ 00, 0f, 33, 3c,
  55, 5a, 66, 69,
  96, 99, a5, aa,
  c3, cc, f0, ff
}

In theory we can correct single bit-flips by searching for the valid value with the lowest Hamming Distance to the corrupted one. This value is most likely the correct value.

Look at 0x3c and assume bit 1 flips: we get 0x3e. First, this is an invalid value. Second, we can compute the Hamming Distance (in parentheses below) to all valid data values:

{ 00(5), 0f(3), 33(3), 3c(1),
  55(5), 5a(3), 66(3), 69(5),
  96(3), 99(5), a5(5), aa(3),
  c3(7), cc(5), f0(5), ff(3)
}

If only one bit flipped to produce 0x3e, then the value with the lowest HD, 0x3c, is the correct value. In reality, we do not know how many bits have flipped. This is the reason for my qualification "most likely".

In safety-critical software we are not satisfied with "most likely" correct values. For this reason we usually raise a safety exception, which shuts down or restarts the device.
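
A minimal sketch of such an enumeration check in C could look like the following; the state names and the safety_exception() handler are placeholders for whatever the project defines:

#include <stdint.h>

/* System states encoded with values from the Hamming-Distance-4 table
 * above (only a subset is used here for illustration).                */
typedef enum {
    STATE_INIT     = 0x0Fu,
    STATE_RUNNING  = 0x33u,
    STATE_STOPPING = 0x3Cu,
    STATE_ERROR    = 0x55u
} system_state_t;

extern void safety_exception(void);    /* shuts down or restarts the device */

/* Accept only the defined encodings; since the valid values have a
 * mutual HD of 4, up to three bit-flips always land outside the set. */
void check_state(uint8_t state)
{
    switch (state) {
        case STATE_INIT:
        case STATE_RUNNING:
        case STATE_STOPPING:
        case STATE_ERROR:
            break;                     /* valid encoding      */
        default:
            safety_exception();        /* soft error detected */
            break;
    }
}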

Protecting dynamic data

For dynamic data, any value is a valid value. We need to add some redundancy to detect a change (e.g. a bit flip) in the value. A simple way of doing this is to store a mirrored copy of the variable in a different memory area. The basic pseudo-code for writing a value may look like:

variable = value;
variable_mirror = NOT(value);

We can check now the dynamically changing variable with the introduced redundancy at any time:

if variable != NOT(variable_mirror)
    soft error detected
return variable

Note: This simple pseudo-code grows in complexity (and required run-time) when using multi-threaded environments, DMA transfers, data caches or multi-processor devices.
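
For a single-threaded system without the complications mentioned in the note above, a sketch of this pattern in C could look like this (safety_exception() is again a project-specific placeholder):

#include <stdint.h>

extern void safety_exception(void);    /* reaction on a detected soft error */

/* Value plus bit-inverted mirror; ideally both fields are placed in
 * different memory areas, which requires linker/section support.    */
typedef struct {
    uint32_t value;
    uint32_t mirror;                   /* always holds ~value */
} protected_u32_t;

void protected_write(protected_u32_t *p, uint32_t value)
{
    p->value  = value;
    p->mirror = ~value;
}

uint32_t protected_read(const protected_u32_t *p)
{
    if (p->value != (uint32_t)~(p->mirror)) {
        safety_exception();            /* soft error detected */
    }
    return p->value;
}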

Conclusion

In this article we discussed the need for soft error protection measures in long-running embedded devices.

An overview of widely used hardware and software measures covered:

  • ECC protection for memory devices
  • Hash values for constant data memory
  • Hamming distance for enumeration values
  • Redundancy for dynamic data memory

We see that all measures can be classified into one of the following classes: hash values, Hamming codes and redundancy.

Finally, my experience with soft errors has changed over the years. At the beginning of my career, the consideration of soft errors was focused on space projects. Today, 20 years later, we need to consider soft errors in the software concept of every safety-critical industrial project.

My expectation for the future is a rising number of projects where soft errors must be considered, even for non-safety-critical devices. Do you agree?


  1. In-Flight Observations of Multiple-Bit Upset in DRAM; https://trs.jpl.nasa.gov/handle/2014/15831/

  2. Soft Errors in Electronic Memory; https://tezzaron.com/media/soft_errors_1_1_secure.pdf

  3. Hash Collision Probabilities; http://preshing.com/20110504/hash-collision-probabilities/

  4. FNV Hash Algorithm; http://www.isthe.com/chongo/tech/comp/fnv/
