Overview of Soft Error Protection
By soft errors, we mean errors in data values caused by a physical effect. The affected memory cell is not damaged, so a system restart recovers from the error completely. We call the occurrence of such a soft error a Single-Event Upset (SEU).
Soft errors in the industry?
During the flight of the Cassini spacecraft, NASA reported a rate of 280 soft errors per day.
Is this relevant for us on the ground within an industrial plant?
The answer is: it depends. Soft errors are present everywhere at a specific rate. This soft error rate (SER) depends on many environmental and technological parameters. Tezzaron has a good collection of SERs.
Let's take a look at a long-running industrial device. This kind of system stays operational for many years. For an estimate, we assume a system with 2 MB of SRAM running for 20 years (175,200 hours). With a mid-range SER of 1e-12 errors per bit-hour, we can calculate:
1e-12 errors/(bit*h) * 2 MB * 8 bit/byte * 175,200 h ≈ 2.9 errors
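The estimate above can be reproduced with a few lines of C. This is only a sketch: the function name and parameters are illustrative, it assumes a constant SER over the whole lifetime, and it reads 2 MB as 2 * 1024 * 1024 bytes.

```c
#include <stdint.h>

/* Expected number of soft errors over the operating time.
 * ser_per_bit_hour: soft error rate in errors per bit-hour (assumed constant)
 * mem_bytes:        size of the memory under consideration in bytes
 * hours:            operating time in hours */
static double expected_soft_errors(double ser_per_bit_hour,
                                   uint64_t mem_bytes,
                                   double hours)
{
    double bits = (double)mem_bytes * 8.0;
    return ser_per_bit_hour * bits * hours;
}
```

Calling expected_soft_errors(1e-12, 2u * 1024u * 1024u, 20.0 * 365.0 * 24.0) gives a value of roughly 2.9, matching the hand calculation.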
With about three expected soft errors over the device lifetime, this is more than just a physical theory for long-running safety-critical systems. There is a real probability that we will lose our safety functionality due to a soft error!
You will agree: this is not acceptable. We must avoid or at least detect soft errors in long-running industrial devices.
A growing number of soft errors
Much research is ongoing to understand, detect, or correct these soft errors in hardware. According to the Tezzaron document, the following trends raise the error rate of components:
Increased complexity raises the error rate
Higher-density (higher-capacity) chips are more likely to have errors
Lower-voltage devices are more likely to have errors
Higher speeds (lower latencies) contribute to higher error rates
Lower cell capacitance (less stored charge) causes higher error rates
Comparing this list with current trends in the embedded market, we should prepare our systems for a growing number of soft errors.
Hardware measures
Radiation-hardened hardware components exist for aerospace and satellite systems. These devices are highly immune to SEUs. Unfortunately, they are not designed for high-volume industrial use. If we use these devices, the system cost will rise above acceptable limits.
A memory technology designed for IT servers called chipkill is available. This technology is based on redundancy.
Looking at the industrial market, the hardware vendors have recognized this challenge, too. Several microcontrollers with built-in soft error mitigation technologies have been released in the last few years.
ECC protected memory
ECC memory devices use Error Correction Codes to store the data. The codes are classified as SEC-DED. This abbreviation means single error correction, double error detection, and the codes are based on Hamming Codes. These codes are forward error correction (FEC) codes.
When working with ECC-protected memory, only a few operations are required in software during start-up.
After switching the power on, we must initialize the memory to a known state. Otherwise, we get notifications of ECC errors that are, in fact, caused by the random content of the uninitialized memory cells. Afterward, the hardware recovers soft errors transparently during regular operation. Most ECC memories provide recovery notifications, which we use to monitor the memory.
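A minimal sketch of the start-up initialization could look like the following. The function name is hypothetical, and the code assumes that the ECC logic works on 32-bit words; the required access width and the way recovery notifications are enabled depend on the actual device and are not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Write a known pattern to every word of an ECC-protected region so that
 * the hardware stores valid check bits for each word.
 * Assumption: 32-bit ECC words; adapt the word type to the real device. */
static void ecc_region_init(volatile uint32_t *start, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        start[i] = 0u;   /* full-width write, no read of the old content */
    }
}
```

The loop only writes and never reads the old content, so no ECC error is raised by the random power-on state of the cells.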
With this excellent hardware support, we are well-prepared for soft errors.
Software measures
Unfortunately, there are several system or hardware engineering reasons why we should prepare our software for systems without hardware measures. Examples: the best-fit microcontroller does not support ECC memory, or we need external memory without ECC.
We recommend performing a data storage analysis for soft error sensitivity in projects. This analysis helps decide which measure we use for each data storage. We base the decision to select a technique to avoid, recover, or detect soft errors on:
Amount of used memory (the more used memory, the higher the probability of a soft error)
Lifetime of data (the longer the lifetime, the higher the probability of a soft error)
Worst-case system behavior on a soft error (the more critical the impact, the higher the detection rate we need)
During this analysis, soft errors in unused memory cells are acceptable. With that approach, we keep performance and availability as high as possible while protecting our system against soft errors. We can classify the data storage:
Constant data (e.g., op-code, configuration tables)
Enumeration data (e.g., system states and modes)
Dynamic data (e.g., process values)
Temporary data (e.g., local variables, continuously refreshed memory)
The variables we can classify as Temporary data are uncritical and can stay without further protection measures. Therefore, cyclically refreshing data from independent data sources is a good strategy.
// no refresh: read/modify/write
data = data + 1;
// refresh ok: update with independent value
data = value * 2;
Protecting constant data
To protect constant data in memory (like the application op-code or configuration tables), we calculate a Hash Value of the data and check the result with an expected (and stored) value.
The strength of protection depends on the number of bits used in the Hash value. The CRC32 algorithm is widely used, but there are better choices because the collision rate depends on the width of the Hash value. A Hash collision occurs when two different constant data memory images result in the same Hash value.
Jeff Preshing provides a helpful overview of the collision probabilities. We see that with a rising number of bytes in our constant data memory, the probability of a CRC32 collision rises, too. For real projects, this leads to constraints like "CRC32 can protect a maximum of 4096 bytes" (the number of bytes depends on the Safety Integrity Level).
Therefore, we will look at FNV Hash values with a scalable strength from 32 bits up to 1024 bits. The following pseudo-code shows the algorithm. It is a fast and efficient way of calculating a Hash value:
hash = offset_basis
for each octet_of_data to be hashed
    hash = hash * FNV_prime
    hash = hash xor octet_of_data
return hash
The offset_basis and the FNV_prime are fixed values, dependent on the bit width of the hash value.
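As a sketch, the pseudo-code above translates to C for the 32-bit width as follows. The constants are the published FNV values for 32 bits; the function names and the helper const_data_is_intact are illustrative and not taken from any particular library.

```c
#include <stddef.h>
#include <stdint.h>

#define FNV32_OFFSET_BASIS 2166136261u  /* 0x811C9DC5 */
#define FNV32_PRIME        16777619u    /* 0x01000193 */

/* FNV-1 hash, 32-bit variant: multiply first, then XOR each octet. */
static uint32_t fnv1_32(const uint8_t *data, size_t len)
{
    uint32_t hash = FNV32_OFFSET_BASIS;
    for (size_t i = 0; i < len; i++) {
        hash *= FNV32_PRIME;   /* unsigned arithmetic wraps modulo 2^32 */
        hash ^= data[i];
    }
    return hash;
}

/* Returns 1 if the memory image still matches the stored hash, else 0. */
static int const_data_is_intact(const uint8_t *data, size_t len,
                                uint32_t expected)
{
    return fnv1_32(data, len) == expected;
}
```

The expected value is typically calculated at build time and stored next to the image; the check can then run at start-up or cyclically during operation.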
Protecting enumeration data
We avoid values like "1, 2, 3, etc." for storing enumeration data because a single bit-flip can change valid data into different valid data. We have no way to detect this bit-flip.
We can select specific values, which ensures that a single bit-flip results in an invalid value. If at least two bit-flips are necessary to change a valid value into another valid value, we call the selection: "values with a Hamming Distance (HD) of 2".
HD = number of different bits
For real projects, we choose values with a Hamming Distance of 4. See the following hexadecimal byte values:
{ 00, 0f, 33, 3c,
55, 5a, 66, 69,
96, 99, a5, aa,
c3, cc, f0, ff
}
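The following C sketch shows one way to check such a value set. It computes the Hamming Distance between two bytes, verifies the smallest distance within the set, and tests whether a stored value is valid. All names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* The 16 byte values listed above. */
static const uint8_t hd4_values[16] = {
    0x00, 0x0f, 0x33, 0x3c, 0x55, 0x5a, 0x66, 0x69,
    0x96, 0x99, 0xa5, 0xaa, 0xc3, 0xcc, 0xf0, 0xff
};

/* Number of differing bits between two bytes. */
static unsigned hamming_distance(uint8_t a, uint8_t b)
{
    uint8_t diff = (uint8_t)(a ^ b);
    unsigned count = 0;
    while (diff != 0u) {
        count += diff & 1u;
        diff >>= 1;
    }
    return count;
}

/* Smallest distance between any two distinct values of the set. */
static unsigned min_pairwise_distance(const uint8_t *set, size_t n)
{
    unsigned min = 8;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            unsigned d = hamming_distance(set[i], set[j]);
            if (d < min) {
                min = d;
            }
        }
    }
    return min;
}

/* Returns 1 if value is one of the allowed enumeration values, else 0. */
static int enum_is_valid(uint8_t value)
{
    for (size_t i = 0; i < 16; i++) {
        if (hd4_values[i] == value) {
            return 1;
        }
    }
    return 0;
}
```

A check like min_pairwise_distance is useful as a unit test whenever the value set is changed, because a single wrong constant silently reduces the distance.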
Theoretically, we can correct single bit-flips by searching for the valid value with the lowest Hamming Distance to the value we read. This value is most likely the correct value.
Look at 0x3c. Assume bit 1 flips: we get 0x3e. First, this is an invalid value. Second, we can check the Hamming Distance (in brackets below) to all valid data values; we get:
{ 00(5), 0f(3), 33(3), 3c(1),
55(5), 5a(3), 66(3), 69(5),
96(3), 99(5), a5(5), aa(3),
c3(7), cc(5), f0(5), ff(3)
}
If only 1 bit flipped to produce 0x3e, then 0x3c, the value with the lowest HD, is the correct value. In reality, we do not know how many bits flipped. This is the reason for the classification "most likely".
We are not satisfied with most likely correct values in safety-critical software. For this reason, we usually raise a safety exception, which shuts down or restarts the device.
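The nearest-value search and the detection-only policy can be sketched in C as follows. The names are illustrative, and the return code -1 stands in for whatever safety exception mechanism the real system uses.

```c
#include <stddef.h>
#include <stdint.h>

static const uint8_t valid_values[16] = {
    0x00, 0x0f, 0x33, 0x3c, 0x55, 0x5a, 0x66, 0x69,
    0x96, 0x99, 0xa5, 0xaa, 0xc3, 0xcc, 0xf0, 0xff
};

/* Number of differing bits between two bytes. */
static unsigned bit_distance(uint8_t a, uint8_t b)
{
    uint8_t diff = (uint8_t)(a ^ b);
    unsigned count = 0;
    while (diff != 0u) {
        count += diff & 1u;
        diff >>= 1;
    }
    return count;
}

/* Finds the valid value closest to raw and stores its distance in *dist. */
static uint8_t nearest_valid(uint8_t raw, unsigned *dist)
{
    uint8_t best = valid_values[0];
    unsigned best_d = bit_distance(raw, best);
    for (size_t i = 1; i < 16; i++) {
        unsigned d = bit_distance(raw, valid_values[i]);
        if (d < best_d) {
            best_d = d;
            best = valid_values[i];
        }
    }
    *dist = best_d;
    return best;
}

/* Detection-only policy: returns 0 and stores the value in *out when raw
 * is intact; returns -1 when a soft error is detected, so that the caller
 * can raise the safety exception instead of using a "most likely" value. */
static int enum_read_checked(uint8_t raw, uint8_t *out)
{
    unsigned d;
    uint8_t v = nearest_valid(raw, &d);
    if (d != 0u) {
        return -1;
    }
    *out = v;
    return 0;
}
```

For the example above, nearest_valid(0x3e, &d) returns 0x3c with d set to 1, while enum_read_checked(0x3e, &out) reports an error without touching out.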
Protecting dynamic data
For dynamic data, any value is a valid value. We must add redundancy to detect a change (e.g., a bit-flip) in the value. A simple way of doing this is to store a mirrored copy of the variable in a different memory area. The basic pseudo-code for writing a value may look like:
variable = value;
variable_mirror = NOT(value);
We can now check the dynamically changing variable with the introduced redundancy at any time:
if variable != NOT(variable_mirror)
    soft error detected
return variable
Note: This simple pseudo-code grows in complexity (and required run-time) when using interrupts, multi-threaded environments, DMA transfers, data caches, or multi-processor devices.
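For the simple single-threaded case, the write and check steps can be sketched in C as follows. The function names are illustrative. The sketch assumes that the caller places the variable and its mirror in different memory areas (for example via linker sections), and it does not address the concurrency issues listed in the note.

```c
#include <stdint.h>

/* Store a value together with its bitwise-inverted mirror.
 * Assumption: var and mirror point to different memory areas. */
static void mirrored_write(uint32_t *var, uint32_t *mirror, uint32_t value)
{
    *var    = value;
    *mirror = ~value;
}

/* Returns 0 and stores the value in *out when both copies agree;
 * returns -1 when a soft error is detected in either copy. */
static int mirrored_read(const uint32_t *var, const uint32_t *mirror,
                         uint32_t *out)
{
    if (*var != (uint32_t)~(*mirror)) {
        return -1;
    }
    *out = *var;
    return 0;
}
```

Storing the inverted value rather than a plain copy also detects faults that force both locations to the same pattern, such as all zeros.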
Conclusion
We discussed the need for soft error protection measures for long-running embedded devices.
An overview of widely used hardware and software measures covers:
ECC protection for memory devices
Hash values for constant data memory
Hamming distance for enumeration values
Redundancy for dynamic data memory
We can classify all measures into one of the following classes: Hash Values, Hamming Codes, and Redundancy. These are the three main measures used in self-test detection algorithms and monitoring of system plausibility.