Data Storage in Embedded Systems

by Matthias Riegel

Basic thoughts about data storage and management in Embedded Systems.

In this article, I will show some data structures for storing groups of records. They can all be used in any kind of storage (RAM, NVRAM, Flash, hard disk), but some may be more suitable for some storage devices than others.

Characteristics

The data structures differ in memory consumption, suitability for different storage media (Flash, disk, RAM) and the computational cost of the different operations:

Add: Insert a new record into a data structure

Delete: Remove a record from a data structure

Update: Modify a record

Find: Get the location of a record with a specific attribute value, either because its contents are needed or because it is to be modified/deleted.

Select: Sometimes a set of records is needed whose attribute lies in a range, or which have the lowest values. If not otherwise stated, finding them can be done with trivially modified find operations.

Sort: In many applications outside the embedded world, sorted lists of records are used, but I will ignore this operation as I see few applications in Embedded Systems. The complexity is either that of your sorting algorithm, or O(N) if the data structure is already ordered by an attribute.

Records can either have a fixed size, or a dynamic size (e.g. strings).

An attribute or group of attributes used to find/select a set of records will be called a key.

Data Structures

Some structures to store data

Unordered

Records of the same size are stored one after the other, without any sorting or management.

For constant data like configurations, only Find is needed, which just steps through the list, comparing each record with the search value, until it finds the record or reaches an end marker or (if the length of the data structure is known) the last record. This completes in O(N).

For dynamic storage, records must have a deleted bit which is set to 1 for empty or deleted records. The delete and update operations need to find the record and set its deleted bit or change other attributes. To add a record, you must find a record with the deleted bit set and overwrite it.

In its simplest form, unordered storage is well suited for small data sets where O(N) may still be extremely fast. The implementation is extremely simple and the space consumption is minimal. In case of hardware failure, there are no inconsistent states in the data structure (the contents of a record may still be inconsistent).
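As a minimal sketch of the idea (the record layout, field sizes and function names are made up for illustration), a fixed-size unordered store with a deleted bit could look like this:

```c
#include <stdint.h>

#define N_RECORDS 8

typedef struct {
    uint8_t  deleted;   /* 1 = slot is empty or deleted */
    uint16_t key;
    uint16_t value;
} record_t;

static record_t table[N_RECORDS];

static void storage_init(void)
{
    for (int i = 0; i < N_RECORDS; i++)
        table[i].deleted = 1;   /* all slots start out empty */
}

/* find: step through the list comparing keys, O(N); returns index or -1 */
static int rec_find(uint16_t key)
{
    for (int i = 0; i < N_RECORDS; i++)
        if (!table[i].deleted && table[i].key == key)
            return i;
    return -1;
}

/* add: overwrite the first record with the deleted bit set, O(N) */
static int rec_add(uint16_t key, uint16_t value)
{
    for (int i = 0; i < N_RECORDS; i++)
        if (table[i].deleted) {
            table[i].key     = key;
            table[i].value   = value;
            table[i].deleted = 0;
            return i;
        }
    return -1;  /* no free slot */
}

/* delete: find the record and set its deleted bit */
static int rec_delete(uint16_t key)
{
    int i = rec_find(key);
    if (i >= 0)
        table[i].deleted = 1;
    return i;
}
```

Note that delete only flips one bit, so an interruption leaves the structure itself consistent, exactly as described above.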

Improvements for specific access patterns:

  • Variable-length records can be handled by storing the length of a record as its first element; find reads the length and then skips this number of bytes to get to the next record. delete and add have to be avoided or use some technique to avoid fragmentation; update only works if the record does not become larger. By sticking to a few distinct block sizes (e.g. 32, 64 and 128 bytes), fragmentation can be reduced by merging blocks when deleting a record.

  • To reduce erases on flash storage, there is no update; instead, data is added with the same key in the next unused cell (flipping an unused bit from 1 to 0). delete can flip a valid bit from 1 to 0. find has to start at the end of the reserved area and only consider the first matching record. To select multiple values, a list of the keys of already visited records has to be kept so outdated copies can be ignored. Every once in a while, the whole data structure needs to be cleaned up to remove outdated copies. This is effectively a kind of journal (described below). For details on an actual usage scenario, see Fridolin's article on FEE Emulation in Flash.

  • For many add and delete operations, both can be completed in O(1) by maintaining a linked list of free records as described below (keeping the link pointer in the space that would be used for data by valid records).
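The last point can be sketched like this, assuming small 8-bit slot indices and an invented payload type; the link to the next free slot reuses the space that valid records use for data:

```c
#include <stdint.h>

#define N_SLOTS 8
#define NIL     0xFF   /* invalid index marks the end of the free list */

typedef struct {
    uint8_t deleted;      /* 1 = slot is on the free list */
    union {
        uint16_t data;    /* payload while the record is valid */
        uint8_t  next;    /* next free slot while it is deleted */
    } u;
} slot_t;

static slot_t  slots[N_SLOTS];
static uint8_t free_head;

static void pool_init(void)
{
    for (uint8_t i = 0; i < N_SLOTS; i++) {
        slots[i].deleted = 1;
        slots[i].u.next  = (uint8_t)(i + 1 < N_SLOTS ? i + 1 : NIL);
    }
    free_head = 0;
}

/* add: pop a slot from the free list, O(1) */
static int pool_add(uint16_t data)
{
    if (free_head == NIL)
        return -1;                 /* storage full */
    uint8_t i = free_head;
    free_head = slots[i].u.next;
    slots[i].deleted = 0;
    slots[i].u.data  = data;
    return i;
}

/* delete by index: push the slot back onto the free list, O(1) */
static void pool_delete(uint8_t i)
{
    slots[i].deleted = 1;
    slots[i].u.next  = free_head;
    free_head = i;
}
```

find still scans linearly; only add and delete become O(1).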

Sorted Linked List

Records of the same size are stored without a particular order, but each record has a link field that points to the record with the next higher key (or the next lower if you are mainly interested in high keys). A pointer to the record with the lowest key is kept outside the data structure. This data structure only makes sense if data should be modified; therefore, a linked list of empty fields is kept as well, using the same link field. The head of the empty list is also kept outside the data structure. No deleted or valid flag is needed, as any record belongs either to the empty list or to the data list. The end of a list is indicated by an invalid value in link.

To find or update a record, the linked list is traversed until the record with the search value is found, which is O(N).

To add a record, the first record from the free list is taken and inserted into the data list: you have to find the record with the highest key that is still below the new record's key and splice the new record in after it. Maintaining the free list is O(1), finding the lower record is O(N) and inserting is O(1), so add is O(N).

To delete a record, the steps of add have to be reverted, which is also O(N).

Selecting the record with the lowest key now takes only O(1).

This data structure is useful when records with lower keys are added/removed/updated more often or to get cheap sort.

A linked list can be used as a stack (push and pop at the list head with O(1)) while still allowing you to view (called peek), add, delete or update any element below the top of the stack. In this way it is used in many embedded systems to keep free resources until they are next used.
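The sorted data list plus free list could be sketched as follows (index widths and names are illustrative, not from the article); both lists share the same link field, and the two heads live outside the record array:

```c
#include <stdint.h>

#define N_NODES 8
#define NIL     0xFF   /* invalid index = end of a list */

typedef struct {
    uint16_t key;
    uint8_t  link;     /* next record in whichever list this one is on */
} node_t;

static node_t  nodes[N_NODES];
static uint8_t data_head;   /* record with the lowest key */
static uint8_t free_head;   /* head of the free list */

static void list_init(void)
{
    data_head = NIL;
    free_head = 0;
    for (uint8_t i = 0; i < N_NODES; i++)
        nodes[i].link = (uint8_t)(i + 1 < N_NODES ? i + 1 : NIL);
}

/* add: pop a node from the free list (O(1)), then scan for the
 * insertion point (O(N)) and splice the node in (O(1)) */
static int list_add(uint16_t key)
{
    if (free_head == NIL)
        return -1;
    uint8_t n = free_head;
    free_head = nodes[n].link;
    nodes[n].key = key;

    if (data_head == NIL || nodes[data_head].key >= key) {
        nodes[n].link = data_head;   /* new lowest key */
        data_head = n;
    } else {
        uint8_t p = data_head;
        while (nodes[p].link != NIL && nodes[nodes[p].link].key < key)
            p = nodes[p].link;
        nodes[n].link = nodes[p].link;
        nodes[p].link = n;
    }
    return n;
}

/* select lowest: just follow the data head, O(1) */
static int list_lowest(void)
{
    return (data_head == NIL) ? -1 : (int)nodes[data_head].key;
}
```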

Modifications:

  • When storing records with variable size, the same restrictions as with unordered storage apply, except that find is not slowed down additionally.

  • In a Double Linked List, a record points to both the next element and the previous element. This allows operating cheaply on both ends of the list (O(1)) so it can be used as a Queue, also called Deque (double ended queue) or Fifo (first in first out).

Sorted Storage

Records of fixed size are stored in order of ascending key.

The find operation uses binary search:

  1. Set a search window to the whole Data Structure
  2. Compare the attribute you search for with the one of the record in the middle of the window
  3. If the record's attribute was larger, set the search window to the lower half of the previous window, otherwise to the upper half
  4. If the window size is greater than 1, go to step 2
  5. Compare the attribute of the only record in the window with the search attribute. You either have the record, or there is none.

The search completes in O(log(N)).

You can check in step 2 if you already have the right record and, if so, make an early exit, which will save half a loop iteration on average (0 loops for half the records, 1 for a quarter of the records, 2 for an eighth of the records, ...). Without this check, every lookup takes the same time (until the size changes across a power of 2).
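The steps above, including the early exit, can be written as follows; the half-open search window and the function name are my own choices:

```c
#include <stdint.h>

/* Binary search over records sorted by ascending key.
 * Returns the index of the matching record, or -1 if there is none. */
static int bin_find(const uint16_t *keys, int n, uint16_t key)
{
    int lo = 0, hi = n;              /* step 1: window is [lo, hi) */
    while (hi - lo > 1) {            /* step 4: loop while window > 1 */
        int mid = lo + (hi - lo) / 2;
        if (keys[mid] == key)
            return mid;              /* early exit */
        if (keys[mid] > key)         /* step 3: halve the window */
            hi = mid;                /* lower half */
        else
            lo = mid;                /* upper half */
    }
    /* step 5: one record left; either it matches or none does */
    return (hi > lo && keys[lo] == key) ? lo : -1;
}
```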

The update operation uses find and then modifies the record.

To add or delete a record, you have to find the position of the next higher record and then move all records with a higher key, which takes O(N) operations. Records cannot simply be marked as deleted since the binary search would not work.

Selecting the record with the lowest key only takes O(1).

Since find and update are very fast, this data structure is useful for records that are read or written often but where records are rarely added or removed.

Michael used it (with linear search instead of binary search) in his article on Dynamic Object Directory for CANopen.

Storage by key (Lookup table)

Records of fixed size are stored at the position key * record_size in a table of the size N * record_size where N is the number of possible values for key.

The space for all possible keys must be reserved from the beginning so this is only applicable for small records, small key sizes or if most possible values of the key are in use.

Find and Update have O(1).

Selecting the record with the lowest key takes O(1) as long as the table is densely populated; with many deleted entries, the scan to the first valid record approaches O(N).

To support add and delete operations, a deleted bit is needed; both operations set it accordingly and are otherwise like update.

The key is not stored in the record, as it can be calculated from the position of a record.
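A sketch assuming a hypothetical 6-bit key, so the table has 64 slots; the key is never stored, only used as the index:

```c
#include <stdint.h>

#define N_KEYS 64   /* all possible values of a 6-bit key */

typedef struct {
    uint8_t  deleted;   /* 1 = no record stored for this key */
    uint16_t value;
} entry_t;

static entry_t lut[N_KEYS];

static void lut_init(void)
{
    for (int i = 0; i < N_KEYS; i++)
        lut[i].deleted = 1;
}

/* add/update: write directly at position key, O(1) */
static void lut_add(uint8_t key, uint16_t value)
{
    lut[key].value   = value;
    lut[key].deleted = 0;
}

/* find: a single index operation, O(1); returns value or -1 */
static int lut_find(uint8_t key)
{
    return lut[key].deleted ? -1 : (int)lut[key].value;
}

/* delete: set the deleted bit, O(1) */
static void lut_delete(uint8_t key)
{
    lut[key].deleted = 1;
}
```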

Indices

An index extends a data structure with some management info about the location of records depending on a key. In this case, the key that the data structure uses is called the primary key while the index speeds up the find, sort or select operation for the secondary key. Multiple indices can be attached to a data structure to support multiple secondary keys.

For some data structures, changing the primary key of a record requires a complete delete and add operation, effectively moving the complete record. Changing a secondary key only requires a delete and add in the index, which means moving one pointer.

While indices can speed up select, sort and find when using secondary keys, they have to be maintained in every operation that modifies the primary data structure.

Lookup Table

When a secondary key is densely populated and unique, the lookup table described above can be used to store [the position of a record inside its primary data structure] at [the position [secondary key] in the lookup table]. (At this point I find human language lacking easy tools to communicate grouping correctly, so I used brackets.)

The find operation when searching for a secondary key can be reduced from O(N) to O(1) and sorting by the secondary key becomes O(N). The cost of delete, update and add increases by a constant amount O(1).

When the secondary key is not unique, one short linked list for each distinct value has to be maintained. The lookup table stores a pointer to one record, each record stores a pointer to the next in a link_2nd_key field. Traversing them during find or select or maintaining them during delete, update and add, takes O(k) operations where k is the number of records that have the same secondary key. k should be small.
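One possible shape for these per-value chains, with invented field names apart from link_2nd_key; add prepends to the chain in O(1), and traversing all records with the same secondary key is O(k):

```c
#include <stdint.h>

#define NIL    0xFF
#define N_REC  8
#define N_KEY2 4   /* possible values of the (hypothetical) secondary key */

typedef struct {
    uint16_t data;
    uint8_t  key2;          /* non-unique secondary key */
    uint8_t  link_2nd_key;  /* next record with the same key2, NIL at end */
} rec_t;

static rec_t   recs[N_REC];
static uint8_t n_recs;
static uint8_t idx2[N_KEY2];  /* key2 -> first record of its chain */

static void idx_init(void)
{
    n_recs = 0;
    for (int k = 0; k < N_KEY2; k++)
        idx2[k] = NIL;
}

/* add: append the record and prepend it to its key2 chain, O(1)
 * extra cost on top of the primary add */
static int idx_add(uint16_t data, uint8_t key2)
{
    if (n_recs >= N_REC)
        return -1;
    uint8_t i = n_recs++;
    recs[i].data         = data;
    recs[i].key2         = key2;
    recs[i].link_2nd_key = idx2[key2];
    idx2[key2] = i;
    return i;
}

/* select: count all records sharing key2 by walking the chain, O(k) */
static int idx_count(uint8_t key2)
{
    int count = 0;
    for (uint8_t i = idx2[key2]; i != NIL; i = recs[i].link_2nd_key)
        count++;
    return count;
}
```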

Hashmap

The hashmap is similar to a Lookup Table Index with a non-unique secondary key, but the key does not have to be densely populated to be efficient, and the lookup is not done by the key value but by a hash value computed from the key. The many short linked lists contain all records whose secondary key produces the same hash value.

As above, find, add, delete and update take O(k) where k is the number of items that have the same hash as the searched value.

The size of the table is L. A small L leads to lower space consumption, while a larger L allows faster lookups. Using L = 1.2 * N is reasonable. The hash value has to be in the range [0..L-1].

It is a good idea to make L a power of two because then any hash function can be used and cheaply scaled into the correct range by a bitwise AND with L-1. A good hash function will distribute input values evenly across the output range, so that long lists where hash values collide are unlikely. If the key is itself well distributed, no hash function besides scaling into range is needed.
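For example, scaling an arbitrary hash into a power-of-two table; the multiplicative hash is only a placeholder, not a recommendation from the article:

```c
#include <stdint.h>

#define L 16u   /* table size, a power of two */

/* Placeholder hash: Knuth's multiplicative constant spreads
 * consecutive keys across the 32-bit output range. */
static uint32_t hash(uint32_t key)
{
    return key * 2654435761u;
}

/* Bitwise AND with L-1 is a cheap replacement for % L,
 * valid only because L is a power of two. */
static uint32_t bucket(uint32_t key)
{
    return hash(key) & (L - 1u);
}
```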

An article on hash functions that will go into some detail is coming up in the future.

Journaling

Hardware failure, loss of power or watchdogs can interrupt the storage sequence, which can lead to inconsistencies in the data structure, between index and data structure, among the attributes of a record or among different but related records in the same or different data structures. When opening the data structure(s), all records must be checked for consistency, which can be very time consuming. Data which is redundant (indices) can easily be regenerated, but the rest can be lost or irrecoverably corrupted.

A data structure which is immune to these problems is the journal. A journal can be used to store data directly as the 2nd variation of unordered storage does, but this requires a lot of space. A journal can also be used to keep a set of information while another data structure is modified. Should the modification be interrupted, the information from the journal can be used to make the data structure consistent again, which is described below.

A Journal consists of one or more blocks. Each block has an occupied and a valid bit, and space to store the record which should be modified in the primary data structure.

Any operation on the primary data structure is performed as follows:

  1. Find the lowest free block in the journal
  2. Set the block's occupied bit
  3. Store what kind of operation should be performed
  4. Store the new record in the block (if not doing a delete)
  5. Set the block's valid bit
  6. Perform the operation on the primary data structure
  7. Clear the block's occupied bit.

When opening the data structure, search for any blocks that have occupied set. For all occupied blocks that also have the valid bit set, redo the operation on the primary data structure and clear the block. Occupied blocks which are not valid can be cleared without further action.
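The block layout and the seven steps might be sketched as follows; the record size and operation codes are assumptions, and the actual redo of an operation is left application specific:

```c
#include <stdint.h>
#include <string.h>

#define N_BLOCKS 4

typedef enum { OP_ADD, OP_UPDATE, OP_DELETE } op_t;

typedef struct {
    uint8_t occupied;    /* set first (step 2), cleared last (step 7) */
    uint8_t valid;       /* set only once the block is complete (step 5) */
    op_t    op;
    uint8_t record[16];  /* copy of the record to apply */
} jblock_t;

static jblock_t journal[N_BLOCKS];

/* Steps 1-5: write the intent into the journal before touching the
 * primary data structure. Returns the block index, or -1 if full. */
static int journal_begin(op_t op, const void *rec, uint8_t len)
{
    for (int i = 0; i < N_BLOCKS; i++) {
        if (!journal[i].occupied) {
            journal[i].occupied = 1;                   /* step 2 */
            journal[i].op = op;                        /* step 3 */
            if (op != OP_DELETE)
                memcpy(journal[i].record, rec, len);   /* step 4 */
            journal[i].valid = 1;                      /* step 5 */
            return i;
        }
    }
    return -1;
}

/* Step 7: called after the primary operation (step 6) succeeded */
static void journal_commit(int i)
{
    journal[i].valid    = 0;
    journal[i].occupied = 0;
}

/* Recovery on open: count the valid blocks whose operation must be
 * redone (the redo itself would happen here), then clear everything.
 * Occupied but not valid blocks are discarded without action. */
static int journal_recover(void)
{
    int redo = 0;
    for (int i = 0; i < N_BLOCKS; i++) {
        if (journal[i].occupied) {
            if (journal[i].valid)
                redo++;   /* replay journal[i] on the primary structure */
            journal_commit(i);
        }
    }
    return redo;
}
```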

To consistently update multiple records, either all records are put into the same journal block, or multiple journal blocks need to link to each other, forming a list, in which only the valid bit of the last block counts. Variable-sized journal blocks are also possible, bringing the same complications as the variable-sized records described above.

With journaling, a data structure is always in one of the following states:

  • No occupied journal blocks: the data structure is in a consistent state.
  • One or more occupied but not valid blocks in the journal: an update was lost, but the data structure is still consistent.
  • One or more valid blocks in the journal: the data structure can be
    • unmodified (consistent),
    • partially updated (inconsistent), or
    • completely updated (consistent).
    The update can be completed and the data structure made consistent with the information from the journal block.

Thus, with journaling, an interruption during storage is no worse than an interruption during any other step in an application.

Steps (6.) and (7.) can be postponed to save time during an update, but then all operations have to check the journal before accessing the data structure. This can be a huge performance gain if the data structure is on a slow disk and the journal, with its several blocks, is on something fast like NVRAM.

To avoid frequent erasures on flash, a complete sector can be reserved for the journal. Instead of clearing a journal block, it is marked as outdated. Once there are no more free entries, all pending updates must be completed and the sector can be erased.

More

There are also graph oriented methods of storing data that can for example be used in describing networks, specialized data structures for spatial data, tries that store string keys along the way down a search tree, and many more.

Better performance for find, select and sort can be achieved by using more complicated but very fast tree structures like the B+ Tree (self-balancing, with several children per node, pointers to the records at the lowest level, and a kind of linked list allowing sorted retrieval; suitable for use on file systems) or T-Trees (a binary tree with pointers to several records in each node, lower and higher values in the left and right child nodes; suitable for use in RAM). More efficient add, update and delete can be achieved with better management of storage space, respecting the size of file system pages, virtual memory pages, cache usage, etc.

Conclusion

The data structures and algorithms that I described above can all be implemented relatively easily. Choosing the right data structures, primary and secondary keys, index methods and storage medium for data, index and journal gives huge flexibility to fine-tune data storage to an application's access patterns and real-time constraints if you know them in advance. You can implement, tweak and understand the whole storage system yourself.

A Relational Database Management System will do the same job, probably faster, without data corruption, supporting concurrent access, with advanced features like joins, subselects and views and provide means for redundancy across multiple hosts. It will also reduce your code size, debugging time and probably headaches. You only describe the data and the queries you want to make, the database engine chooses the layout that gives the best performance.
