Monday, January 19, 2015

Memory optimizing for embedded system products

Optimization is important to embedded software developers because they are always facing limited resources. So, being able to control the size and speed trade-off with code is critical. It is less common for thought to be given to the optimization of data, where there can be a similar speed-versus-size tension. This article looks at how this conflict comes about and what the developer can do about it.

A key difference between embedded and desktop system programming is variability: every Windows PC is essentially the same, whereas every embedded system is different. This variability has a number of implications: tools need to be more sophisticated and flexible; programmers need to be ready to accommodate the specific requirements of their system; and standard programming languages are mostly non-ideal for the job. This last point leads to a key issue: control of optimization.

Optimization is a set of processes and algorithms that enable a compiler to advance from translating code from (say) C into assembly language to translating an algorithm expressed in C into a functionally identical one expressed in assembly. This is a subtle but important difference.

Data/memory optimization
A key aspect of optimization is memory utilization. Typically, a decision has to be made in the trade-off between having fast code or small code - it is rare to have the best of both worlds. The same decision applies to data, because the way data is stored in memory affects its access time. With a 32-bit CPU, if every item is aligned to a word boundary, access is fast; this is termed 'unpacked' data. Alternatively, if bytes of data are stored as densely as possible, it may take more effort to retrieve an item and hence access is slower; this is 'packed' data. So the choice is much the same as with code: compact data that is slow to access, or some wasted memory but fast access to data.

For example, this structure:

   struct
   {
      short two_byte;
      char one_byte;
   } my_array[4];


could be mapped into memory in a number of ways. The C language standard gives the compiler considerable freedom here: members must appear in their declared order, but padding may be inserted between and after them. Two possibilities are: packed, where the three data bytes of each element are stored contiguously and the whole array occupies 12 bytes; or unpacked, where each element is padded out to the next word (16-bit) boundary, wasting one byte per element, so the array occupies 16 bytes.

Unpacked could be even more wasteful. The layout just described assumes word (16-bit) alignment. Long word (32-bit) alignment would result in 5 bytes being wasted for every 3 bytes of data!
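
A quick way to see which layout a particular compiler has chosen is to print the structure's size and member offsets. The following is a minimal sketch in standard C; it assumes a hosted environment with printf available for the test:

   #include <stdio.h>
   #include <stddef.h>

   struct sample
   {
      short two_byte;
      char one_byte;
   };

   int main(void)
   {
      /* sizeof includes any trailing padding; offsetof reveals gaps between members */
      printf("sizeof(struct sample) = %u\n", (unsigned)sizeof(struct sample));
      printf("offset of two_byte    = %u\n", (unsigned)offsetof(struct sample, two_byte));
      printf("offset of one_byte    = %u\n", (unsigned)offsetof(struct sample, one_byte));
      return 0;
   }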

Most embedded compilers have a switch to select what kind of code generation and optimization is required. However, there may be a situation where you decide to have all your data unpacked for speed, but have certain data structures where you would rather save memory by packing. In this case, the language extension keyword packed may be applied, thus:

   packed struct
   {
      short two_byte;
      char one_byte;
   } my_array[4];


This overrides the optimization setting for this one object.

Alternatively, you may need to pack all the data to save memory, and have certain items that you want unpacked either for speed or for sharing with other software. This is where the unpacked extension keyword applies.

It is unlikely that you would use both packed and unpacked keywords in one program, as only one of the two code generation options can be active at any one time.
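
Note that packed and unpacked are toolchain-specific extensions, and the exact spelling varies. As a hedged illustration, GCC and Clang express per-object packing with an attribute, and many compilers also accept a pragma; the structures below are simply the earlier example rewritten in those forms:

   /* GCC/Clang attribute form: pack this one structure regardless of the global setting */
   struct __attribute__((packed)) my_struct
   {
      short two_byte;
      char one_byte;
   };

   /* Pragma form accepted by several compilers: 1-byte alignment for what follows */
   #pragma pack(push, 1)
   struct my_other_struct
   {
      short two_byte;
      char one_byte;
   };
   #pragma pack(pop)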

Other data optimizations
Space optimization. As previously discussed, modern embedded compilers provide the opportunity to minimize the space used by data objects; this may be controlled quite well by the developer. However, this optimization is only to the level of bytes, which might not be good enough.

For example, imagine an application that uses a large table of values, each of which is in the range 0 to 15. Clearly this requires 4 bits of storage (a nibble), so keeping them in bytes would only be 50% efficient. It is the developer’s job to do better (if memory footprint is deemed to be of greater importance than access time). There are broadly two ways to address this problem.

One way is to use bit fields in structures. This has the advantage that the compiler can readily optimize memory usage, particularly if the target CPU has instructions suited to bit-field access. The downside is that bit fields within a structure cannot be indexed without writing additional code, but this is not too difficult. The following code shows how to access nibbles in an array of structures:

   struct nibbles
   {
      unsigned n0 : 4;
      unsigned n1 : 4;
      unsigned n2 : 4;
      unsigned n3 : 4;
   } mydata[100];

   unsigned get_nibble(struct nibbles words[], unsigned index)
   {
      unsigned nibble;

      nibble = index % 4;      /* which nibble within the structure */
      index /= 4;              /* which structure in the array */
      switch (nibble)
      {
      case 0:
         return words[index].n0;
      case 1:
         return words[index].n1;
      case 2:
         return words[index].n2;
      case 3:
      default:                 /* default keeps the compiler happy about all paths returning */
         return words[index].n3;
      }
   }


A similar put_nibble() function would be required, of course.
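
As a sketch of what that might look like, following the same pattern (the name, and the absence of range checking on value, are assumptions):

   void put_nibble(struct nibbles words[], unsigned index, unsigned value)
   {
      unsigned nibble;

      nibble = index % 4;      /* which nibble within the structure */
      index /= 4;              /* which structure in the array */
      switch (nibble)
      {
      case 0:
         words[index].n0 = value;
         break;
      case 1:
         words[index].n1 = value;
         break;
      case 2:
         words[index].n2 = value;
         break;
      default:
         words[index].n3 = value;
         break;
      }
   }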

The other way to code a solution would be to perform all the bit shifting explicitly in the code, which is really just emulating what the compiler might generate. It is unlikely that a human programmer could produce code substantially more efficient than a modern compiler.
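
For illustration, a hand-coded version over a plain byte array might look like the following sketch; the array name, its size, and the choice of which nibble occupies the high half of each byte are assumptions:

   unsigned char packed_table[50];   /* 100 nibbles, two per byte */

   unsigned get_packed_nibble(const unsigned char table[], unsigned index)
   {
      unsigned char byte = table[index / 2];   /* the byte holding this nibble */

      if (index % 2)
         return (byte >> 4) & 0x0Fu;           /* odd index: high nibble */
      else
         return byte & 0x0Fu;                  /* even index: low nibble */
   }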

Speed optimization. There is little a developer can do to improve the speed of access to data beyond what the compiler does (i.e., leaving the data unpacked for fast access). One option, however, is to locate data in the fastest available memory. An embedded toolchain includes a linker, which normally has the flexibility to effect this placement. This opens up a few possibilities for consideration:

The fastest place to keep data is in a CPU register, but these are in short supply and should be used sparingly. Most compilers make smart choices for register optimization.

RAM is the fastest type of memory in most systems. Obviously, variables tend to be located in RAM, but it may be worthwhile to ensure that constant data is copied into RAM as well. This is commonly done automatically, as code is normally copied from flash to RAM for execution.

Microcontrollers typically have on-chip RAM, which is faster than external memory, so ensuring that speed-critical data is located there makes sense; a sketch of how this placement might be expressed appears below.

Memory is commonly cached into an internal buffer for fast access. Some CPUs permit locking of a cache so that the contents are always immediately available.
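
How the placement mentioned above is expressed depends on the toolchain. With GCC-style tools, one common approach is to tag an object with a named section in the source and then map that section onto on-chip RAM in the linker script; the section and variable names below are assumptions, offered only as a sketch:

   /* Request placement of this buffer in a dedicated section (GCC-style syntax).
      The linker script must then locate the .fast_data section in on-chip RAM. */
   __attribute__((section(".fast_data")))
   int sample_buffer[256];

The corresponding linker script entry would then assign the .fast_data section to the memory region describing the internal SRAM.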
