Indexing through large arrays; more efficient way?
CCS Forum Index -> General CCS C Discussion
RoGuE_StreaK

Joined: 02 Feb 2010
Posts: 73

Posted: Sun Feb 12, 2012 6:22 pm

I'm playing back some sound effects stored in the PIC's internal memory; seems to play fine, but I'm wondering if there's a "better" way of doing this?

example array:
Code:
const char sndStartup[822][16]={
0x80,0x7F,0x7F,0x80,0x7F,0x7F,0x81,0x80,0x7E,0x7F,0x80,0x7F,0x7E,0x7D,0x7C,0x83,
0x84,0x87,0x8A,0x82,0x7F,0x81,0x80,0x85,0x83,0x7E,0x82,0x81,0x7B,0x7D,0x7D,0x7F,
...
};
As you can see, this sound has 822 rows and 16 columns (sorry can't remember the correct terminology). I'm currently just indexing through it like this:
Code:
sound = sndStartup[xCount][yCount];
yCount++;
if(yCount > 15)
{
   yCount = 0;
   xCount++;
   if(xCount >= 822)   // 822 rows, indexed 0..821
   {
      xCount = 0;
   }
}
i.e., count through the columns ("yCount"), then when the last column is reached, reset it to 0 and increment the row ("xCount"). This is part of a function which runs on a flag set by an interrupt at 16kHz: it sends whatever value is currently in "sound" to the PWM system, then looks up the next sample in time for the next flag.

PIC is currently a PIC18F2620 (64K memory built-in), moving towards a PIC24FJ64, so there's "plenty" of space for these arrays onboard; it just seems to me that there's probably a more efficient way of indexing through them.
Also note that the sounds don't always cleanly fill the array, so I have a few placeholder bytes to pad out the ends, and I have to keep a counter of the total samples played for that sound and compare it against a constant stating how many "actual" samples are in that particular array (e.g. "sndStartup" has 13148 samples, while the 822x16 array has 13152 cells).

I've read around the place that looking up arrays using variables is a very slow way of doing things, but I haven't found alternatives for large arrays such as this. It's working OK at the moment, but I still have to shoe-horn in quite a few more functions, so I'm looking at optimising before it becomes an issue.
Ttelmah

Joined: 11 Mar 2010
Posts: 19447

Posted: Mon Feb 13, 2012 5:17 am

This is where the classic crossover between an array and a pointer can be useful.
For instance (simplifying to a single dimensional array):
Code:

int8 n_array[x] = {...};   // x initialised elements

int16 ctr;
int8 val;
for (ctr=0;ctr<x;ctr++) {
   val=n_array[ctr];
}

//Versus
int8 * n_ptr;
int8 * max;

n_ptr=n_array;
max=n_ptr+x;

for (;n_ptr<max;n_ptr++) {
   val=*n_ptr;
}


In the first, 'ctr' counts through the elements of the array. At each access, 'ctr' is added to the base address of the array, and the result is then used as the address for the lookup.
In the second, n_ptr _is_ directly the required address. It counts from the first address in the array up to the last, and can be used directly as the memory address to access.

Obviously with a multi-dimensional array you may need to increment by the size of a row, for example, but the same basic approach can be used.
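The two loops above can be tried side by side in standard C; the array contents and sizes below are invented for illustration, with CCS's int8/int16 written as stdint types so this compiles on a desktop compiler:

```c
#include <stdint.h>

/* Hypothetical sample table: 4 rows of 16 bytes, standing in for the
   much larger sndStartup[822][16] in the thread. Unlisted cells are 0. */
#define ROWS 4
#define COLS 16

static const uint8_t n_array[ROWS][COLS] = { { 0x80, 0x7F } };

/* Indexed walk: each access recomputes base + row*COLS + col. */
uint32_t sum_indexed(void)
{
    uint32_t sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += n_array[r][c];
    return sum;
}

/* Pointer walk: one pointer sweeps the same block linearly,
   with no per-access address arithmetic beyond the increment. */
uint32_t sum_pointer(void)
{
    const uint8_t *p   = &n_array[0][0];
    const uint8_t *end = p + ROWS * COLS;
    uint32_t sum = 0;
    for (; p < end; p++)
        sum += *p;
    return sum;
}
```

Both functions visit the same bytes in the same order; only the addressing differs.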

Best Wishes
FvM

Joined: 27 Aug 2008
Posts: 2337
Location: Germany

Posted: Mon Feb 13, 2012 7:13 am

It should be noted that flash array access is actually performed by table read instructions, which involve considerable overhead. For time-critical applications, care should be taken to organise it efficiently by reading blocks of a certain size at once rather than individual words.

PIC24 has an additional feature named PSV (Program Space Visibility), but it can only map up to 32 kB of flash into data space.
RoGuE_StreaK

Joined: 02 Feb 2010
Posts: 73

Posted: Mon Feb 13, 2012 7:30 pm

OK... been off reading up on pointers, trying to get some sort of grasp, hopefully I've made some headway.

First up though, should I actually be declaring my "snd_array" as "const int8" rather than "const char"? I had a whole heap of trouble originally getting the array into the code, the const char in the format shown was the only way I could get it to work, but I don't believe I tried const int.

From the sounds of it, using a single dimension array would be a lot easier to use with pointers, but again I had issues trying to get everything in as a one-dimensional array, so had to (?) split it up. But if anyone knows of a trick to get it to work as one-dimensional, it would greatly simplify my code, conversion processing, and keeping track of where the sound is up to.

With a two-dimensional array, I can't find any C examples that make sense (to me) to bring over to CCS; does it need pointers within pointers (as it's an array of arrays), or can one pointer be used to point through all of the dimensions, where
*(n_ptr+16)
gives the next internal array? (i.e. changing the pointer value from flash_array[0][0] to flash_array[1][0]?)
I think I grasped the single-dimensional pointer bit OK, but going multi-dimensional is doing my head in!


Then again, with FvM's note, should I point to a single sub-array and copy its contents (16 cells) to a RAM array for quicker access? Or is this a moot point if using pointers?

RE: PIC24's PSV, I just had a quick look into that. From how I read it, is this an automatic thing, so that moving to PIC24 would negate some of these changes? Although I'm using 64K chips, at the moment it looks like my sound samples may total less than 32K, so does that mean it would be a non-issue anyway?
Ttelmah

Joined: 11 Mar 2010
Posts: 19447

Posted: Tue Feb 14, 2012 3:53 am

The point about the pointer, versus the array, is maths.

If you have an array, with two indexes [a & b say], then when you access:

array[a][b], the compiler has to take 'a', multiply it by the size of a row, then add 'b', and only then perform the table lookup to read the element. Quite a lot of arithmetic.

Now if you have an array like this, and declare a pointer, then initialise this with:

ptr=array;

then *ptr is the same as array[0][0].

However if you now increment ptr, it will address array[0][1]. Keep on going till you have incremented it 15 times, and it now addresses array[0][15]. Increment it again, and it now addresses array[1][0]. A two-dimensional array is still just a linear block of data in memory, and incrementing the pointer removes the need to recalculate the product each time. Not a big saving on the chips with hardware multiply, but still a good handful of instructions.
So a single pointer can walk through the entire table from any location you want.

Separately, even doing this, the code will have to set up the table lookup for each element in turn. The alternative is what FvM is talking about: read a whole row when required. So (for example):
Code:

   int8 ramrow[16];
   int16 rownum;
   int16 address;
   address=label_address(sndStartup);

   rownum=something; //set the row number you wish to retrieve

   read_program_memory((address+(rownum<<4)),ramrow,16);
   //ramrow now holds the entire row at 'rownum', and can be accessed much
   //faster than the ROM.


If you organise your table, so that the row size is the entire block you want to work with at a time, this is a much more efficient approach.

On char, int, int8 etc., these are all synonyms for one another on CCS. I happen to prefer using the explicitly 'sized' versions (int8, int16 etc.), since I tend to be writing code for a number of different processors, where in some cases a 'char' is _not_ an int8, or an int.

Add one more thing, on needing to set the array up as two-dimensional: this is basically an initialisation limit, with the compiler not being able to handle a single 'unbroken' initialisation of such a massive array. Even though you have to declare the array as multidimensional, it is still a single entity in memory.
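The "single entity in memory" point is guaranteed by the C language itself: the rows of a 2-D array are laid out back to back, so a byte pointer stepped past the end of one row lands exactly on the start of the next. A minimal standalone check (the array name and shape are invented):

```c
#include <stdint.h>

/* Small stand-in for sndStartup: 3 rows of 16 bytes, zero-filled. */
static const uint8_t tbl[3][16];

/* Returns 1 if advancing a byte pointer 16 places from [0][0]
   lands exactly on [1][0], i.e. the rows are contiguous. */
int rows_are_contiguous(void)
{
    const uint8_t *p = &tbl[0][0];
    return (p + 16) == &tbl[1][0];
}
```

This is why a single pointer can walk the whole table without ever caring about the row boundaries.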

Best Wishes
RF_Developer

Joined: 07 Feb 2011
Posts: 839

Posted: Tue Feb 14, 2012 4:07 am

Edit: This overlaps a lot with Ttelmah's post done at the same time.

There's a lot to think about here. First, consider a classic Von Neumann architecture, with one bus for both instructions and data, a single memory space, and no caching: just the sort of machine for which C was initially developed.

Many such machines include an auto-increment addressing mode of some sort, which allows the processor to step through memory efficiently. A compiler for these machines may, depending on its code generation strategy, be able to leverage such instructions to produce optimised sequential accessing of arrays, assuming (and it's a pretty big assumption) that it can recognise such accesses in the C code.

Even without such hardware assistance, sequential pointer access to the elements of an array is likely to be more efficient (faster and smaller code) than indexed access. The point is that indexed access, using C array indexing, requires a multiplication by the size of the element, except when the element size equals the granularity of the machine's addressing. So indexing bytes on a byte-addressable machine is simple, while accessing 32 bit words requires multiplying the index by 4. Multiplying by binary powers is normally simple and quick, as it can be done by shifts, but things are generally more complicated when the element size, such as with an array of structures, is not a power of two.

There is also a hidden overhead on many modern wider-word machines, for example the ARM7s, which are 32 bit machines but byte addressed, and the familiar x86 architecture too. For such machines the fastest, simplest way to access memory is in words aligned to four-byte boundaries. It's faster to access properly aligned 32 bit words than individual bytes, which have to be extracted from words, the rest of which may or may not be redundant.

Two and more dimensional arrays require an additional multiplication for each extra dimension. This can soon get expensive on time and code space, especially with machines that offer little or no hardware assistance for multiplications.

Pointers can provide useful efficiency gains, particularly speed of access compared to indexing, provided the accesses are sequential, i.e. stepping through arrays. If you require random access then pointers are generally pointless. C doesn't technically have multidimensional arrays; all arrays are one-dimensional, and instead it has arrays of arrays, with each "dimension" having its own indexing. Even then you must be aware of the order in which the dimensions are stored. For C and C++ the rightmost dimension has the fastest-changing address, so stepping through the array is much more efficient in one order than in any other.
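The row-major layout described above can be made concrete in standard C (names and the marker value are invented): element [row][col] lives at linear offset row*COLS + col, which is exactly the multiply the compiler has to insert for every indexed access.

```c
#include <stdint.h>

#define ROWS 3
#define COLS 16

/* One marker byte at [1][5]; every other cell is 0. */
static const uint8_t samples[ROWS][COLS] = {
    [1][5] = 0xAB,
};

/* Read via a flat byte pointer using the row-major offset formula:
   the rightmost index (col) varies fastest in memory. */
uint8_t read_linear(unsigned row, unsigned col)
{
    const uint8_t *flat = &samples[0][0];
    return flat[row * COLS + col];
}
```

Walking `flat` sequentially therefore visits [0][0], [0][1], ... [0][15], [1][0], and so on, which is the efficient traversal order.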

All this falls apart with C#, however. In C# there are (normally) no pointers, and collections carry the considerable overhead of full-blown objects. So with C# there are other ways to iterate through arrays, and many other forms of data collection than the simple array.

Back to C. All the above holds true for most processors, but what if the processor is NOT a Von Neumann type? What if it's a Harvard architecture processor, like the PIC?

PIC16s and 18s have data memory that is both banked and separate from code memory. This allows a compact instruction set and simple pipelining, as instruction and data access can proceed in parallel over their separate buses and memories. The down side is that direct access to data memory is limited to the bank size, 256 bytes; after that the banking needs to be changed to reach another 256 byte bank. Pointers are not simply addresses, they are bank/offset pairs, and are more complex to manage than simple linear addresses. All this is taken care of for us by the compiler, but it's generally a matter of luck whether even small arrays land in one bank, and it can vary from one compilation to the next.

Pointers to data memory still work more efficiently than indexing, however, especially as the 16s and 18s have only an 8 x 8 multiplier. This means that indexing is at its most efficient when the array is small (less than 256 bytes) or the element size is a simple power of two.

Constants are generally stored in program memory by CCS. This makes for a lot of extra work for the processor, as it has to use the time-consuming and unpipelinable table reads to access program memory. The difficulties of index-to-address conversion still apply. It may well be more efficient overall to cache blocks of such data in data memory (reading a block of say 256 bytes into data memory, then accessing that by pointers) than to pull each and every value off one by one. Generally I suspect many sound generators will want data in blocks anyway, so grabbing a block and sending it as one will often make more sense.

Some hardware assists can help here: DMA type transfers (rare on PICs however due to simplicity of the internal busses), interrupt driven SPI and I2C transfers.

PIC 24s should be a much better bet for this sort of thing. They have wider multipliers that make indexing simpler. They don't have paged data memory, simplifying data memory accessing. They have PSV (Program Space Visibility) which maps a decent chunk of program memory into data memory address space, making reading of blocks of constant data relatively simple. If the compiler can leverage all this then it can make a much better job of generating decent code. I confess I haven't worked with any 24s so I can't test any of this.

Optimisation is all about knowing these limitations and working within them, using the hardware to its best advantage. What's best on one processor might actually be worse on another. Even in Intel x86 processors optimisation, such as in Intel's own optimised libraries, is done on a processor by processor basis, taking into account all the peculiarities of architecture.

RF Developer

PS: for 18s char, int and int8 are all pretty much the same thing and should be treated the same by the compiler.
ckielstra

Joined: 18 Mar 2004
Posts: 3680
Location: The Netherlands

Posted: Tue Feb 14, 2012 3:07 pm

RoGuE_StreaK wrote:
First up though, should I actually be declaring my "snd_array" as "const int8" rather than "const char"? I had a whole heap of trouble originally getting the array into the code, the const char in the format shown was the only way I could get it to work, but I don't believe I tried const int.
Int8 and char are the same in the PIC18 CCS compiler so that doesn't matter.

What I don't understand is why you didn't succeed in creating a large single dimensional array like:
Code:
const char sndStartup[13152]={
0x80,0x7F,0x7F,0x80,0x7F,0x7F,0x81,0x80,0x7E,0x7F,0x80,0x7F,0x7E,0x7D,0x7C,0x83,
0x84,0x87,0x8A,0x82,0x7F,0x81,0x80,0x85,0x83,0x7E,0x82,0x81,0x7B,0x7D,0x7D,0x7F
};
For me this compiles fine.
What is your compiler version number? It is a number like x.yyy at the top of the program list file (*.lst).
What is the error code you got?
Ttelmah

Joined: 11 Mar 2010
Posts: 19447

Posted: Tue Feb 14, 2012 3:58 pm

I think you will find it fails if you try to initialise all the entries. There is a compiler limit that seems to be hit at several thousand characters in a single unbroken initialisation, so where you actually hit it varies with the data format used for the entries, but several people have run into this. A couple of thousand entries seems reliable, but much beyond that can cause problems.

Best Wishes
RoGuE_StreaK

Joined: 02 Feb 2010
Posts: 73

Posted: Tue Feb 14, 2012 6:29 pm

Ttelmah wrote:
array[a][b]
ptr=array;
then *ptr is the same as array[0][0].
incremented it 15 times, and it now addresses array[0][15]. Increment it again, and it now addresses array[1][0].
A two dimensional array is still just a linear block of data in memory
Great, I thought it might operate that way if it was essentially linear in memory, but wasn't sure what would happen when you incremented past 15.
So at the very least, I could use a pointer to the array and simply increment it all the way up to the last sample required; quicker access, and it strips away the bookkeeping of incrementing through both the columns and rows.

Ttelmah wrote:
Separately, even doing this, the code will have to setup the table lookup for each element in turn. The alternative to this, is what FvM is talking about, which is to read a whole row, when required. So (for example):
Code:

   int8 ramrow[16];
   int16 rownum;
   int16 address;
   address=label_address(sndStartup);

   rownum=something; //set the row number you wish to retrieve

   read_program_memory((address+(rownum<<4)),ramrow,16);
   //ramrow now holds the entire row at 'rownum', and can be accessed much
   //faster than the ROM.
If you organise your table, so that the row size is the entire block you want to work with at a time, this is a much more efficient approach.
Eek, I need to research label_address and read_program_memory, but I think I get the gist. So this is a lot quicker again, compared to the pointer/indexing method? It will probably take me a while to contemplate the logistical implications for my routines.
Either way, it's good to pick up these alternate methods of doing things.

Ttelmah wrote:
Add one more thing, on needing to set the array up as two dimensional, this is basically an initialisation limit, with the compiler not being able to handle a single 'unbroken' initialisation of such a massive array.
OK, not an issue now that I know I could just page through the entire thing with one index. And my conversion method (wav to hex method) is reasonably conducive to making this two-dimensional array.

RF_Developer wrote:
Pointers can provide useful efficiency gains, particularly speed of access compared to indexing provided the accesses are sequential, i.e. stepping through arrays. If you require random access then pointers are generally pointless.
I'll always be stepping through sequentially. Some sounds will be a play-once, others will be looping, meaning an index through the array, then reset back to zero and start again. Sound will be also going straight from the end of one into the start of another, eg. a sound will loop, then when something happens externally a flag will be set so at the end of the current sound array it switches cleanly to the start of another array, and starts looping it instead.
eg two sound arrays, "woooh" and "waah"
loop and change on flag: "woohwoohwoohwoohwoohwoohw[flag here]oohwaahwaahwaahwaahwaah"
Is decrementing a pointer OK? I use a couple of the arrays in reverse to give a reversed version of the sound, instead of making a whole new sound array.
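The loop-and-switch behaviour described above can be sketched in plain C. Everything here (the function names, the two tiny "sounds") is invented for illustration; a real version would fetch from program memory inside the 16 kHz interrupt instead:

```c
#include <stdint.h>
#include <stddef.h>

/* Two tiny stand-in "sounds"; real tables would be the const arrays. */
static const uint8_t snd_a[4] = { 1, 2, 3, 4 };
static const uint8_t snd_b[3] = { 9, 8, 7 };

static const uint8_t *cur = snd_a;      /* sound currently looping   */
static size_t pos = 0, len = 4;
static const uint8_t *pending = NULL;   /* queued by the external flag */
static size_t pending_len = 0;

/* Called when the external event fires: the change takes effect only
   at the end of the current loop, so the change-over is clean. */
void queue_sound(const uint8_t *s, size_t n)
{
    pending = s;
    pending_len = n;
}

/* Called once per 16 kHz tick: returns the next sample, wrapping at the
   end of the sound and switching there if a new sound is queued. */
uint8_t next_sample(void)
{
    uint8_t s = cur[pos++];
    if (pos >= len) {                 /* end of this sound */
        pos = 0;
        if (pending) {
            cur = pending;
            len = pending_len;
            pending = NULL;
        }
    }
    return s;
}

/* Reversed playback needs no second table: just walk the same data
   backwards; decrementing a pointer or index is as cheap as incrementing. */
uint8_t sample_reversed(const uint8_t *snd, size_t n, size_t i)
{
    return snd[n - 1 - i];
}
```

With this shape, the "woohwooh...waahwaah" hand-over above is just `queue_sound(snd_b, sizeof snd_b)` at any point during the loop.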

RF_Developer wrote:
It may well be more efficient overall to cache blocks of such data in data memory - reading a block of say 256 bytes of data in to data memory then accessing that by pointers - than to pull each and every value off one by one. Generally I suspect many sound generators will want data in blocks anyway, so grabbing a block and sending a block as one will often make more sense.
I forgot to mention that this is all being performed on the PIC; the values from the sound arrays are applied directly to a PWM duty cycle. From what I can tell from physical tests and MPLAB SIM, it's fetching and populating fine, but the sound routine takes up most of the time between interrupts, so I'm trying to decrease this time to give myself more breathing room for other routines.
That said, just moving to the 24F, with the same crystal, theoretically doubles the speed, but any increase in efficiency can only be A Good Thing™


I'm not sure what broke with my original arrays; I just eventually found that the current method and structure worked, so stuck with it.
John P

Joined: 17 Sep 2003
Posts: 331

Posted: Tue Feb 14, 2012 8:45 pm

When I needed to read large amounts of data from program ROM (on a PIC16F877A) I used a method that's close to cheating, if there is such a thing. First I compiled the program and noted how large the HEX file was. Then I made up a new HEX file offline, containing the data, and I made sure it started at an address slightly higher than the last program address. I went back to the C code, and plugged in the address that my data resided at, compiled it again, and then I hand-edited the compiler's HEX file to add my file after the compiler's output (this wouldn't have been necessary if my programming system didn't do an erase before loading the HEX file). Finally I programmed the chip.

Then when I wanted to read the data, I would first load the EEADR and EEADRH registers with the base of my data area plus any offset that was needed. Then I'd use the Microchip procedure to get the data from program ROM. I didn't quite trust CCS to do this the way I wanted, so:
Code:


  // (eecon1, eeadr, eeadrh, eedata and eedath mapped to the SFRs with #byte)
  eeadrh = rom_addr_high;
  eeadr = rom_addr_low;
  bit_set(eecon1, 7);   // EEPGD: not needed if you don't have to switch from EEPROM
  bit_set(eecon1, 0);   // RD: start the read
  #ASM
    NOP                 // Must insert 2 NOPs here
    NOP
  #ENDASM
  data_from_rom = ((int16)eedath << 8) + eedata;    // 14 bit quantity


Of course if you just want to grab the next word from ROM, you can just increment EEADR and EEADRH without loading a new value, and skip the first 2 lines above:
Code:

  if (++eeadr == 0)
    eeadrh++;


Note that if you want to fool around with HEX files, you have to allow for the fact that 2 bytes in the HEX file correspond to 1 word in the ROM, so addresses are doubled.
ckielstra

Joined: 18 Mar 2004
Posts: 3680
Location: The Netherlands

Posted: Sat Feb 18, 2012 7:16 am

The solution provided by Ttelmah using the read_program_memory() function is already a lot faster than the original code, but still has the disadvantage that data has to be read in multiples of 16 bytes while the original sound data has variable length.

The function read_program_memory() is a wrapper around the hardware registers for reading from program memory, TBLPTR and TABLAT, and reads a datablock of the specified length. The function read_program_eeprom() is similar but reads a fixed length of 1 program word (2 bytes). This is more flexible but returns a word where you want byte access.

Looking at the disassembly for these functions, you see there is a very effective assembly instruction being used which reads one flash memory byte and advances the pointer to the next address (TBLRD*+), all in just two clock cycles. If only you had access to this instruction from C code... Luckily, you can.

Here is a demonstration program which:
- accepts a starting memory address
- accepts an arbitrary data length
- outputs the read data directly to your sound output function.

Because all the work is now done in one large loop, there is less overhead from initialising the registers again and again. And you can specify an arbitrary data length instead of a multiple of 16 bytes.

The loop for writing 1 byte to the output takes just 14 instruction cycles; 1 instruction less when you use fast_io.

Code:
#include <18F458.h>
#FUSES HS, PROTECT,NOWDT, NOBROWNOUT,NOLVP
#use delay(clock=4MHz)

#byte TBLPTRU = GETENV("SFR:TBLPTRU")
#byte TBLPTRH = GETENV("SFR:TBLPTRH")
#byte TBLPTRL = GETENV("SFR:TBLPTRL")
#byte TABLAT  = GETENV("SFR:TABLAT")


const char sndStartup[13152]={
0x80,0x7F,0x7F,0x80,0x7F,0x7F,0x81,0x80,0x7E,0x7F,0x80,0x7F,0x7E,0x7D,0x7C,0x83,
0x84,0x87,0x8A,0x82,0x7F,0x81,0x80,0x85,0x83,0x7E,0x82,0x81,0x7B,0x7D,0x7D,0x7F
};

void OutputSnd(int32 Addr, int16 Length)
{
   // Set the Program Memory read start address
   TBLPTRU = make8(Addr, 2);
   TBLPTRH = make8(Addr, 1);
   TBLPTRL = make8(Addr, 0);

   // Read and Output all sound data
   while (Length > 0)
   {
      // Read 1 byte from Program Memory and advance pointer to next byte
      // Sound data is available for reading in register TABLAT.
      #asm TBLRD*+ #endasm;

      // Output sound data (change to whatever other output function you need).
      output_b(TABLAT);

      Length--;
   }
}

void main()
{
   int32 SoundAddr;
   int16 SoundLen;
   
   SoundAddr = label_address(sndStartup);
   SoundLen = sizeof(sndStartup);
   OutputSnd(SoundAddr, SoundLen);

   for(;;);
}
bkamen

Joined: 07 Jan 2004
Posts: 1611
Location: Central Illinois, USA

Posted: Sat Feb 18, 2012 12:21 pm

RF_Developer wrote:

Optimisation is all about knowing these limitations and working within them, using the hardware to its best advantage. What's best on one processor might actually be worse on another. Even in Intel x86 processors optimisation, such as in Intel's own optimised libraries, is done on a processor by processor basis, taking into account all the peculiarities of architecture.


And if I may add that the PIC18Fs (not sure on the PIC16s, haven't used them in SO long) do have some additional mechanisms to speed memory moves across banks via the indirect addressing registers FSR0-2 and their associated indirect operand registers.

Section 6.4.3.1 for the 18F97J60:
wrote:
Because Indirect Addressing uses a full 12-bit address,
data RAM banking is not necessary. Thus, the current
contents of the BSR and the Access RAM bit have no
effect on determining the target address.


Section 6.4.3.2 for the 18F97J60:
wrote:
In addition to the INDF operand, each FSR register pair
also has four additional indirect operands. Like INDF,
these are “virtual” registers that cannot be indirectly
read or written to. Accessing these registers actually
accesses the associated FSR register pair, but also
performs a specific action on its stored value. They are:
• POSTDEC: accesses the FSR value, then
automatically decrements it by ‘1’ thereafter
• POSTINC: accesses the FSR value, then
automatically increments it by ‘1’ thereafter
• PREINC: increments the FSR value by ‘1’, then
uses it in the operation
• PLUSW: adds the signed value of the W register
(range of -128 to 127) to that of the FSR and uses
the new value in the operation


So for large indexed arrays, the PIC18s, through the use of the hardware 8x8 multiplier plus the FSRs, can make accessing arrays larger than 1 bank (256 bytes) pretty efficient.

(unless I'm totally reading the datasheet wrong)

Cheers,

-Ben
_________________
Dazed and confused? I don't think so. Just "plain lost" will do. :D
ckielstra

Joined: 18 Mar 2004
Posts: 3680
Location: The Netherlands

Posted: Sat Feb 18, 2012 4:31 pm

bkamen wrote:
So for large indexed arrays, the PIC18s, through the use of the hardware 8x8 multiplier plus the FSRs, can make accessing arrays larger than 1 bank (256 bytes) pretty efficient.
True, but the FSR registers are for RAM addressing only; the large constant data array discussed in this thread is located in program memory, i.e. ROM.
For ROM addressing you use the analogous TBLPTR/TBLRD mechanism, as in my posted example code.
bkamen

Joined: 07 Jan 2004
Posts: 1611
Location: Central Illinois, USA

Posted: Sat Feb 18, 2012 6:06 pm

ckielstra wrote:
bkamen wrote:
So for large indexed arrays, the PIC18s, through the use of the hardware 8x8 multiplier plus the FSRs, can make accessing arrays larger than 1 bank (256 bytes) pretty efficient.
True, but the FSR registers are for RAM addressing only; the large constant data array discussed in this thread is located in program memory, i.e. ROM.
For ROM addressing you use the analogous TBLPTR/TBLRD mechanism, as in my posted example code.


I know -- I just wanted to mention it as a supplement to the other ideas about optimisations in this thread.

-Ben
_________________
Dazed and confused? I don't think so. Just "plain lost" will do. :D
Ttelmah

Joined: 11 Mar 2010
Posts: 19447

Posted: Sun Feb 19, 2012 9:13 am

Just one little 'comment' that may help in thinking out this type of storage.

There are two fundamentally 'different' ways of declaring a block of data in the program memory. The first (used here so far) is to just declare a variable as 'const'. This builds a table containing the data, with the code to retrieve the data at its 'head'. Plus side: you don't have to worry about allocating space for it; the compiler does this for you, relocating it if needed, etc. Down side: you don't know 'where' the actual data is! This is where 'label_address' comes in, telling you where the actual data 'table' associated with the variable is placed. You can also read the data, if required, just as if it were in RAM.
The second method is the #ROM declaration. This _just_ puts a table containing the defined data into ROM, without any extra code. Plus side: you know exactly where it is! Down side: you then have to access the elements yourself, and have to work out the locations to put it at. You can, for example, have ten successive 1KB #ROM statements declaring a 10KB block of data, with no overhead, filling the whole of a 10KB block of your chip's ROM (assuming the chip has this much ROM...).
If one is going 'DIY' on the access code for speed, then it is probably worth switching to a #ROM declaration for the data.

Worth also realising that you can 'encapsulate' your fetching code, so you just call a routine with an address and it retrieves an entire block of X bytes around the specified location; then, if you access something it has already fetched, it just gets the byte from RAM rather than reading the program memory again. This is how disk accesses are done: you don't have to worry that the block size is (say) 512 bytes; you just fetch the byte at location 12345 in a file, and the code automatically reads a sector and returns the required byte. If you then ask for the next byte, it returns it from the buffered copy in RAM.
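That buffered-fetch idea can be sketched in standard C. Here `read_block()` is an invented stand-in for a bulk fetch such as CCS's read_program_memory(), and the ROM contents and 16-byte block size are made up for illustration:

```c
#include <stdint.h>
#include <string.h>

#define BLOCK 16

/* Stand-in for the chip's program memory (unlisted bytes are 0);
   read_block() plays the role of the expensive bulk ROM read. */
static const uint8_t rom[64] = { [0] = 10, [17] = 20, [18] = 21, [40] = 30 };

static void read_block(uint16_t addr, uint8_t *dst, uint8_t n)
{
    memcpy(dst, rom + addr, n);
}

static uint8_t  cache[BLOCK];
static uint16_t cache_base = 0xFFFF;  /* impossible base: nothing cached yet */

/* Fetch one byte by address; refills the 16-byte cache only on a miss,
   so sequential reads hit the "ROM" once per block, not once per byte. */
uint8_t fetch(uint16_t addr)
{
    uint16_t base = addr & (uint16_t)~(BLOCK - 1);
    if (base != cache_base) {
        read_block(base, cache, BLOCK);
        cache_base = base;
    }
    return cache[addr - base];
}
```

The caller just asks for byte N, exactly as with the disk-sector analogy above; the routine decides for itself when a real ROM read is needed.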


Best Wishes
Page 1 of 2