How to detect Stack Overrun (possible variable corruption)

lukeevanslx · Joined: 11 Jun 2012 Posts: 14

How do I detect stack overrun (e.g. the stack is corrupting variables)?
What symbols am I looking for in the .lst file?

Or does CCS actually use the stack for auto variables, or are auto allocations just a scratchpad area of RAM calculated to the proper size?

(I tried searching "stack" in the help but no good hits. Perhaps I can be directed to the correct topic in the manual?)
Thanks

RF_Developer · Joined: 07 Feb 2011 Posts: 839

You don't check for stack overflow. As far as I am aware there is no way to do so, at least on the 16s and 18s. There may be on the 24s and 32s, but don't hold your breath on that.

On the 16s and 18s, and probably the 24s, the stack is a hardware stack that is NOT implemented in data memory, so it is not going to overwrite variables on overflow. It is also far too small to store any variables: it more or less stores return addresses only. This is why recursion is difficult, if not practically impossible, on the lower/mid-range PICs. The 32s are likely to be different as they are MIPS based, and I have little experience of the 24s/dsPICs.

To check the stack usage, look at the top of the .lst file. Here is an example from one of my projects:

Ttelmah · Joined: 11 Mar 2010 Posts: 19447

Ongoing:

The PIC18 chips have the ability to trigger a reset on stack overflow (which can then be detected with 'restart_cause'). However, because the stack doesn't store variables, it would normally only get triggered through code/memory corruption (an atomic particle, for example) or a problem with the code design.
There is also a caveat on checking the stack size used in the listing: remember that if you are using the ICD, _this_ uses stack space, and if you have a bootloader, this can also add to the stack space used. You need to work out how much stack really is available before using the listing.
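As a rough illustration of what checking for a stack fault after a reset can look like, here is a minimal sketch that decodes a raw STKPTR value. It assumes the register layout of the classic PIC18 parts (STKFUL in bit 7, STKUNF in bit 6, stack depth in bits 4..0); newer families move these flags elsewhere, so check the datasheet for the actual device. The type and function names are hypothetical, and the sketch deliberately avoids the restart_cause() return constants, since those come from the device header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of STKPTR on classic PIC18 devices.
   Verify against the datasheet of the part in use. */
#define STKPTR_STKFUL  0x80u  /* bit 7: stack full / overflowed */
#define STKPTR_STKUNF  0x40u  /* bit 6: stack underflowed */
#define STKPTR_SP_MASK 0x1Fu  /* bits 4..0: current stack depth */

typedef struct {
    bool    overflow;
    bool    underflow;
    uint8_t depth;
} stack_status_t;

/* Decode a raw STKPTR value captured early after a reset. */
stack_status_t decode_stkptr(uint8_t stkptr)
{
    stack_status_t s;
    s.overflow  = (stkptr & STKPTR_STKFUL) != 0;
    s.underflow = (stkptr & STKPTR_STKUNF) != 0;
    s.depth     = (uint8_t)(stkptr & STKPTR_SP_MASK);
    return s;
}
```

On real hardware the value would be read from the STKPTR register at the very start of main(), before anything clears the flags, and the stack fault reset itself has to be enabled in the configuration bits.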

Best Wishes

FvM · Joined: 27 Aug 2008 Posts: 2337 Location: Germany

With the PIC18's dedicated hardware stack, a stack overrun or underrun causes a reset, as you can review in the processor datasheet.

The PIC24 has its stack in general RAM, but a stack overflow causes an exception and, by default, a reset.

jeremiah · Joined: 20 Jul 2010 Posts: 1337

Ttelmah · Joined: 11 Mar 2010 Posts: 19447

Realistically, the most likely cause of 'corrupted variables' is a pointer overrun.

You have to remember the PIC has no hardware memory management/protection, so there is nothing to stop you from writing to the wrong address in memory. So (for instance):
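A minimal sketch of this kind of overrun, in portable C and not taken from anyone's project: the struct, the variable names and the byte counts are all hypothetical. A clear routine is asked for one byte more than the buffer holds, and the extra write lands on the variable that happens to sit next to it in RAM.

```c
#include <stddef.h>
#include <stdint.h>

/* Two variables that sit next to each other in RAM. All members are
   one byte wide, so there is no padding between them. */
struct ram_block {
    uint8_t rx_buffer[4];
    uint8_t error_count;   /* innocent neighbour */
};

/* Clears 'count' bytes starting at 'p'. Nothing checks 'count'. */
void clear_bytes(uint8_t *p, size_t count)
{
    while (count--) {
        *p++ = 0;
    }
}

/* Bug: asks for 5 bytes although rx_buffer only holds 4, so the fifth
   write lands on error_count. The pointer is taken from the whole
   block only to keep this sketch well-defined standard C; real code
   would pass rx_buffer and get the same effect on a PIC. */
void reset_rx_buffer_buggy(struct ram_block *ram)
{
    clear_bytes((uint8_t *)ram, 5);
}
```

Nothing faults and nothing is reported. The only symptom is that error_count silently changes, which is exactly what 'corrupted variables' looks like from the outside.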

asmboy · Joined: 20 Nov 2007 Posts: 2128 Location: albany ny

To add to what Ttelmah has written,
another common memory cruncher is
writing to an array element at a greater index
than was allocated. Or incorrectly storing to a circular index buffer etc.

CCS does NOT do bounds checking.
As a programmer you need to pay attention to things like that, as there is NO RUNTIME CHECKING.

In runtime, you are performing without a net at all times.
The smallest slip-up will throw you to the canvas.
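Since the compiler will not do it, the checks have to be written by hand. The following is a minimal sketch in portable C, with hypothetical names and a hypothetical buffer size, of an explicitly wrapped circular buffer index and a checked array write.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u   /* hypothetical buffer size */

typedef struct {
    uint8_t data[RING_SIZE];
    uint8_t head;      /* next slot to write, always < RING_SIZE */
} ring_t;

/* Store one byte, then wrap the index explicitly so it can never
   run past the end of the array. */
void ring_put(ring_t *r, uint8_t value)
{
    r->data[r->head] = value;
    r->head++;
    if (r->head >= RING_SIZE) {
        r->head = 0;
    }
}

/* Checked array write: refuse an out-of-range index instead of
   writing over whatever lives after the array. */
bool array_write(uint8_t *array, uint8_t size, uint8_t index, uint8_t value)
{
    if (index >= size) {
        return false;
    }
    array[index] = value;
    return true;
}
```

The comparison costs a few instructions per access, which is usually cheap next to the time spent hunting a variable that changes for no visible reason.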


Douglas Kennedy · Joined: 07 Sep 2003 Posts: 755 Location: Florida

Atomic particles messing with the PIC's movement of electrons is low on the list of causes, since most often it is human error. As suggested, first you look at your code twice. If that doesn't find the error and you have a PIC with restart on stack error, then CCS has __LINE__. You assign it to a variable in strategic places, and in the restart clause you send it to a monitor (pre-crash RAM is preserved on a restart, assuming it wasn't a power fail). I use ICD_U64 and the CCS debugger for this. Now you have some idea as to the last line number that your code passed before crashing. It's not perfection, but it can narrow it down. Now, it is possible an atomic particle alters your brain waves when troubleshooting and you miss the whole thing... so there is no perfect answer.
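The breadcrumb idea can be sketched roughly as follows. This is portable C with hypothetical names (CHECKPOINT, last_checkpoint, process_packet), not code from the post. One assumption matters on the real chip: the variable must not be cleared or initialised at startup (for example by a RAM-zeroing option), otherwise the pre-crash value is gone before it can be reported.

```c
#include <stdint.h>

/* Last source line the code is known to have reached. On the target
   this must survive a reset, so it must not be zeroed at startup. */
static volatile uint16_t last_line;

/* Drop a breadcrumb: record the current source line. */
#define CHECKPOINT() (last_line = (uint16_t)__LINE__)

/* Read back the breadcrumb, e.g. in the restart handling code. */
uint16_t last_checkpoint(void)
{
    return last_line;
}

/* Hypothetical work function with breadcrumbs at strategic places. */
void process_packet(void)
{
    CHECKPOINT();
    /* ... parse header ... */
    CHECKPOINT();
    /* ... store payload ... */
    CHECKPOINT();
}
```

After a restart caused by a stack fault, sending last_checkpoint() to a monitor gives the last checkpoint line passed before the crash.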

Ttelmah · Joined: 11 Mar 2010 Posts: 19447

Yes, though I have seen it. Only on a couple of machines fitted into a site where mining had gone into some fairly radioactive rocks, but it does happen. Also, anyone involved with CCDs will testify just how often they can record a stray particle. However, put it about fiftieth 'down' the list of things likely to go wrong.

However, the key words in the original post are 'stack is corrupting variables'. No: on the PIC16/18s, stack overflow _will not cause variable corruption_. There is no variable stack as such.

'Top ten', in order of likelihood:

Incorrect pointer count.
Incorrect array index.
Incorrect sizing of array/pointer passed to a function - this is both a power and a danger of C, where you can pass (say) a pointer to an int8 to a function, and tell the function that it is a pointer to something larger, then find yourself talking to values far beyond the end of the physical array.
Incorrect handling of malloc.
Not disabling interrupts when passing multiple byte values to/from an interrupt handler.
Noisy PSU. RAM corruption through poor supply regulation.
Compiler error (there have been a few, particularly when handling complex structures crossing page boundaries).
RF induced problems.
Spikes from lines into the PIC. Particularly MCLR. This _does not_ have the protection diodes present on other pins, and if used as an input, these should be provided externally, or spikes just a little over Vdd, can cause RAM corruption.
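The third item on the list, a pointer to something small being treated as a pointer to something larger, can be sketched like this. It is portable C with a hypothetical struct and hypothetical function names, and the byte value 0x11223344 is chosen only so that every written byte is non-zero.

```c
#include <stdint.h>
#include <string.h>

struct settings {
    uint8_t gain;      /* the only byte the caller meant to change */
    uint8_t offset;
    uint8_t mode;
    uint8_t checksum;
};

/* Believes 'dest' points at four bytes of storage and writes all four. */
void store_u32(void *dest, uint32_t value)
{
    memcpy(dest, &value, sizeof value);
}

/* Bug: the intent is to update 'gain' (one byte), but the callee has
   been told it is writing a 32-bit value, so offset, mode and checksum
   are overwritten as well. The struct pointer is passed only to keep
   this sketch well-defined standard C; real code would pass &s->gain. */
void set_gain_buggy(struct settings *s)
{
    store_u32(s, 0x11223344u);
}
```

The compiler accepts this without complaint because the size information is lost as soon as the pointer is passed as a generic pointer or cast.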

Best Wishes

Douglas Kennedy · Joined: 07 Sep 2003 Posts: 755 Location: Florida

Ttelmah says the following is a frequent cause of trouble.

Ttelmah · Joined: 11 Mar 2010 Posts: 19447

Disabling the interrupts and copying is the way to go. The time needed to move four bytes is only perhaps eight machine cycles, so unless the interrupt can receive a fifth byte, and not buffer it, in this time, it gives 100% coverage. An alternative is to use alternating buffers: as soon as a packet of bytes is received, switch to the second buffer and flag this, then in the main code move the bytes from the buffer not now used, which, since it is not the buffer in use, doesn't need interrupts disabled.
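The alternating buffer approach can be sketched roughly as below, in portable C with the interrupt handler modelled as an ordinary function. All names and the four byte packet length are assumptions for illustration. The sketch relies on the main code fetching a finished packet before the next complete packet has arrived, otherwise the buffers swap again underneath it.

```c
#include <stdbool.h>
#include <stdint.h>

#define PACKET_LEN 4u   /* hypothetical packet size */

static volatile uint8_t buffers[2][PACKET_LEN];
static volatile uint8_t fill_index;    /* buffer the ISR is filling */
static volatile uint8_t fill_count;    /* bytes received so far */
static volatile bool    packet_ready;  /* set by ISR, cleared by main */

/* Called from the (hypothetical) receive interrupt for each byte. */
void isr_receive_byte(uint8_t b)
{
    buffers[fill_index][fill_count++] = b;
    if (fill_count == PACKET_LEN) {
        fill_count = 0;
        fill_index ^= 1u;      /* switch to the other buffer */
        packet_ready = true;   /* flag the completed packet */
    }
}

/* Called from main code. Copies from the buffer the ISR is NOT
   currently filling, so interrupts stay enabled throughout. */
bool main_fetch_packet(uint8_t out[PACKET_LEN])
{
    if (!packet_ready) {
        return false;
    }
    packet_ready = false;
    uint8_t done = fill_index ^ 1u;
    for (uint8_t i = 0; i < PACKET_LEN; i++) {
        out[i] = buffers[done][i];
    }
    return true;
}
```

The flag and the index are single bytes, so each is read or written in one access on an 8-bit PIC, which is what lets this work without a critical section.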

Best Wishes

Douglas Kennedy · Joined: 07 Sep 2003 Posts: 755 Location: Florida

Thanks Ttelmah,
I believe you are confirming that the interrupt pending flags are set even when interrupts are disabled, something I assumed to be true, and that the risk is only that the hardware buffer overflows while main has the interrupt blacked out. So the latency incurred by the instructions needed to get the ISR up to the point where it can pull in data and get back out is in fact more of an issue than the blackout time in main (a few instructions to move 4 bytes). This is good news, since it is deterministic: data can't arrive faster than the CAN bus baud rate, and the ISR call instructions are also determinable, so in my case, with my baud rate, it is mathematically impossible to lose data. The CAN bus is purported to lose only 1 bit every hundred years of continuous use (this assumes the hardware, e.g. oscillators, wiring etc., is perfect for 100 years). I didn't want to destroy this reliability with my coding.

Ttelmah · Joined: 11 Mar 2010 Posts: 19447

Oh, yes.
This is in the chip's data sheet, and is why you can leave interrupts disabled and poll the interrupt flag as an alternative to using an interrupt handler. This is also why you have recommendations like clearing an interrupt before enabling it, for things like the 'edge' interrupts, where the act of programming the edge used can set the interrupt flag even though the interrupt is disabled.
The 'interrupt enable' bits only control whether the interrupt flag will result in an interrupt call. The sequence (for a PIC18) is:

Interrupt flag - set when the hardware event happens. Doesn't care about any of the other bits.

Interrupt enable - sets whether each flag is connected to the interrupt hardware

Priority bit - sets which of the two interrupt hardware sections each interrupt signal is connected to. Beware that INT_EXT does not have this bit and always connects to the high priority hardware if priorities are enabled.

GIEH/GIEL enable the hardware call logic for each of the two sources.

An actual interrupt 'call' only occurs if the entire 'tree' of bits is set correctly, but the very first one (the flag), doesn't care about any of the others.
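The 'tree' can be written down as a small truth function. This is only a model in portable C, not register access code. It assumes priorities are enabled (IPEN = 1), uses GIEH and GIEL as the names of the two global enable bits, and assumes that clearing GIEH blocks the low priority path as well, which is how the classic PIC18 datasheets describe it. The struct and function names are hypothetical.

```c
#include <stdbool.h>

/* Per-source bits. */
typedef struct {
    bool flag;       /* set by the hardware event, regardless of the rest */
    bool enable;     /* connects the flag to the interrupt hardware */
    bool high_prio;  /* selects the high or low priority section */
} irq_source_t;

/* Global bits, with priorities assumed enabled (IPEN = 1). */
typedef struct {
    bool gieh;       /* high priority call logic enabled */
    bool giel;       /* low priority call logic enabled */
} irq_global_t;

/* True if this source would actually cause an interrupt call. */
bool irq_will_vector(irq_source_t src, irq_global_t g)
{
    if (!src.flag || !src.enable) {
        return false;
    }
    if (src.high_prio) {
        return g.gieh;
    }
    /* Assumption: the low priority path needs GIEH set as well. */
    return g.gieh && g.giel;
}

/* Polling needs only the flag, none of the enable bits. */
bool irq_pending(irq_source_t src)
{
    return src.flag;
}
```

The second function is the polling case: with everything else disabled, the flag still reports that the event happened.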

Best Wishes

asmboy · Joined: 20 Nov 2007 Posts: 2128 Location: albany ny

re: disabling ints

for quite some time i have been making use of
timer 'tix' that are incremented as a 32 bit int by a timer ISR.

when i want to know the actual tix count -
i never disable ints in order to read it.

this code, while not hyper efficient, has never failed, that i know of.
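A common technique for this kind of read, sketched here as an assumption and not necessarily the routine asmboy uses, is to read the counter twice and retry until both reads agree. If the timer ISR fires in the middle of a multi-byte read, the two copies differ and the loop simply goes round again. The names below are hypothetical, and the ISR is modelled as an ordinary function.

```c
#include <stdint.h>

/* 32-bit tick counter, incremented by the timer ISR. On an 8-bit PIC
   a read of this takes several instructions and can be interrupted. */
static volatile uint32_t tix;

/* Stand-in for the timer interrupt handler. */
void timer_isr_tick(void)
{
    tix++;
}

/* Read the counter without disabling interrupts: accept the value
   only when two consecutive reads match. */
uint32_t read_tix(void)
{
    uint32_t first;
    uint32_t second;
    do {
        first  = tix;
        second = tix;
    } while (first != second);
    return first;
}
```

This costs at least two reads and a compare per call, so it is not the fastest option, but it never blocks the interrupt, provided ticks arrive far less often than the time one loop pass takes.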