JamesW
Joined: 23 Apr 2007 Posts: 91 Location: Rochester, England
|
Will a processor crash, stop interrupts being serviced? |
Posted: Fri Jul 06, 2012 3:23 am |
|
|
Hi Folks,
Working on a device at the moment that keeps locking up. (Processor is an 18F26K80).
I'd originally enabled a watchdog on the devices, but once we'd built 24 of them - we noticed that they appeared to be resetting at spurious times.
Having increased the watchdog delay considerably, they appear to be crashing (or at least the main code loop has stopped executing).
However - my interrupt service routines are still running. (I know this because I'm now toggling a pin in my main loop, and another in the timer ISR.) The main loop has stopped, and the timer is still going. (Not to mention one of them is moving a motor a set number of steps - and the motor will move this number of steps, even after a crash.)
I'm trying to work out if the code has just got stuck somewhere in a loop, or the processor has crashed.
If a processor has crashed (due to stack or other) will the Interrupts stop being handled?
Cheers
James |
|
|
RF_Developer
Joined: 07 Feb 2011 Posts: 839
|
Re: Will a processor crash, stop interrupts being serviced? |
Posted: Fri Jul 06, 2012 4:05 am |
|
|
My thoughts, as a lot of this set alarm bells ringing...
JamesW wrote: |
Working on a device at the moment that keeps locking up. (Processor is an 18F26K80).
|
What does "locking up" mean? I'm not at all sure what's really happening here.
Quote: |
I'd originally enabled a watchdog on the devices, but once we'd built 24 of them - we noticed that they appeared to be resetting at spurious times.
Having increased the watchdog delay considerably, they appear to be crashing (or at least the main code loop has stopped executing).
|
The watchdog is there to reset the processor if something goes wrong, such as the processor "locking up", or executing incorrect code, or getting stuck in a loop somewhere. It can also mean there's a fundamental hardware weakness, such as poor decoupling, noisy supplies and environment, especially with RF, or maybe as here, with DC/AC power as may be associated with motors.
The watchdog should never time out. The fact that it is timing out means something is wrong. Extending the watchdog timeout period amounts to ignoring that warning, which is not a good thing.
Quote: |
However - my interrupt service routines are still running. (I know this because I'm now toggling a pin in my main loop, and another in the timer ISR.) The main loop has stopped, and the timer is still going. (Not to mention one of them is moving a motor a set number of steps - and the motor will move this number of steps, even after a crash.)
|
Any number of things may have happened, all of which are fundamental programming and hardware design issues. If this is CCS code and the main loop ever exits (it's usually a while(true) or for(;;), which should never exit), then there's an unseen sleep after main exits. Interrupts may well still operate, but the main loop is no more: you've run out of main code to execute.
Another worrying point is that you mention your interrupt generates "a set number of steps" for a motor. That means the ISR has almost certainly got delays of the order of milliseconds in it. This is bad and shows a lack of understanding/experience of ISRs and embedded programming in general. It's a very common beginner's fault; most if not all of the ISRs we see here suffer from this. That alone should prompt a total rewrite of the code.
Quote: |
I'm trying to work out if the code has just got stuck somewhere in a loop, or the processor has crashed.
|
Get the debugger out and start live debugging with breakpoints and single stepping. That's what it's for. You should fairly quickly find out what's going on. It may not be that the processor has "crashed" as such, whatever that means in this context. Much more likely that it's executing buggy firmware.
Quote: |
If a processor has crashed (due to stack or other) will the Interrupts stop being handled?
|
Either the crash causes a reset, and so your code restarts, or the watchdog times out and causes a reset. In both cases, restart_cause() can be used in your start-up code to determine if the start is normal or dodgy. It's also possible that the processor has reset but your code doesn't handle a warm start well.
All this is pretty basic embedded engineering. One trouble is that many engineers these days are expected to be able to write C for any project without any training or experience whatsoever. Is it any wonder that the code they tend to generate is all over the place?
RF Developer |
|
|
JamesW
Joined: 23 Apr 2007 Posts: 91 Location: Rochester, England
|
|
Posted: Fri Jul 06, 2012 4:27 am |
|
|
Hi,
Thanks for the tips - I will, at this juncture, point out that I have a degree in digital electronics, run my own electronics business, and have been programming embedded software and device drivers for over 20 years! (in PIC, STM32 & VxWorks!)
With regards to your concerns regarding an isr driving a stepper motor, this is a simple timer ISR that is running relatively slow, that checks a long int value. If the value is greater than 0, it decrements the value by 1, and does an output toggle on the motor clock pin.
There are NO delay ms lines, anywhere in the code - this is running flat out.
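For readers following along, the ISR described above might look roughly like the sketch below. It is a host-side model in portable C rather than the actual firmware: `steps_remaining`, `motor_clock_pin`, `irq_masked`, `timer_isr` and `request_steps` are made-up names, and the pin and the interrupt mask are plain variables so that it runs on a PC. One assumption worth stating: a CCS `long int` is normally 16 bits, and an 8-bit PIC writes it a byte at a time, so the main-loop side holds the interrupt off while it changes the counter (on the target that would be `disable_interrupts(GLOBAL)` and `enable_interrupts(GLOBAL)` around the write).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; the pin and the interrupt mask are plain
   variables so the sketch can run on a PC. */
static volatile uint16_t steps_remaining = 0;  /* shared between main and ISR */
static volatile bool motor_clock_pin = false;  /* stands in for the output pin */
static volatile bool irq_masked = false;       /* stands in for the global enable */
static unsigned edges_sent = 0;                /* instrumentation for the sketch */

/* Body of the timer interrupt: no delays, at most one edge per tick. */
static void timer_isr(void)
{
    if (irq_masked)
        return;                    /* models the interrupt being held off */
    if (steps_remaining > 0) {
        steps_remaining--;
        motor_clock_pin = !motor_clock_pin;
        edges_sent++;
    }
}

/* Main-loop side: the 16-bit write is two byte writes on an 8-bit core,
   so keep the ISR out while the counter is being changed. */
static void request_steps(uint16_t n)
{
    assert(!irq_masked);           /* no nested masking in this sketch */
    irq_masked = true;
    steps_remaining = n;
    irq_masked = false;
}

/* Fire `ticks` timer interrupts and report how many edges were produced. */
static unsigned run_ticks(unsigned ticks)
{
    unsigned before = edges_sent;
    for (unsigned i = 0; i < ticks; i++)
        timer_isr();
    return edges_sent - before;
}
```

Because the ISR only counts and toggles, it keeps stepping the motor for as long as the timer interrupt fires, whether or not the main loop is still running, which matches what was observed.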
There is a standard main loop in the code - but this main loop is stopping executing - the question I have asked is pretty simple, and is aimed purely at tracking down the bug.
If for whatever reason my code is crashing the processor, will it disappear off with the pixies, and stop servicing ISRs?
My code is still servicing the five ISRs in use (timer1, timer2, RDA1, RDA2 & TBE2 - this I have verified by toggling hardware lines in various ISRs). But my main loop has stopped looping.
I am trying to track down whether the code has not actually crashed, but is stuck somewhere in a sub loop. This will give me a bit more of a pointer.
Restart cause does not give me an answer, as in order to get this result I have to manually restart the processor - and so get a normal power up / MCLR_FROM_RUN result code.
Sorry for the confusion
James |
|
|
JamesW
Joined: 23 Apr 2007 Posts: 91 Location: Rochester, England
|
|
Posted: Fri Jul 06, 2012 4:38 am |
|
|
Just to also add to the confusion, the PIC18F26K80 shares its second USART (which is used to supply status information to the user) with PGC and PGD.
Hence if I enable the debugger, I won't be testing a true system. |
|
|
Mike Walne
Joined: 19 Feb 2004 Posts: 1785 Location: Boston Spa UK
|
|
Posted: Fri Jul 06, 2012 4:47 am |
|
|
OK, you think that your main() code is getting lost.
I'm assuming that your main() executes a series of functions in a tight loop.
In that case make each of your functions generate a unique series of pulses on any spare I/O pins you may have.
That way you should be able to track which functions are operating and at what stage it all stops.
Once you know which function performed last, you can go down a level to its sub-functions.
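With only a couple of spare pins, one way to sketch this is a 2-bit stage code instead of pulse trains. The example below is a host-side model in portable C under that assumption: `trace_stage`, `read_stage`, the two pin variables and the four placeholder stage functions are all hypothetical, and on the target the pin writes would be `output_high()` / `output_low()` calls on the spare pins.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: two variables stand in for the two spare pins. */
static uint8_t pin_a = 0, pin_b = 0;

/* Put a 2-bit stage number (0..3) on the pins before each function runs. */
static void trace_stage(uint8_t stage)
{
    assert(stage < 4);             /* two pins can only encode four stages */
    pin_a = stage & 1u;
    pin_b = (stage >> 1) & 1u;
}

/* What a scope or logic analyser would read back after a lock-up. */
static uint8_t read_stage(void)
{
    return (uint8_t)(pin_a | (pin_b << 1));
}

/* Placeholder stages; the third one models the function that never returns. */
static int read_gps(void)    { return 1; }
static int decode_time(void) { return 1; }
static int move_hands(void)  { return 0; }   /* 0 = "got stuck here" */
static int send_status(void) { return 1; }

/* One pass of a main loop; stops at the first stage that fails to return. */
static uint8_t main_loop_pass(void)
{
    int (*const stages[4])(void) = { read_gps, decode_time, move_hands, send_status };
    for (uint8_t i = 0; i < 4; i++) {
        trace_stage(i);
        if (!stages[i]())
            break;                 /* on the target this would simply hang */
    }
    return read_stage();
}
```

Because the pins are set before each call, the code left on them after a lock-up names the function that was entered and never came back. Once that is known, the same four codes can be reused one level down inside it.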
As I understand it, the ISRs will operate until you disable them. So, if you're stuck in a loop (with or without pixies) the ISRs should still work.
Mike |
|
|
JamesW
Joined: 23 Apr 2007 Posts: 91 Location: Rochester, England
|
|
Posted: Fri Jul 06, 2012 4:57 am |
|
|
Hi Mike,
At this current moment - I'm not sure, and as is typical the unit stops working once in a blue moon (It stopped at some point overnight).
In a nutshell the unit is doing the following
- Controlling a set of clock hands using 2 stepper motors
- reading standard NMEA0183 packets from a GPS receiver into the serial port
- Decoding the packet, doing a shed load of maths on it to convert the time from UTC into local time with summertime correction on it.
- moving the hands to where they should be, and keeping track of time.
- Sending status information back to the user on uart 2
As the crash occurs so infrequently, I'm trying to work out if this is a major system crash (div by 0, Program count error, stack overflow etc..) Or if it is just sitting somewhere waiting for something to happen that hasn't. (I have been through the code, and can't see any obvious places)
If we've actually crashed for some reason, will the interrupts stop happening?
Thanks
James |
|
|
temtronic
Joined: 01 Jul 2010 Posts: 9225 Location: Greensville,Ontario
|
|
Posted: Fri Jul 06, 2012 5:03 am |
|
|
Mike's right on here... take a page from the POST program that every PC has. Set a set of LEDs (or send a terminal a msg) to indicate which step of 'main' is executing. When it fails, it is the NEXT function that has the problem. 'Old school' diagnostics, but it works well. |
|
|
JamesW
Joined: 23 Apr 2007 Posts: 91 Location: Rochester, England
|
|
Posted: Fri Jul 06, 2012 5:10 am |
|
|
The two spare pins on the processor are now wired, programmed and ready to light the way.
Downside - it's been running now for 20 minutes, and hasn't missed a beat! This could take a while.
Thanks for your help chaps. |
|
|
RF_Developer
Joined: 07 Feb 2011 Posts: 839
|
|
Posted: Fri Jul 06, 2012 5:52 am |
|
|
As this is a GPS synced clock you need it to run 24/7/52.
Quote: |
...major system crash (div by 0, Program count error, stack overflow etc..)
|
As you'll know from your extensive experience, the small to mid range PICs, 16s & 18s, don't have hardware division - it's always done in software and hence cannot "crash" the processor. Also, many don't have any hardware stack checking. Those that do generate a reset, and hence a checkable restart_cause, on overflow. Anyway, you'll have looked at the resource usage stats at the top of the listing and checked that the worst case stack usage is within the capabilities of the processor, so stack overflow is out of the running (still possible if you've not done the checks, of course). Program count error... not entirely sure what that means, but 18 series PICs don't have memory protection, so accessing a non-implemented area of program memory is possible... it's not well documented, if indeed it's documented at all, but clearly some unused opcodes are interpreted as nops. It's also possible that some are interpreted as invalid opcodes and possibly cause a reset, in which case restart_cause may help you.
In many processors, especially RISC types, unused opcodes are not decoded meaningfully, as implementing such logic would be wasteful and may slow the processor. Hence odd things may happen. Then again, the most likely unexpected opcodes, all Fs or all 0s, are both definitely interpreted as nops. Hence unprogrammed flash is nops all the way to the end, at which point the PC wraps and goes back to the start of program memory, which is of course the reset vector, so the PIC will restart. It's difficult to "crash" the PIC processor in a way that means it can't recover. Much easier to "crash" software, of course.
Anyway, it seems likely that your code is looping forever somewhere: some logical/program flow bug. My gut feeling is perhaps something in the NMEA message parsing; maybe there's something in the message stream that it doesn't handle well but which it only occasionally gets from the GPS, some sort of unhandled error response or odd, not understood string. The recent leap second caused havoc online when systems couldn't understand why timeservers reported there was a 61st second (at 23:59:60) in a minute. I suspect that may have caused many GPS time based systems some "issues".
It's a pity, and a significant practical issue, that you can't use the debugger due to a resource clash. Clearly that's something you'll want to avoid on your future projects: there's little to be gained by knowingly tying your hands behind your back. So you'll have to fall back on flags as the others have suggested: try to narrow down step by step where it's going wrong.
Good luck.
RF Developer |
|
|
Mike Walne
Joined: 19 Feb 2004 Posts: 1785 Location: Boston Spa UK
|
|
Posted: Fri Jul 06, 2012 6:01 am |
|
|
Don't know if this will help.
In the dim and distant past I had to diagnose a rare intermittent on a UART. It was only known that there was a problem after the event. A digital 'scope was set up in pre-trigger mode. The 'scope was triggered after a faulty transmission. In that case the trace data (all of it before the trigger) was then saved. Only the data for faulty transmissions was recorded for analysis.
You could do something similar. Put your watchdog back in. Send your diagnostic LED data to a 'scope set in pre-trigger mode (i.e. trigger set to right-hand side of screen). Use the watchdog restart to trigger the 'scope and save or analyse the pre-fault traces.
[ Or you could use the USART, on a temporary basis, to send out an ASCII character as it enters & leaves each function, and a sensible message on re-start. Save the messages to a PC. You then only have to trawl through looking for the characters ahead of the restart messages.]
If the problem is with parsing, try sending your own messages at a higher than normal rate to speed up testing. You could include difficult and/or garbage messages to test the handling.
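For that kind of test it helps to be able to build sentences with valid and deliberately broken checksums. A minimal sketch in portable C, assuming the usual NMEA 0183 framing (`$`, payload, `*`, two hex digits, with the checksum being the XOR of every character between `$` and `*`); the function names are made up for illustration and this is not taken from the code under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* XOR of all payload characters (everything between '$' and '*'). */
static uint8_t nmea_checksum(const char *payload)
{
    uint8_t sum = 0;
    while (*payload)
        sum ^= (uint8_t)*payload++;
    return sum;
}

/* Build "$payload*HH\r\n" into out; returns 0 on success, -1 if out is too small. */
static int nmea_build(const char *payload, char *out, size_t out_len)
{
    int n = snprintf(out, out_len, "$%s*%02X\r\n", payload,
                     (unsigned)nmea_checksum(payload));
    return (n > 0 && (size_t)n < out_len) ? 0 : -1;
}

/* Value of one hex digit, or -1 if the character is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Returns 1 if the sentence is framed as "$...*HH" and the checksum matches. */
static int nmea_valid(const char *s)
{
    const char *star;
    uint8_t sum = 0;
    int hi, lo;

    if (s[0] != '$' || (star = strchr(s, '*')) == NULL)
        return 0;
    for (const char *p = s + 1; p < star; p++)
        sum ^= (uint8_t)*p;
    hi = hex_val(star[1]);
    lo = (hi >= 0) ? hex_val(star[2]) : -1;   /* do not read past a short tail */
    return hi >= 0 && lo >= 0 && sum == (uint8_t)(hi * 16 + lo);
}
```

A test rig on the PC side could then feed the unit a stream of good sentences from `nmea_build()` mixed with truncated ones, ones with a flipped character, and plain garbage, at a much higher rate than a real receiver would send them.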
Mike |
|
|
SherpaDoug
Joined: 07 Sep 2003 Posts: 1640 Location: Cape Cod Mass USA
|
|
Posted: Fri Jul 06, 2012 6:59 am |
|
|
James, as another degreed engineer with decades of embedded development experience I think your problem is mainly one of mindset.
The word "crash" as applied to microcontrollers is too vague and should be banished from this conversation. The things that stop a uC from executing their program as they see it, such as loss of VCC, bad solder joints, thermal fracture of the die, die mask errors, etc. are rare and don't seem relevant to this problem. Even bits flipped by cosmic rays don't stop the processor from running something.
If your PIC is responding to interrupts it must have VCC and be clocking. So it is running some code somewhere. You just have to find out where. If it has run off the end of main() it will be stuck eternally executing sleep. If there is a hardware Reset issue it will be running initialization code between resets. Note interrupts generally default to enabled so you could still execute interrupts but never get into main(). The two spare pins are what you need to find where your PIC is going. And of course it is always hard to fix something that won't stay broken!
Your PIC is running some code somewhere. You just need to find out where. _________________ The search for better is endless. Instead simply find very good and get the job done. |
|
|
JamesW
Joined: 23 Apr 2007 Posts: 91 Location: Rochester, England
|
|
Posted: Fri Jul 06, 2012 7:05 am |
|
|
Hi Folks,
2 Hours in - and I know where the crash isn't (the area I thought it would be - the packet decoding of the satellite data, and subsequent maths).
I've now moved the pins, and am re-running the code.
Max stack usage 9 out of 31, max RAM usage 27%.
I am only actually using the satellite to update the time on power up, and at midnight. From experience, as the satellites move around the Earth, there are times of the day when you may lose a valid fix - and I don't want the clock hands to stop moving at this point. The satellite packet is used to update a real time clock running on a quartz crystal. The quartz is used to calculate the position of the hands.
The reason I only update at midnight is that the NMEA packet is only really accurate to +/-1 second, especially as the GPS just squirts packets out. As a quartz crystal can drift a bit, it's possible that the quartz moves to second X+1 (and hence advances the second hand), the packet then comes in and sets the time back to X, and then the quartz ticks to X+1 again, so you have moved more seconds than you want and get hand drift.
Thanks for the help so far.
James |
|
|
JamesW
Joined: 23 Apr 2007 Posts: 91 Location: Rochester, England
|
|
Posted: Fri Jul 06, 2012 10:09 am |
|
|
Now this is interesting. Having been playing with the unit all day, it seems to be stopping at roughly the same point, when looking at the RS232 output and my two debug pins.
I had originally been doing it the "proper way" - i.e. printing the debug information to a buffer, and using the transmit buffer empty interrupt to empty the buffer and squirt the data out.
About 3 hours ago I removed all of this and replaced it with a simple putc instead - and it seems to be running without incident.
So the golden question is - is there an obvious bug in the code below?
Code: |
#use rs232(baud=115200, BITS=8, PARITY=N, STOP=1, xmit=PIN_B6, rcv=PIN_B7, ERRORS)

void bputc(unsigned int8 NewChar)
{
   putc(NewChar);
}

char DebugGetc()
{
   return getc();
}

#define T_BUFFER_SIZE 250
unsigned int8 t_buffer[T_BUFFER_SIZE];
unsigned int8 t_next_in = 0;
unsigned int8 t_next_out = 0;

#int_tbe2
void bserial_isr()
{
   output_high(PIN_A2); /* V1.19 DEBUG */
   if(t_next_in!=t_next_out)
   {
      bputc(t_buffer[t_next_out]);
      t_next_out = t_next_out + 1;
      if (t_next_out >= T_BUFFER_SIZE)
         t_next_out = t_next_out - T_BUFFER_SIZE;
   }
   else
   {
      disable_interrupts(INT_TBE2);
   }
   output_low(PIN_A2); /* V1.19 DEBUG */
}
/* ------------------------------------------------------------------------- */
void DebugPutc(char c)
{
   short restart;
   int ni;

   restart=t_next_in==t_next_out;
   t_buffer[t_next_in]=c;
   ni=(t_next_in+1) % T_BUFFER_SIZE;
   while(ni==t_next_out);
   t_next_in=ni;
   if(restart)
      enable_interrupts(INT_TBE2);
}
/* ------------------------------------------------------------------------- */ |
This code is a rough hack of one of the ccs examples
cheers
James |
|
|
newguy
Joined: 24 Jun 2004 Posts: 1907
|
|
Posted: Fri Jul 06, 2012 2:51 pm |
|
|
You have to be very careful of the TBE interrupt - any small error or vulnerability in your transmit code, no matter how slight, will cause the TBE interrupt to continually fire. Been there, done that, 5 or 6 times now.
From what you describe it sounds like your program isn't properly handling transmit code. I haven't had a really close look at your code, but one thing does stand out: you only transmit if there is a character mismatch, which begs the question: what if they're the same and not different?
The way I usually handle a transmit interrupt is to load a buffer with a message and keep a tally of the characters that need to be transmitted. If a transmit isn't already in progress, start one and enable the TBE interrupt. The TBE interrupt then checks to see if the number of transmitted characters matches the number of characters to be transmitted - if so, it's done and the TBE interrupt must be disabled. If not, fetch the next character in the buffer and load it into the USART, and increment the number of characters transmitted. Just be sure to keep separate copies of critical things like buffer indexes (one for writing, one for reading), and numbers of characters transmitted/to be transmitted. The only "gotcha" is for large arrays which require a 16 bit index for a program running on an 8 bit processor. You need to disable interrupts just before doing any increments/comparisons on the indexes, since this method uses them to enable/disable the TBE interrupt. Since an 8 bit machine can't directly manipulate 16 bits at once, this is a potential trouble spot. |
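One concrete way the posted code could stall the main loop while every ISR keeps running: `DebugPutc()` samples `restart` before the new byte is committed. If the TBE interrupt drains the buffer and disables itself between those two points, the byte ends up queued with the interrupt off and `restart` false, so nothing re-enables it. Later calls see a non-empty buffer, never set `restart`, and once the buffer fills, `while(ni==t_next_out);` spins forever. The sketch below is a host-side model in portable C, not CCS code: `tbe_enabled` stands in for the INT_TBE2 enable bit, the unlucky interrupt timing is injected by hand, and the "always enable" variant is one possible remedy rather than something confirmed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define T_BUFFER_SIZE 250

static uint8_t t_buffer[T_BUFFER_SIZE];
static uint8_t t_next_in, t_next_out;
static bool tbe_enabled;          /* stands in for the INT_TBE2 enable bit */

/* Model of the TBE ISR: consume one byte, or switch itself off when empty. */
static void tbe_isr(void)
{
    if (!tbe_enabled)
        return;
    if (t_next_in != t_next_out)
        t_next_out = (uint8_t)((t_next_out + 1) % T_BUFFER_SIZE);
    else
        tbe_enabled = false;
}

/* Same order of operations as the posted DebugPutc(). isr_mid_call injects
   an interrupt burst between sampling `restart` and committing t_next_in.
   Returns false where the real code would spin in while(ni==t_next_out). */
static bool debug_putc(char c, bool always_enable, bool isr_mid_call)
{
    bool restart = (t_next_in == t_next_out);
    uint8_t ni;

    assert(t_next_in < T_BUFFER_SIZE);
    t_buffer[t_next_in] = (uint8_t)c;
    ni = (uint8_t)((t_next_in + 1) % T_BUFFER_SIZE);
    if (isr_mid_call)
        while (tbe_enabled)
            tbe_isr();            /* ISR drains the buffer and disables itself */
    while (ni == t_next_out) {
        if (!tbe_enabled)
            return false;         /* buffer full, nobody left to empty it */
        tbe_isr();                /* interrupt frees one slot */
    }
    t_next_in = ni;
    if (restart || always_enable)
        tbe_enabled = true;
    return true;
}

/* Queue `attempts` bytes with one unlucky interrupt during the second byte;
   returns how many were accepted before the writer would hang. */
static int bytes_before_hang(bool always_enable, int attempts)
{
    t_next_in = t_next_out = 0;
    tbe_enabled = false;
    for (int i = 0; i < attempts; i++)
        if (!debug_putc('x', always_enable, i == 1))
            return i;
    return attempts;
}
```

In this model the original logic accepts 250 bytes after the unlucky interrupt and then stalls, while the variant that enables the interrupt on every call never does. On the target, the equivalent change would be calling `enable_interrupts(INT_TBE2)` unconditionally at the end of `DebugPutc()`. The window is only a few instructions wide, which would fit a fault that shows up once in a blue moon.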
|
|
asmallri
Joined: 12 Aug 2004 Posts: 1634 Location: Perth, Australia
|
|
Posted: Sat Jul 07, 2012 10:53 pm |
|
|
I had a tricky code lockup situation a few years back where my main code locked up but the ISRs functioned correctly. The way I tracked the problem was to add an additional switch input to be pressed in the event the application locked up. The ISR handler checked the button and, if depressed, printed out the contents of the stack. This enabled me to find the code that was interrupted by the ISR and therefore where the mainline was looping. _________________ Regards, Andrew
http://www.brushelectronics.com/software
Home of Ethernet, SD card and Encrypted Serial Bootloaders for PICs!! |
|
|