terryopie
Joined: 13 Nov 2015 Posts: 13
|
SOLVED/Kinda - Random Resets with reason of MCLR_FROM_RUN... |
Posted: Fri Nov 13, 2015 8:46 am |
|
|
For reference I am using the following:
Compiler: PCWHD v4.135
Chip: PIC18F66K80
Memory usage: ROM=86% RAM=37% - 39%
FUSES:
Code: | #FUSES VREGSLEEP_SW,INTRC_HP,SOSC_DIG,HSM,PLLEN,NOFCMEN,IESO,PUT,NOBROWNOUT,BORV30,NOWDT,WDT1,CANE,MCLR,NOXINST,PROTECT,NOCPD,NOSTVREN,NODEBUG,NOCPB,NOWRT,NOWRTB,NOWRTC,NOWRTD,NOEBTR,NOEBTRB |
I recently made what should have been a minor change to the project in question. This project has been running on this chip for the last 3 years and has not had any issues.
After making the change I started experiencing random resets. When the reset happens for a given version of code, I can duplicate it nearly every time and on different boards. But, removing commented code, adding comments, or removing unused code all result in either the problem going away, or moving to a new part of the code. The part of the code where the reset/problem manifests has nothing to do with the minor change that was made.
I implemented a call to restart_cause() as the first line of main, and am showing the result on our display. After a normal power cycle, the value is 12 (NORMAL_POWER_UP). But when we get the unexpected reset, the value is 15 (MCLR_FROM_RUN).
I am using a Saleae Logic Analyzer to monitor the MCLR pin. The pin does not drop out at any time during this period. I have the sampling rate set to the highest that it supports, so I should have seen any glitch. I did try disabling the MCLR fuse, but that didn't seem to have any effect. The MCLR pin is connected through a 10K resistor to pin 6 of a TC1232 watchdog chip.
Using the logic analyzer and a spare pin, I did narrow down the part of the code where the reset happened to occur. But there were no issues with that part of the code. The code in question is listed below... current and future lcd line arrays all have a size of 17. All other variables are unsigned int8.
Code: | case 1: // IN INSP? Y/N
case 5: // AT DOWN LIMIT Y/N
case 7: // AT FLR XX Y/N
case 11: // AT UP LIMIT Y/N
case 12: // 102? Y/N
   if(current_lcd_line1[15]==0x20){ // Blank
      if(setup_var)
         future_lcd_line1[15] = 'Y'; // single character, so a character constant
      else
         future_lcd_line1[15] = 'N';
      setup_var_blink_tmr = 75;
   }
   else{
      future_lcd_line1[15] = 0x20;
      setup_var_blink_tmr = 15;
   }
   break; |
I am at a loss as to what to try now. My minor change was to add some logic for checking an already existing value. At most I added 4 lines of code. I can't see that this is a software issue, but maybe it is uncovering a hidden issue? Too close to Max ROM usage on this chip?
I'm hoping that someone can point me in the right direction to get this solved. But as of right now, because of how seemingly unrelated changes cause errors in other unrelated parts of the code, I don't trust the compiler anymore. I tried looking for change logs for version 4.1xx of the compilers, but can only find it for v5.xxx. Is it possible this is a compiler issue and fixed in a later version?
Any help is appreciated!
Last edited by terryopie on Tue Nov 24, 2015 7:52 am; edited 1 time in total |
|
|
wangine
Joined: 07 Jul 2009 Posts: 98 Location: Curtea de Arges, Romania
|
|
Posted: Fri Nov 13, 2015 9:18 am |
|
|
Several things can cause a random reset through the MCLR pin. First, try to identify where the real problem is: it could be the WDT chip, the power source, the PIC, or the compiler. Take it step by step. Remove the watchdog chip and tie MCLR high through a 1k resistor; you can also put a 100nF capacitor close to the MCLR pin, just for the test. If the resets stop and the chip runs normally, the fault could be the WDT chip or the power source. Remove the capacitor and watch again; if the chip still runs normally then the power source is OK too, though it is still worth swapping it to see what happens. After that you can test the WDT chip separately. If everything checks out, what remains is a code issue or simply a compiler bug. Right now, without going step by step, it is hard to find the fault. |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19520
|
|
Posted: Fri Nov 13, 2015 9:24 am |
|
|
First thing: why have you got NOSTVREN selected? STVREN is one that should always be selected, unless you add your own code to monitor the stack. I would not be surprised if, once you enable it, you get the flag saying that the reset was caused by a stack overflow.
Now 'MCLR_FROM_RUN' is distinguished by being the one you get if nothing else has reset the chip. So there has not been a watchdog, not been a power on reset, not been a brownout, etc. All that is left is an MCLR reset from run, so this is what it reported. You will get this if (for instance) you jump to the first location in memory, or if an invalid value is popped from the stack, resulting in a return to the bottom of memory. This is why I suspect STVREN.
So look carefully at your stack handling. |
|
|
terryopie
Joined: 13 Nov 2015 Posts: 13
|
|
Posted: Fri Nov 13, 2015 10:58 am |
|
|
Quote: | First thing: why have you got NOSTVREN selected? STVREN is one that should always be selected, unless you add your own code to monitor the stack. I would not be surprised if, once you enable it, you get the flag saying that the reset was caused by a stack overflow. |
The following are the only defined constants for restart_cause() that I find in the header file:
Code: | // Constants returned from RESTART_CAUSE() are:
#define WDT_TIMEOUT 7
#define MCLR_FROM_SLEEP 11
#define MCLR_FROM_RUN 15
#define NORMAL_POWER_UP 12
#define BROWNOUT_RESTART 14
#define WDT_FROM_SLEEP 3
#define RESET_INSTRUCTION 0 |
I did enable STVREN, but still got back the MCLR_FROM_RUN reason. Possibly because there isn't a define for stack overflow?
Since there wasn't I also dumped out the value of the STKPTR register to my display... It came up with a value of 0x40. This would indicate a stack underflow condition.
I understand an overflow... Get into a recursion loop that you can't get out of, or not take the full stack size into account before going too deep.
But what causes a stack underflow? How can it pop the stack pointer past "main"? Now I'm confused... I didn't add or modify any functions.
Ideas?
Thank you! |
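The two values discussed in this post (the restart_cause() code and the saved STKPTR byte) can be decoded with a few lines of plain C. This is only a host-side sketch: the constants are copied from the header excerpt quoted above, the STKPTR layout (bit 7 = STKFUL, bit 6 = STKUNF, bits 4:0 = stack pointer) is the usual PIC18 one and should be checked against the PIC18F66K80 data sheet, and the names restart_cause_name, stkptr_status and stkptr_depth are hypothetical helpers, not part of the CCS library.

```c
#include <stdint.h>

/* Values copied from the restart_cause() constants in the header excerpt. */
#define WDT_TIMEOUT        7
#define MCLR_FROM_SLEEP   11
#define MCLR_FROM_RUN     15
#define NORMAL_POWER_UP   12
#define BROWNOUT_RESTART  14
#define WDT_FROM_SLEEP     3
#define RESET_INSTRUCTION  0

/* Assumed PIC18 STKPTR layout: verify against the data sheet. */
#define STKPTR_STKFUL   0x80u  /* bit 7: stack full / overflow flag */
#define STKPTR_STKUNF   0x40u  /* bit 6: stack underflow flag       */
#define STKPTR_SP_MASK  0x1Fu  /* bits 4:0: current stack depth     */

typedef enum { STACK_OK, STACK_OVERFLOW, STACK_UNDERFLOW } StackStatus;

/* Hypothetical helper: turn a restart_cause() value into a printable name. */
const char *restart_cause_name(uint8_t cause)
{
    switch (cause) {
    case WDT_TIMEOUT:       return "WDT_TIMEOUT";
    case MCLR_FROM_SLEEP:   return "MCLR_FROM_SLEEP";
    case MCLR_FROM_RUN:     return "MCLR_FROM_RUN";
    case NORMAL_POWER_UP:   return "NORMAL_POWER_UP";
    case BROWNOUT_RESTART:  return "BROWNOUT_RESTART";
    case WDT_FROM_SLEEP:    return "WDT_FROM_SLEEP";
    case RESET_INSTRUCTION: return "RESET_INSTRUCTION";
    default:                return "UNKNOWN";
    }
}

/* Hypothetical helper: classify a saved STKPTR byte by its error flags. */
StackStatus stkptr_status(uint8_t stkptr)
{
    if (stkptr & STKPTR_STKFUL) return STACK_OVERFLOW;
    if (stkptr & STKPTR_STKUNF) return STACK_UNDERFLOW;
    return STACK_OK;
}

/* Hypothetical helper: extract the stack depth field from STKPTR. */
uint8_t stkptr_depth(uint8_t stkptr)
{
    return (uint8_t)(stkptr & STKPTR_SP_MASK);
}
```

Under these assumptions, stkptr_status(0x40) classifies the value reported here as an underflow with a depth field of 0, which matches the reading given in the post.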
|
|
guy
Joined: 21 Oct 2005 Posts: 297
|
|
Posted: Fri Nov 13, 2015 1:55 pm |
|
|
I had an issue just like you describe (stable code, minor change, resets). In my case it turned out to be a stack overflow in a printf() statement. These, when nested inside functions and dealing with floating point numbers (but not only FP) tend to cause resets. If this could be the case, try simplifying the printf() by making calculations before the printf and then only displaying the result, try avoiding floating point, etc.
Also avoid nested functions if the printf() is deeply nested.
On the PIC24 you can increase the stack.
I'm not sure that my explanation is 100% correct but these practices solved the problem in my case. |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19520
|
|
Posted: Sat Nov 14, 2015 3:33 am |
|
|
OK.
On the restart_cause, _you_ have to test for stack over/underflow. The bits for this are not part of the RCON register, which is what 'restart_cause' actually reflects. With STVREN enabled, if you add:
Code: |
#bit STKFUL=getenv("bit:STKFUL")
#bit STKUNF=getenv("bit:STKUNF")
if (STKFUL) {
   //display or indicate somehow that you have a stack overflow
}
if (STKUNF) {
   //display or indicate that you had a stack underflow
}
|
All the RCON bits are 'undefined' if a stack error occurs.
The 'underflow' errors can be tested for without STVREN, but the overflow error can't.
Now, are you running this in debug? There is a little problem here: the debugger steals two stack levels. So code that could actually run OK for real then gives stack overflows.
What does the listing show for stack used?
The classic thing that can cause a stack error, other than just 'running out', is a GOTO. This is one reason they are 'discouraged'. If (for instance) you jump out of a piece of code inside a function while a return address is on the stack (sometimes this happens inside a switch statement, for example), then the stack can be left 'out of balance'.
Also remember that if your code (for instance) uses one more stack level, then the actual fault can appear somewhere else, when this just happens to step over the edge.... |
|
|
asmallri
Joined: 12 Aug 2004 Posts: 1635 Location: Perth, Australia
|
Re: Random Resets with reason of MCLR_FROM_RUN... |
Posted: Sat Nov 14, 2015 7:03 am |
|
|
terryopie wrote: |
After making the change I started experiencing random resets. When the reset happens for a given version of code, I can duplicate it nearly every time and on different boards. |
Are the boards powered with their own power supply or are all boards being tested with a common test bench power supply? If you are performing this testing with a common test setup then check for problems in the test setup: insufficient power supply filtering, faulty power supply, insufficient current, etc. _________________ Regards, Andrew
http://www.brushelectronics.com/software
Home of Ethernet, SD card and Encrypted Serial Bootloaders for PICs!! |
|
|
terryopie
Joined: 13 Nov 2015 Posts: 13
|
|
Posted: Mon Nov 16, 2015 7:44 am |
|
|
Quote: | On the restart_cause, _you_ have to test for stack over/underflow. The bits for this are not part of the RCON register, which is what 'restart_cause' actually reflects. With STVREN enabled, if you add: |
I did enable STVREN. First thing in main, I am saving away the STKPTR register. It is giving me a value of 0x40 (underflow).
Quote: | Now, are you running this in debug? There is a little problem here: the debugger steals two stack levels. So code that could actually run OK for real then gives stack overflows. |
No, I am not running in debug.
Quote: | What does the listing show for stack used? |
Listing shows stack usage here:
Code: |
ROM used: 56620 bytes (86%)
Largest free fragment is 8912
RAM used: 1366 (37%) at main() level
1423 (39%) worst case
Stack: 7 worst case (6 in main + 1 for interrupts) |
|
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19520
|
|
Posted: Mon Nov 16, 2015 8:16 am |
|
|
_Underflow_. Very interesting.
Somehow you are executing a return from something that is not actually called, or popping a value from the stack.
Does the code use function pointers?. Classic is these being overwritten so the code jumps to an unexpected location in memory.
Goto as already mentioned.
Interrupt enabled without a handler present (effect depends on what other code is down there). |
|
|
terryopie
Joined: 13 Nov 2015 Posts: 13
|
|
Posted: Mon Nov 16, 2015 9:28 am |
|
|
Ttelmah wrote: | _Underflow_. Very interesting.
Somehow you are executing a return from something that is not actually called, or popping a value from the stack.
Does the code use function pointers?. Classic is these being overwritten so the code jumps to an unexpected location in memory.
Goto as already mentioned.
Interrupt enabled without a handler present (effect depends on what other code is down there). |
Not using function pointers anywhere. Only have one segment of inline assembly. No GOTO or CALL commands being used. I'll have to go back through and double check that all but the one interrupt that we are using are disabled. Unfortunately can't check that for a few days... I'll report back.
Thank you for the suggestions!! |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19520
|
|
Posted: Mon Nov 16, 2015 9:44 am |
|
|
One section of in-line assembly?.
Postable?.
Any write to STKPTR, could cause this.
Any instruction that accesses PCL, PCLATH, or PCLATU.
Any POP.
The first two could be the result of a memory pointer (or array access), that is accessing an address outside the array... |
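One cheap way to rule out the out-of-range array access mentioned above is to route writes through a bounds-checked helper while debugging. The sketch below is plain C and only illustrative: LCD_LINE_SIZE reflects the array size of 17 stated in the first post, and lcd_line_put is a hypothetical name, not something from the original project or the CCS library.

```c
#include <stddef.h>

#define LCD_LINE_SIZE 17  /* array size stated in the first post */

/*
 * Hypothetical debug helper: store ch at line[index] only when index is
 * inside the array. Returns 1 if the write happened, 0 if it was rejected.
 * A rejected write is the place to set a breakpoint or toggle a spare pin.
 */
int lcd_line_put(char line[LCD_LINE_SIZE], size_t index, char ch)
{
    if (index >= LCD_LINE_SIZE) {
        return 0;  /* would have written outside the array */
    }
    line[index] = ch;
    return 1;
}
```

A call such as lcd_line_put(future_lcd_line1, 15, 'Y') behaves like the direct assignment, while any index of 17 or more is refused instead of landing on whatever RAM or register follows the array.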
|
|
terryopie
Joined: 13 Nov 2015 Posts: 13
|
|
Posted: Mon Nov 16, 2015 10:11 am |
|
|
Here is the Assembly:
Code: | #ASM
MOVF _a_lo,W ; Set-up address to write to
MOVWF EEADR
MOVF _a_hi,W
MOVWF EEADRH
MOVF _a_lo,W ; Set-up address to write to
MOVWF EEADR
MOVF _ee_data,W ; Set-up data to write
MOVWF EEDATA
BCF EECON1,7 ; Point to Data EEPROM Memory
BSF EECON1,2 ; Enable EEPROM Write
BCF INTCON,7 ; Disable interrupts globally
MOVLW 0x55 ; The next four lines are required to allow the write
MOVWF EECON2
MOVLW 0xAA
MOVWF EECON2
BSF EECON1,1 ; Set WR bit to begin write
BSF INTCON,7 ; Enable interrupts globally
#ENDASM
|
This snippet is how we write the internal EEPROM. It's somewhat faster than using the built-in interface. |
|
|
PCM programmer
Joined: 06 Sep 2003 Posts: 21708
|
|
Posted: Mon Nov 16, 2015 10:28 am |
|
|
I suspect that you are putting a RETURN instruction in the ASM code,
instead of letting the compiler handle the return by letting the function
proceed to the closing brace. Maybe you are not doing it in the posted
routine, but you may be doing it somewhere.
This would work, but sometimes the compiler won't do a CALL. It will do
a pseudo-call with a BRA to the routine, and the compiler inserts a BRA at
the end of the routine to jump back to the caller. There is no stack
involved. In this case, the insertion of RETURN is extremely ill advised.
If you thwart the compiler by inserting in your own RETURN in #asm,
you are sabotaging your own program. Absolutely marginal gains
are not worth going to assembly code. |
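The effect described above can be illustrated with a small host-side model of a hardware return stack. This is a simplified sketch in plain C, not a cycle-accurate PIC18 model: the 31-level depth and the "underflow returns address 0 and sets STKUNF" behaviour follow the usual PIC18 description, and the ReturnStack type and rs_* functions are hypothetical names invented for the illustration.

```c
#include <stdint.h>

#define STACK_LEVELS 31  /* assumed PIC18 hardware return stack depth */

/* Simplified model of the return stack and its two error flags. */
typedef struct {
    uint32_t slot[STACK_LEVELS];
    uint8_t  sp;      /* number of return addresses currently held */
    uint8_t  stkful;  /* set when a push is attempted on a full stack */
    uint8_t  stkunf;  /* set when a pop is attempted on an empty stack */
} ReturnStack;

void rs_init(ReturnStack *rs)
{
    rs->sp = 0;
    rs->stkful = 0;
    rs->stkunf = 0;
}

/* Models CALL: push the return address. */
void rs_call(ReturnStack *rs, uint32_t return_addr)
{
    if (rs->sp >= STACK_LEVELS) {
        rs->stkful = 1;  /* overflow: nothing more can be stored */
        return;
    }
    rs->slot[rs->sp++] = return_addr;
}

/* Models RETURN: pop the return address, or flag an underflow. */
uint32_t rs_return(ReturnStack *rs)
{
    if (rs->sp == 0) {
        rs->stkunf = 1;  /* underflow: execution resumes at address 0 */
        return 0;
    }
    return rs->slot[--rs->sp];
}
```

In this model a routine reached by BRA never goes through rs_call, so a hand-written RETURN inside it calls rs_return with nothing of its own on the stack. At main level that means an immediate underflow and a jump to address 0, which looks like a reset.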
|
|
terryopie
Joined: 13 Nov 2015 Posts: 13
|
|
Posted: Mon Nov 16, 2015 10:36 am |
|
|
PCM programmer wrote: | I suspect that you are putting a RETURN instruction in the ASM code,
instead of letting the compiler handle the return by letting the function
proceed to the closing brace. Maybe you are not doing it in the posted
routine, but you may be doing it somewhere.
This would work, but sometimes the compiler won't do a CALL. It will do
a pseudo-call with a BRA to the routine, and the compiler inserts a BRA at
the end of the routine to jump back to the caller. There is no stack
involved. In this case, the insertion of RETURN is extremely ill advised.
If you thwart the compiler by inserting in your own RETURN in #asm,
you are sabotaging your own program. Absolutely marginal gains
are not worth going to assembly code. |
The only assembly is what is listed in my above reply... No return that I can see would be added from that. Correct me if I am wrong. |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19520
|
|
Posted: Mon Nov 16, 2015 11:16 am |
|
|
There are several instructions missing from the posted assembler. After re-enabling GIE, you should clear the WREN bit. If this is not done, later table accesses can result in writes to the memory.
Then, before initiating the write, you must clear the EEPGD and CFGS bits, and set the WREN bit. As written it could fail to write completely (if the WREN bit is not set), and could write to the program memory instead of the EEPROM.
Look at the listing in the data sheet. |
|
|
|
|