[SOLVED] 4.140 doing weird stuff (dsPIC)

newguy · Joined: 24 Jun 2004 Posts: 1908

I started development on a whole new family of products (to my company) about 6 months ago (ver 4.134ish I believe). I normally never use the latest compiler release (learned my lesson years ago), but 4.099 didn't support the particular dsPIC I'm using so I didn't have a choice.

At this point there are 2 major products I'm co-developing, but there will be more. The biggest feature of these things is an "endpoint-CAN" architecture to replace a veritable spider's web CAN architecture on our old product. What I mean by "endpoint-CAN" is that each device on the network will be a CAN endpoint for two CAN streams (by virtue of the two CAN transceivers that some dsPICs have). The primary reason for this is to enable easy field troubleshooting by unskilled personnel. With the old spider's web (weird interconnection matrix, single CAN network), a single bad cable could bring down the entire network and finding which cable would sometimes take days. With the new simplified architecture, each board in the system reports a bad CAN stream by flashing an LED, making it very easy for someone to find which cable is bad. Bad cables are our primary fault (something like 95%+ of cases because they experience a lot of wear & tear and physical jostling/movement).

I aimed my attention at one product first, and got that working satisfactorily relatively quickly. Its CAN performance is without fault and perfectly reliable.

Once that was done, I simply cut & pasted that CAN code over to the next project since they both share the same CAN architecture and data flow. The problem with the second project is (hopefully was) that it will lose a CAN stream periodically. Only one stream fails, but the other continues to work.

I've worked on this issue for almost a month and although I did find minor problems with the CAN code, I've not been able to stop this CAN hanging issue. Again, the weird thing is that the CAN code is 100% identical to that running in a different product and that product doesn't exhibit this behaviour. I've developed over 20 different products that utilize CAN communication and I've never seen a CAN bus "hang" ever before.

I've incorporated micro SD card slots into these new boards for error log generation/saving for cases just like this. Yesterday I enabled an error log function in the misbehaving board (prior to yesterday it was a part of the project but wasn't actually called in code) and added a couple of variables to enable the detection of a lost CAN stream which would trigger a dump of a bunch of diagnostic information to the SD card. Since I've loaded this problematic board with this new firmware, the thing hasn't hung yet and it's been almost 24 hours. It would hang within an hour before.
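For readers who want to try the same kind of instrumentation, here is a minimal sketch of a lost-stream detector that decides when to trigger a diagnostic dump. It is not newguy's code: the struct, the function names and the millisecond tick are all hypothetical, and the actual SD card write is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-stream watchdog; names and units are illustrative only. */
typedef struct {
    uint32_t last_rx_ms;   /* tick of the most recent frame on this stream */
    uint32_t timeout_ms;   /* silence longer than this counts as "lost"    */
    bool     dumped;       /* true once a dump was requested this outage   */
} can_stream_watch_t;

static void watch_init(can_stream_watch_t *w, uint32_t now_ms, uint32_t timeout_ms)
{
    w->last_rx_ms = now_ms;
    w->timeout_ms = timeout_ms;
    w->dumped     = false;
}

/* Call from the receive path whenever a frame arrives on the stream. */
static void watch_frame_received(can_stream_watch_t *w, uint32_t now_ms)
{
    w->last_rx_ms = now_ms;
    w->dumped     = false;   /* stream is alive again, so re-arm the dump */
}

/* Call periodically. Returns true exactly once per outage. */
static bool watch_should_dump(can_stream_watch_t *w, uint32_t now_ms)
{
    /* Unsigned subtraction stays correct when the tick counter wraps. */
    if (!w->dumped && (uint32_t)(now_ms - w->last_rx_ms) > w->timeout_ms) {
        w->dumped = true;
        return true;
    }
    return false;
}
```

The main loop would call `watch_should_dump()` for each stream and write the diagnostic record when it returns true. Latching the `dumped` flag keeps a long outage from filling the card with identical records.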

I'm pretty sure of the answer already but I would just like affirmation from others: this is smelling suspiciously like a compiler issue. What annoys me is that even though I don't require this particular error log feature on this particular product, it certainly appears as though I have to leave it in in order to get this thing to work.

gpsmikey · Joined: 16 Nov 2010 Posts: 588 Location: Kirkland, WA

I would assume you have already been down this path, but have you checked the errata sheet for that chip? Works on one but not the other could easily be one of those "details in the sand" things and Microchip has had some quirky errata before on some of the chips.

mikey
_________________
mikey
-- you can't have too many gadgets or too much disk space !
old engineering saying: 1+1 = 3 for sufficiently large values of 1 or small values of 3

newguy · Joined: 24 Jun 2004 Posts: 1908

I've been through the errata many times and nothing covers this. It is looking like a compiler issue (memory corruption most likely), but the program is so large that it's very difficult to isolate the exact cause.

Same code, same processor, two different projects - one flawless, one buggy. I should say was buggy as it's now about 26 hours of operation with no hangs so far. Given that it had difficulty going an hour without hanging before, I'm starting to be hopeful that I've somehow eliminated the problem but I hate "act of god" bug eliminations! I'd much rather know what was causing the problem in the first place so that I can implement a proper fix.

Reminds me of an issue I stumbled upon a couple years back (different processor, different compiler). Code that didn't have any issues didn't work (processor would go into a POR-init-POR-init... infinite loop). Simply declaring (but not using) one variable suddenly made everything right. After struggling with that issue for days, I really had no choice but to say "thank you" and turn my back on it because I really couldn't afford to waste any more time on it.

I'm at the same point with this thing.

yerpa · Joined: 19 Feb 2004 Posts: 58 Location: Wisconsin

Could be stack usage varies with compiler version, or the compiler command line switches might be different between the two versions. Possibly an array pointer overruns its allocated memory area and gets into space that was previously harmless, but now causes errors because of different boundaries of memory space. Check the two map files against each other.

gpsmikey · Joined: 16 Nov 2010 Posts: 588 Location: Kirkland, WA

Yep - I know what you mean about the "I fiddled with it and now it works" -- somehow, you just know murphy is lurking around the corner :-)

mikey

Ttelmah · Joined: 11 Mar 2010 Posts: 19520

There are half a dozen things that might apply:

The first obvious one is that there is something faulty in the device database for the new chip. CCS are very prone to problems here (see the recent thread from bkamen about Vref on some of the PIC24's). You may be seeing some particular register being incorrectly initialised; possibly just moving the order of two initialisation components that access the same registers, or a fractional timing change here, has triggered the 'fix'.
The second, equally obvious, one is that the problem relates to variable placement, with a 'hidden fault' (accessing one byte beyond the end of an array, for example) which only gives problems when the very next variable is being used for something that matters.
The third possibility is include files. One 'classic' is to upgrade the compiler and then compile a project, not realising the project file still pulls the include files from the older compiler. It only takes CCS to have tweaked an include, and this gives problems. However, recompile on another day, not using the project file, and suddenly the correct include files are being used.
The fourth as yerpa says, is stack usage, though this would normally trigger an exception.
The next is 'non errata chip differences'. It'd be worth triple checking that things like input thresholds are the same on any pins you are using to receive data. It'd only take one chip to be using 0.7Vdd, and the other 0.8Vdd, for a Schmitt trigger threshold to trigger a change in reliability. This also applies to things like supply rail noise sensitivity (which does differ massively between some of the different PIC models). When you added the MicroSD slot, did you perhaps add a capacitor here?
The final one is a micro timing issue. Though the SD code is not being used, maybe the extra time needed to test if it should be called is allowing something else to complete (possibly in the hardware). You should perhaps look at what is happening immediately in front of this test, or removing the test and adding a tiny delay.
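
The 'hidden fault' case in the list above (a write one byte past the end of an array) can be made visible with a guard byte. The following is only a sketch under assumptions: the struct layout, names and guard value are hypothetical, and the store function is deliberately buggy to show the effect. It relies on the guard sitting directly after the buffer, which holds for byte-sized members.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_BUF_LEN   8u
#define GUARD_VALUE  0xA5u

/* Hypothetical layout: a guard byte placed directly after the buffer. */
typedef struct {
    uint8_t buf[RX_BUF_LEN];
    uint8_t guard;
} guarded_buf_t;

static void guard_init(guarded_buf_t *g)
{
    memset(g->buf, 0, sizeof g->buf);
    g->guard = GUARD_VALUE;
}

/* True while nothing has overwritten the guard byte. */
static bool guard_intact(const guarded_buf_t *g)
{
    return g->guard == GUARD_VALUE;
}

/* Deliberately buggy store: "<=" lets idx == RX_BUF_LEN through. */
static bool buggy_store(guarded_buf_t *g, size_t idx, uint8_t value)
{
    if (idx <= RX_BUF_LEN) {
        /* Byte-wise access to the whole struct keeps the demo well defined. */
        unsigned char *raw = (unsigned char *)g;
        raw[offsetof(guarded_buf_t, buf) + idx] = value;
        return true;
    }
    return false;
}
```

Checking `guard_intact()` periodically, or just before the suspect failure point, narrows down when the overrun happens. A stray write that happens to store the guard value itself goes unnoticed, so this is a diagnostic aid rather than a proof of correctness.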

The worst type of fault.....

Best Wishes

newguy · Joined: 24 Jun 2004 Posts: 1908

Thanks everyone. FWIW I've been exploring virtually all the suggested possible faults but nothing seems to fit.

For example, it's my standard practice to access all arrays in the same manner, by an index which is ANDed with a mask to prevent accessing memory past the bounds of the array. Don't think it's an array issue, particularly since the same code on a different board works.
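
A minimal sketch of the masked-index practice described above, with hypothetical names and sizes. It assumes the array length is a power of two, since otherwise the mask does not map onto the valid index range.

```c
#include <stdint.h>

#define LOG_LEN   16u               /* must be a power of two */
#define LOG_MASK  (LOG_LEN - 1u)

/* Hypothetical ring buffer; every access goes through the mask. */
typedef struct {
    uint16_t entries[LOG_LEN];
    uint16_t head;                  /* free-running write counter */
} ring_t;

static void ring_put(ring_t *r, uint16_t value)
{
    r->entries[r->head & LOG_MASK] = value;   /* index can never leave 0..15 */
    r->head++;
}

static uint16_t ring_get(const ring_t *r, uint16_t index)
{
    return r->entries[index & LOG_MASK];
}
```

The mask guarantees that an indexed access stays inside the array, but a wrong index then wraps silently onto a valid slot instead of failing. It also does nothing for writes made through pointers or block copies, so it narrows the array-overrun theory without ruling out memory corruption in general.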

Ttelmah, the SD card slot existed on the PCB since the beginning. The only thing that changed was me actually putting in the "hooks" to potentially call a function that writes to the card. Electrically the PCB is identical. Actually both boards (the one that never gave me problems and the one that has been a "problem child") share a common power supply. This family of devices I'm working on has been designed as a stacked "brick" with one "universal" power supply motherboard which daughterboards then stack onto. The good and "bad" boards are actually stacked together.

What has me completely at a loss is how the CAN just stops. I know that if the TX errors get high enough that the CAN bus can revert to "BUS OFF" but when I inserted code to test for this (flash a debug LED), the CAN bus hung again but without the "TXBO (transmit bus off)" flag being set. The only method (valid method) to properly take a CAN transceiver offline that I'm aware of is to set the CAN bus into the config mode. This would fit the evidence, HOWEVER this takes TWO erroneous conditions: somehow accidentally setting C1CTRL1.win = 1 AND then setting C1CTRL1's requested op mode bits into config (or disable too I suppose). I could see one of these conditions happening accidentally but both? The odds simply don't favour it.
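
The checks described above (bus off versus an unexpected operating mode) can be sketched as a classifier over register snapshots. Everything here is an assumption to verify against the data sheet for the actual part: the bit positions follow the common dsPIC33 ECAN layout (WIN in C1CTRL1 bit 0, OPMODE in bits 7:5, REQOP in bits 10:8, TXBO in C1INTF bit 13), and the enum and function names are hypothetical.

```c
#include <stdint.h>

/* Assumed dsPIC33-style ECAN bit positions; check the data sheet. */
#define CTRL1_WIN        0x0001u    /* C1CTRL1<0>, assumed    */
#define INTF_TXBO        0x2000u    /* C1INTF<13>, assumed    */
#define MODE_NORMAL      0u
#define MODE_CONFIG      4u

typedef enum {
    CAN_OK,
    CAN_BUS_OFF,              /* transmitter reached bus off             */
    CAN_NOT_NORMAL_MODE,      /* module reports a mode other than normal */
    CAN_MODE_CHANGE_PENDING   /* requested mode differs from actual mode */
} can_health_t;

static unsigned can_op_mode(uint16_t c1ctrl1)  { return (c1ctrl1 >> 5) & 0x7u; }
static unsigned can_req_mode(uint16_t c1ctrl1) { return (c1ctrl1 >> 8) & 0x7u; }

/* Classify one snapshot of the two registers, passed in by value. */
static can_health_t can_classify(uint16_t c1ctrl1, uint16_t c1intf)
{
    if (c1intf & INTF_TXBO) {
        return CAN_BUS_OFF;
    }
    if (can_op_mode(c1ctrl1) != MODE_NORMAL) {
        return CAN_NOT_NORMAL_MODE;
    }
    if (can_req_mode(c1ctrl1) != can_op_mode(c1ctrl1)) {
        return CAN_MODE_CHANGE_PENDING;
    }
    return CAN_OK;
}
```

Because the function takes plain values, the same snapshot can be logged and classified later. A hang that classifies as `CAN_OK`, as appears to have been the case here, suggests the module is still in normal mode and the fault lies somewhere other than bus off or an accidental mode change.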

Init timing? Can't see it. Timing differences SD present vs not present are irrelevant. Further, I just can't see how a minute change in when in the power up cycle the CAN transceiver is initialized can cause a random shutdown of the transceiver hours later.

CAN traffic trigger? Explored this in great depth. Set up a test to flood the DUT with CAN traffic and also tried very light traffic. Both cases see approximately the same MTBF, so there is no direct link between number of CAN packets processed and the time to failure. There is no "magic" CAN packet that triggers the failure.

CCS compiler accessing wrong register: oh hell yes. Actually, with these two projects, for the very first time I had to resort to the "Microchip C30 way" of setting up the CAN, timers, etc. I'm directly accessing the registers (and I read the data sheet thoroughly - many times in fact) because the built-in functions (specifically the dsPIC specific CAN functions) weren't working at all. Is there a possibility that I'm doing something wrong? Doubt it, as the CAN code between a project that works and one that has been problematic is completely identical.

Hardware problem: I've examined every solder joint in the CAN chain under a microscope and I've reflowed the solder on every pin 3 times now. ESD damage? Doubt it, as this has all been taking place in an ESD safe lab. That doesn't eliminate the possibility but it's rather unlikely. Could the processor be slightly "wonky"? Yes, but again I rather doubt it given that a firmware change made the problem go away.

Yerpa: stack usage issue would cause a processor reset (exception) and I'm not seeing that. Also can't compare the variable map files directly because the two projects are radically different. Yes, the low level CAN functionality is identical but everything else is not. Like comparing an apple to an orange - they both have seeds but that's where the similarity ends.

Seems like all this is a moot point anyway as I'm at approx 45 hours of operation with no CAN failure now.

newguy · Joined: 24 Jun 2004 Posts: 1908

Update:

Finally got to the bottom of the problem 2 days ago. By the way, the SD card logging function "fix" wasn't a fix (that came to light around a week after I rolled it out).

For anyone else doing CAN bus work, here's what the problem actually was. I've been working with CAN products for almost 10 years and I've never encountered this before. Two different products/projects, but the same processor, crystal, etc. on each product. The CAN code was largely identical (other than specific CAN filters) on both projects.

What actually happened is that two CAN nodes would get into a perpetual start of transmission, loss of arbitration, wait, start of transmission, loss of arbitration, wait, ad infinitum. Essentially what was happening is that each CAN node would eventually get into a perfectly synchronized "try to talk at the same time - whoops - wait - try to talk at the same time - etc" loop.

System level crappy diagram:
other devices ----- CAN up bus <--> Project 1 <--> CAN down bus ----- CAN up bus <--> Project 2 <--> CAN down bus ----- other devices

The problem would manifest between Project 1's CAN down bus, which is externally connected to Project 2's CAN up bus.

As I mentioned, I've never seen this before. I was convinced the problem existed with a project's code and I didn't consider the possibility that the problem was with the interaction of two projects.
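
The lockstep loop described in this post can be illustrated with a toy model. This is a sketch of the reported behaviour only, not a model of CAN bit-level arbitration or of the actual firmware: two nodes that attempt in the same tick both fail and wait their own retry interval, and a lone attempt succeeds. All names and tick values are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t next_attempt;  /* tick at which the node next tries to send */
    uint32_t retry_ticks;   /* wait applied after a failed attempt       */
    uint32_t sent;          /* frames delivered so far                   */
} node_t;

/* Runs the model for `ticks` ticks; returns how often both nodes clashed. */
static uint32_t simulate(node_t *a, node_t *b, uint32_t ticks, uint32_t period)
{
    uint32_t clashes = 0;
    for (uint32_t t = 0; t < ticks; t++) {
        int a_tries = (a->next_attempt == t);
        int b_tries = (b->next_attempt == t);
        if (a_tries && b_tries) {
            /* Both talk at once: neither gets through, both back off. */
            clashes++;
            a->next_attempt = t + a->retry_ticks;
            b->next_attempt = t + b->retry_ticks;
        } else {
            if (a_tries) { a->sent++; a->next_attempt = t + period; }
            if (b_tries) { b->sent++; b->next_attempt = t + period; }
        }
    }
    return clashes;
}
```

With both nodes starting at tick 0 and the same `retry_ticks`, every attempt clashes and `sent` stays at zero for as long as the model runs. Giving the two nodes different `retry_ticks` values lets them separate after the first clash.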

Ttelmah · Joined: 11 Mar 2010 Posts: 19520

On a particular network I designed some time ago, we saw the possibility of such a race condition, and deliberately had the nodes generate a 'random' number based on their start-up time, which fractionally altered the retry interval they used. It is interesting that these devices didn't drift apart enough to fix this after a few retries.
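
One way to derive such a per-node retry interval is sketched below. The thread does not show how the original network did it, so the base interval, jitter span, mixing function and names are all assumptions for illustration.

```c
#include <stdint.h>

#define BASE_RETRY_US   1000u   /* hypothetical nominal retry interval     */
#define JITTER_SPAN_US  64u     /* power of two so the mask below is valid */

/* Mix the start-up tick so nearby values spread across the jitter span. */
static uint32_t scramble(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x45d9f3bu;
    x ^= x >> 16;
    return x;
}

/* Retry interval for a node, fixed once from its start-up tick count. */
static uint32_t retry_interval_us(uint32_t startup_ticks)
{
    return BASE_RETRY_US + (scramble(startup_ticks) & (JITTER_SPAN_US - 1u));
}
```

Each node would call `retry_interval_us()` once at boot, with whatever free-running counter it has at that moment, and use the result for every retry. Two nodes can still land on the same value by chance, so a design that must never lock up might re-draw the jitter after each failed attempt.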

Best Wishes