[PIC24HJ256GP210] Device ID becomes corrupted

canadidan · Joined: 13 Feb 2019 Posts: 24

Research

When I search "device ID wrong", I find the following common causes:

• Issue with pulling MCLR low/high
• User trying to power PIC with programmer
• Other wiring issue with ICSP pins

These problems prevent people from programming the PIC. My problem is different.

My Issue

A perfectly working device will suddenly change identity - the device ID will change, but it will continue to function otherwise.

For this PIC, the device ID is 0x0073. After the corruption, it is 0x00FF, corresponding to a dsPIC33FJ256GP710.

Furthermore, if I tell my programmer the chip is a dsPIC33FJ256GP710 (not the PIC24HJ256GP210 it actually is), I can still load new software to the PIC's flash and run it (as shown above).

Cause

This has only occurred while running a custom "firmware upgrade" procedure: the new firmware is received over UART, stored in an external flash, the CRC is checked, then it is copied to the PIC flash using write_program_memory();
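The CRC step in that sequence could look roughly like the sketch below. The thread does not say which CRC is used, so CRC-16/CCITT-FALSE is an assumption here, and `crc16_ccitt` and `image_crc_ok` are hypothetical names, not part of the actual firmware.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF, no reflection.
 * Assumed for illustration; the actual CRC variant is not stated. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Returns 1 when the staged image matches the CRC received with it.
 * The copy into PIC flash would only start when this returns 1. */
int image_crc_ok(const uint8_t *image, size_t len, uint16_t expected_crc)
{
    return crc16_ccitt(image, len) == expected_crc;
}
```

Checking the image while it still sits in external flash means a bad transfer is rejected before any erase or write touches the PIC's program memory.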

Frequency

I have run this firmware upgrade routine on over 40 units and in excess of 500 times. This corruption has occurred on a total of 5 units.

Of them, 2 units were corrupted the first time I ran the upgrade procedure. The other 3 had been upgraded dozens of times before eventually being corrupted (also during/after this upgrade procedure).

Steps tried so far to recover these devices

As shown above, I first tried reading the Device ID from the device itself following these instructions:

https://www.ccsinfo.com/forum/viewtopic.php?t=43278

Then, I tried overwriting this area of flash (using both ASM and C code methods), but that hasn't worked. I suspect those addresses are protected.

Has anyone seen this before, and how could I correct the device ID?

Thank you,
Dan

temtronic · Posted: Wed Feb 13, 2019 10:55 am

OK, I don't have that PIC ...
but
any chance the programming cable length is the 'random' issue ?
or..
any chance the 5 bad ones have the same date/batch code info ??

Ttelmah · Joined: 11 Mar 2010 Posts: 19482

Device IDs are 16-bit values, not 8-bit values.

0xFF is not the ID for the DsPIC33HJ256GP710. Its ID is 0x7FF.
0xFF suggests the area has been erased. 0xFF is generically what the
memory erases to.

How are you reading this? You talk about an on-board programming
system. Is this being read internally from the chip or using a programmer?

The device ID is protected, but can be destroyed if there is a power
spike during programming.

Normally a device ID failure with an external programmer is a connection
problem: too much capacitance on a line, a bad connection, or incorrect
supplies. You'll get this if the Vcap signal is not being generated while
programming.

If the device ID is destroyed, it can't be rewritten.

canadidan · Joined: 13 Feb 2019 Posts: 24

@temtronic

We have 4 CCS programmers, with a variety of custom cables and a pogo-pin fixture for automated programming. Once the device ID is corrupted, it is repeatable across programmers.

I've checked 3 close to me, and the date/batch codes are varied:
2x 1811BTU
1x 1823JJ4

@Ttelmah

The on-board programming system does the following:

1. Erases a defined "application" region in flash:

Ttelmah · Joined: 11 Mar 2010 Posts: 19482

There is an issue with a very close member of the PIC family, where the
write 'stall' does not function correctly. It is vital that interrupts are
disabled during a program memory write, and on chips with this problem
the code should poll the WR bit to verify the write has completed.
Worth ensuring you have interrupts disabled, and add this check after
the write.

Devid 0x00FF is actually a dsPIC33FJ256GP710. FJ, not the HJ.

It should be impossible to write the DEVID except by damaging the
cells.
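A minimal sketch of the WR check suggested above, written so it can be tried off-target. It assumes WR is bit 15 of NVMCON (check this against the data sheet), and `wait_for_wr_clear` is a hypothetical helper. On the real device the pointer would be the NVMCON register, and the call would sit right after the write with interrupts still disabled.

```c
#include <stdint.h>

#define NVMCON_WR_BIT 0x8000u  /* assumed: WR is bit 15 of NVMCON */

/* Poll until the WR bit clears or the poll budget runs out.
 * Returns 1 if WR cleared (write finished), 0 on timeout. */
int wait_for_wr_clear(volatile const uint16_t *nvmcon, uint32_t max_polls)
{
    while (max_polls-- > 0) {
        if ((*nvmcon & NVMCON_WR_BIT) == 0)
            return 1;
    }
    return 0;
}
```

The bounded poll count is a design choice: if the bit never clears, the upgrade code gets an error back instead of hanging with interrupts off.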

canadidan · Joined: 13 Feb 2019 Posts: 24

@Ttelmah

My mistake - couldn't remember if it was FJ or HJ.

Interrupts

Before it initiates the erase/write to flash, it disables interrupts:

canadidan · Joined: 13 Feb 2019 Posts: 24

Here are the results of my testing, with the above changes:

Test setup

I created a "firmware upgrade stress test" routine - which continually runs our firmware upgrade method over UART until it fails.
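The shape of such a stress loop might be something like the sketch below. All names are hypothetical, and the `sim_*` functions are host-side stand-ins for the real upgrade routine and device ID read, included only so the loop can be exercised without hardware.

```c
#include <stdint.h>

typedef int (*upgrade_fn)(void);          /* hypothetical: one upgrade, 1 on success */
typedef uint16_t (*read_devid_fn)(void);  /* hypothetical: reads the device ID */

/* Repeat the upgrade until it fails, the device ID changes, or
 * max_cycles is reached. Returns the number of cycles that
 * completed with the ID still intact. */
uint32_t stress_upgrade(upgrade_fn upgrade, read_devid_fn read_devid,
                        uint16_t expected_id, uint32_t max_cycles)
{
    uint32_t good = 0;
    while (good < max_cycles) {
        if (!upgrade())
            break;
        if (read_devid() != expected_id)
            break;
        good++;
    }
    return good;
}

/* Simulated device: the ID flips to 0x00FF once a set number of
 * upgrades has run. The threshold is arbitrary, not measured data. */
uint32_t sim_cycles_until_corrupt = 5;
uint32_t sim_cycles_run = 0;

int sim_upgrade(void)
{
    sim_cycles_run++;
    return 1;
}

uint16_t sim_read_devid(void)
{
    return (sim_cycles_run >= sim_cycles_until_corrupt) ? 0x00FF : 0x0073;
}
```

Reading the ID after every cycle, rather than only at the end, is what gives a cycle count at the point of corruption.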

Test without checking WR

The device ID was corrupted after 25 cycles (date code 1823JJ4)

Test with checking WR

The device ID was corrupted after 106 cycles (date code 1823JJ4)

What does this mean?

With such a low sample size, it doesn't mean very much - except that the fix wasn't sufficient. Maybe it delayed it, or maybe silicon variances just delayed the inevitable.

Next steps

I will modify the erase code to stop at the end address of the newly provided firmware, rather than erasing all of program space. It means old code might be left behind, but it is safer than wiping straight to the end and possibly damaging the device ID.
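Limiting the erase to the new image could be sketched as below. The page size of 0x400 program addresses (512 instructions) is an assumption to verify against the data sheet. `pages_to_erase` and `erase_limit` are hypothetical helpers, and the application start address is assumed to be page-aligned.

```c
#include <stdint.h>

#define ERASE_PAGE_SIZE 0x400u  /* assumed: 512 instructions per erase page */

/* Number of pages to erase, starting at app_start, so that every
 * address in [app_start, image_end) is covered. */
uint32_t pages_to_erase(uint32_t app_start, uint32_t image_end)
{
    if (image_end <= app_start)
        return 0;
    uint32_t span = image_end - app_start;
    return (span + ERASE_PAGE_SIZE - 1u) / ERASE_PAGE_SIZE;
}

/* First address past the last erased page. Erasing stops here
 * instead of running to the end of program space. */
uint32_t erase_limit(uint32_t app_start, uint32_t image_end)
{
    return app_start + pages_to_erase(app_start, image_end) * ERASE_PAGE_SIZE;
}
```

For example, with an application region starting at 0x4000 and an image ending at 0x4402, two pages are erased and the erase stops at 0x4800.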

I have a whole bag of previous-gen units, so I will continue investigating.

newguy · Joined: 24 Jun 2004 Posts: 1907

Throw in a little extra delay after the WR bit falls. Curious if a bit of extra time makes any difference.

That said, I do think that your plan to only erase and rewrite memory to the end of the new image, instead of to the end of the program space, will probably get rid of the issue.

temtronic · Posted: Fri Feb 15, 2019 9:15 am

Any chance it's a 'marginal' power supply issue ?
gremlins and gators are NOT fun..

canadidan · Joined: 13 Feb 2019 Posts: 24

@newguy

I will explore additional delay! Currently, I'm running a test with the erase routine completely bypassed, to see if it is even the source of the issue.

@temtronic

The 3V3 has been really good in our design - I've done MTBF testing for weeks straight, with noisy stepper drivers running constantly, without a single reset.

We have all decoupling caps, placed close to the pins.

Do you feel we should take additional steps to improve this?

Ttelmah · Joined: 11 Mar 2010 Posts: 19482

Are you _sure_ about the ESR of your Vcap capacitor? This is a parameter
that can cause really silly errors. I had a whole 'batch' of similar chips that
destroyed their IDs when writing to the program memory as simulated
EEPROM. It turned out the people assembling the boards had used a
substitute for this capacitor. It appears that the write does impose some
exceptional spikes on this line....

canadidan · Joined: 13 Feb 2019 Posts: 24

@Ttelmah

That's really interesting and helpful! I hadn't thought about that actually. This aspect of the design pre-dates my inheritance of the project.

From the datasheet:

We are using a CL21X106MOQNNNE from Samsung:
* 16V
* 10uF
* ESR below 1 Ohm
* Trace length approx. 2mm

You made me think that maybe the MLCC shortage had forced me to order alternates. After checking my order history, this is not the case. I've always ordered the same part. Our design has gone through 2 generations / 4 revisions, and each time I've ordered new batches of components. Surely QC isn't that bad.. but perhaps?

Since it is worth investigating, I'll order a variety of 4.7uF and 10uF caps from other vendors with low ESR and compare.

Ttelmah · Joined: 11 Mar 2010 Posts: 19482

Other thing is if the failing chips are all from one batch, you may
simply have faulty chips. Does happen...

canadidan · Joined: 13 Feb 2019 Posts: 24

Occam's razor, right..

Batch Date Codes

For units built since last year, they all have the same two date codes with equal failures from each:

3x 1811BTU / 3x 1823JJ4

Corruption has occurred in both.

I found some very early prototypes from 2014/2015 - I'll subject those chips to the same test and see.

Testing

Here is what has been tested so far:

* Original code: corrupts Device ID (25 cycles)
* Wait for WR: corrupts Device ID (106 cycles)
* Skip the erase: inconclusive (didn't fail after 96 cycles)

Here is what's to come:

* Try different VCap from other vendors
* Try original code on 2014 and 2015 batch ICs

Thanks to all of you for the help; I personally really appreciate it.

dexta64 · Joined: 19 Feb 2019 Posts: 11

I cannot confirm the GND connections of capacitors C21, C22 and C24. They're not connected. Can you measure them?

If you have a switching regulator, ESR will be a serious problem for you.

Check the PCB of a hard disk. There are two motors running at all times while it does its job. That is why the PCB has to be designed correctly.

Also, see the PIC24HJXXXGPX06/X08/X10 Family Silicon Errata and Data Sheet Clarification, page 15:

"32. Module: Device ID Register

On a few devices, the content of the Device ID register can change from the factory programmed default value immediately after RTSP or ICSP™ Flash programming. As a result, development tools will not recognize these devices and will generate an error message indicating that the device ID and the device part number do not match. Additionally, some peripherals will be reconfigured and will not function as described in the device data sheet."