Author |
Message |
canadidan
Joined: 13 Feb 2019 Posts: 24
|
[PIC24HJ256GP210] Device ID becomes corrupted |
Posted: Wed Feb 13, 2019 10:36 am |
|
|
Research
When I search "device ID wrong", I find the following common causes:
• Issue with pulling MCLR low/high
• User trying to power PIC with programmer
• Other wiring issue with ICSP pins
These problems prevent people from programming the PIC. My problem is different.
My Issue
A perfectly working device will suddenly change identity - the device ID will change, but it will continue to function otherwise.
For this PIC, the device ID is 0x0073. After the corruption, it is 0x00FF, corresponding to a dsPIC33FJ256GP710.
Furthermore, if I tell my programmer the chip is a dsPIC33FJ256GP710 (not the PIC24HJ256GP210 it actually is), I can still load new software to the PIC's flash and run it (as shown above).
Cause
This has only occurred while running a custom "firmware upgrade" procedure: the new firmware is received over UART, stored in an external flash, the CRC is checked, then it is copied to the PIC flash using write_program_memory();
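The "check the CRC before copying" step can be sketched as below. The thread does not say which CRC is used, so CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF) is an assumption, and `crc16_ccitt` and `image_is_valid` are hypothetical names, not CCS library functions.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF, no reflection.
   The CRC variant is an assumption; the thread does not name one. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Returns 1 only when the stored image matches the CRC sent with it. */
int image_is_valid(const uint8_t *image, size_t len, uint16_t expected_crc)
{
    return crc16_ccitt(image, len) == expected_crc;
}
```

The point of the gate is that nothing in the PIC flash is erased or written unless `image_is_valid()` returns 1 for the copy held in external flash.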
Frequency
I have run this firmware upgrade routine on over 40 units and in excess of 500 times. This corruption has occurred on a total of 5 units.
Of them, 2 units were corrupted the first time I ran the upgrade procedure. The other 3 had been upgraded dozens of times before eventually being corrupted (also during/after this upgrade procedure).
Steps tried so far to recover these devices
As shown above, I first tried reading the Device ID from the device itself following these instructions:
https://www.ccsinfo.com/forum/viewtopic.php?t=43278
Then, I tried overwriting this area of flash (using both ASM and C code methods), but that hasn't worked. I suspect those addresses are protected.
Has anyone seen this before, and how could I correct the device ID?
Thank you,
Dan
Last edited by canadidan on Wed Feb 13, 2019 2:24 pm; edited 2 times in total |
|
|
temtronic
Joined: 01 Jul 2010 Posts: 9278 Location: Greensville,Ontario
|
|
Posted: Wed Feb 13, 2019 10:55 am |
|
|
OK, I don't have that PIC ...
but
any chance the programming cable length is the 'random' issue ?
or..
any chance the 5 bad ones have the same date/batch code info ?? |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19601
|
|
Posted: Wed Feb 13, 2019 11:13 am |
|
|
Device IDs are 16-bit values, not 8-bit values.
0xFF is not the ID for the DsPIC33HJ256GP710. Its ID is 0x7FF.
0xFF, suggests the area has been erased. 0xFF is generically what the
memory erases to.
How are you reading this? You talk about an on board programming
system. Is this being read internally from the chip or using a programmer?
The device ID is protected, but can be destroyed if there is a power
spike during programming.
Normally a device ID failure with an external programmer is a connection
problem: too much capacitance on a line, a bad connection, or incorrect
supplies. You'll get this if the Vcap signal is not being generated while
programming.
If the device ID is destroyed, it can't be rewritten. |
|
|
canadidan
Joined: 13 Feb 2019 Posts: 24
|
|
Posted: Wed Feb 13, 2019 12:55 pm |
|
|
@temtronic
We have 4 CCS programmers, with a variety of custom cables and a pogo-pin fixture for automated programming. Once the device ID is corrupted, it is repeatable across programmers.
I've checked 3 close to me, and the date/batch codes are varied:
2x 1811BTU
1x 1823JJ4
@Ttelmah
The on-board programming system does the following:
1. Erases a defined "application" region in flash:
Code: |
void BIOS_Erase_CPU_Flash_FMWR_Sector(){
   unsigned int32 address_erase;
   // Erase the application region one flash erase page at a time.
   for(address_erase=0x00400; address_erase<BIOS_ADDR; address_erase+=(getenv("FLASH_ERASE_SIZE")/2))
      erase_program_memory(address_erase);
}
|
2. Uses write_program_memory() to write to this same application region.
Reading Device ID
I have read the device ID 3 ways:
* Using CCSLOAD
* Using the assembly code from the post I linked in OP
* Using the read_program_memory() function
CCSLOAD
Assembly
Code: |
#asm
mov #0x00FF, W0
mov W0, TBLPAG
mov #0x0000, W1
tblrdl [W1],W0
mov W0, devid
tblrdh [W1],W0
mov W0, devid+2
mov #0x0002, W1
tblrdl [W1],W0
mov W0, rev
tblrdh [W1],W0
mov W0, rev+2
#endasm
printf("devid = 0x%LX, rev = 0x%LX\r\n", devid, rev);
|
C code
Code: |
unsigned int8 mem_buffer[2];
read_program_memory(0x00FF0000,mem_buffer,2);
unsigned int16 id = mem_buffer[0] | ((unsigned int16)mem_buffer[1] << 8);
printf("devid = 0x%LX\r\n", id);
|
Voltages / Connections / VCap
This failure has never happened during ICSP programming so I'm not sure how it could be a programmer/VCap issue.
Also, these devices program successfully when I lie about the target device type.
Could erase_program_memory() somehow erase the Device ID too, despite protection? |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19601
|
|
Posted: Wed Feb 13, 2019 1:46 pm |
|
|
There is an issue with a very close member of the PIC family, where the
write 'stall' does not function correctly. It is vital that interrupts are
disabled during a program memory write, and on chips with this problem
the code should poll the WR bit to verify the write has completed.
Worth ensuring you have interrupts disabled, and add this check after
the write.
Devid 0x00FF is actually a dsPIC33FJ256GP710. FJ, not the HJ.
It should be impossible to write the DEVID except by damaging the
cells. |
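As a small illustration of the two IDs discussed so far, the sketch below maps a 16-bit DEVID value to a part name. Only the two values quoted in this thread are included; `devid_name` and `devid_is_intact` are hypothetical helper names, and any other value is simply reported as unknown.

```c
#include <stdint.h>

/* Device IDs quoted in this thread. Anything else is reported as unknown. */
#define DEVID_PIC24HJ256GP210   0x0073u
#define DEVID_DSPIC33FJ256GP710 0x00FFu

const char *devid_name(uint16_t devid)
{
    switch (devid) {
    case DEVID_PIC24HJ256GP210:   return "PIC24HJ256GP210";
    case DEVID_DSPIC33FJ256GP710: return "dsPIC33FJ256GP710";
    default:                      return "unknown";
    }
}

/* 1 when the ID read back is still the value expected for this board's chip. */
int devid_is_intact(uint16_t devid)
{
    return devid == DEVID_PIC24HJ256GP210;
}
```

A check like `devid_is_intact()` run at boot, or after each in-application flash write, would show the moment the ID changes rather than leaving it to be discovered at the next programmer session.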
|
|
canadidan
Joined: 13 Feb 2019 Posts: 24
|
|
Posted: Wed Feb 13, 2019 2:22 pm |
|
|
@Ttelmah
My mistake - couldn't remember if it was FJ or HJ.
Interrupts
Before it initiates the erase/write to flash, it disables interrupts:
Code: | Disable_Interrupts(INTR_GLOBAL); |
Should I do anything in addition to this?
Poll WR bit
This is good to know, and I will implement this! It may be difficult to test in the short term, but I will return with the long-term results.
Code: |
void BIOS_Erase_CPU_Flash_FMWR_Sector(){
   unsigned int32 address_erase;
   for(address_erase=0x00400; address_erase<BIOS_ADDR; address_erase+=(getenv("FLASH_ERASE_SIZE")/2))
   {
      erase_program_memory(address_erase);
      // Wait for the WR bit (NVMCON bit 15) to clear before the next erase.
      while(bit_test(NVMCON, 15));
   }
}
|
where
Code: |
#WORD NVMCON = 0x0760
|
Edit: I had the wrong address for NVMCON, I took 0x0728 from here: https://www.ccsinfo.com/forum/viewtopic.php?t=54366
But from the datasheet it is 0x0760. |
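One possible variation on the WR poll above is to bound the wait, so that a stuck WR bit is reported instead of hanging the upgrade. This is a plain-C sketch rather than CCS-specific code: `wait_for_wr_clear` and `max_polls` are hypothetical names, and how the NVMCON address is passed in depends on the compiler.

```c
#include <stdbool.h>
#include <stdint.h>

#define NVMCON_WR_BIT 15u  /* WR is bit 15 of NVMCON, as polled above */

/* Polls the WR bit until it clears or max_polls reads have been made.
   Returns true if WR was seen clear, false on timeout. */
bool wait_for_wr_clear(volatile const uint16_t *nvmcon, uint32_t max_polls)
{
    while (max_polls-- > 0) {
        if ((*nvmcon & (1u << NVMCON_WR_BIT)) == 0)
            return true;
    }
    return false;
}
```

A false return gives the caller a chance to abort the upgrade and log the failure before any further erase or write is attempted.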
|
|
canadidan
Joined: 13 Feb 2019 Posts: 24
|
|
Posted: Fri Feb 15, 2019 8:18 am |
|
|
Here are the results of my testing, with the above changes:
Test setup
I created a "firmware upgrade stress test" routine - which continually runs our firmware upgrade method over UART until it fails.
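The structure of such a stress test can be sketched roughly as follows. This is not the actual routine: `stress_upgrade` and the callback types are hypothetical, and the `sim_*` functions only simulate a device whose ID flips from 0x0073 to 0x00FF on the fourth upgrade so that the loop can be tried on a PC.

```c
#include <stdint.h>

typedef int      (*upgrade_fn)(void);     /* returns nonzero on success */
typedef uint16_t (*read_devid_fn)(void);  /* returns the current device ID */

/* Runs upgrade cycles until an upgrade fails, the device ID no longer
   matches, or max_cycles is reached. Returns the number of cycles that
   completed with the ID still intact. */
uint32_t stress_upgrade(upgrade_fn do_upgrade, read_devid_fn read_devid,
                        uint16_t expected_devid, uint32_t max_cycles)
{
    uint32_t good = 0;
    while (good < max_cycles) {
        if (!do_upgrade())
            break;
        if (read_devid() != expected_devid)
            break;
        good++;
    }
    return good;
}

/* Simulated device for host testing only: the ID changes on the 4th upgrade. */
uint32_t sim_cycles = 0;
int sim_upgrade(void) { sim_cycles++; return 1; }
uint16_t sim_read_devid(void) { return (sim_cycles >= 4) ? 0x00FF : 0x0073; }
```

Checking the ID after every cycle, instead of only at the end, is what makes the cycle counts reported below meaningful.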
Test without checking WR
The device ID was corrupted after 25 cycles (date code 1823JJ4)
Test with checking WR
The device ID was corrupted after 106 cycles (date code 1823JJ4)
What does this mean?
With such a low sample size, it doesn't mean very much - except that the fix wasn't sufficient. Maybe it delayed it, or maybe silicon variances just delayed the inevitable.
Next steps
I will modify the erase code to stop at the end address of the newly provided firmware, rather than erasing all of program space. It means old code might be left behind, but it is safer than wiping straight to the end and possibly damaging the device ID.
I have a whole bag of previous-gen units, so I will continue investigating. |
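A minimal sketch of the "stop at the end of the new image" calculation, assuming addresses are program-memory addresses and that one erase page spans 0x400 of them (the value `getenv("FLASH_ERASE_SIZE")/2` is assumed to give on this part). `erase_end_address` and `erase_page_count` are hypothetical helpers, and the region end stands in for `BIOS_ADDR`, whose value is not given in the thread.

```c
#include <stdint.h>

/* Rounds image_end up to the next erase-page boundary, but never past
   region_end. page_size is the address span of one erase page and must
   be nonzero. */
uint32_t erase_end_address(uint32_t image_end, uint32_t page_size,
                           uint32_t region_end)
{
    uint32_t rounded = ((image_end + page_size - 1) / page_size) * page_size;
    return (rounded < region_end) ? rounded : region_end;
}

/* Number of page erases needed to cover [start, end). */
uint32_t erase_page_count(uint32_t start, uint32_t end, uint32_t page_size)
{
    return (end > start) ? (end - start + page_size - 1) / page_size : 0;
}
```

For example, with a start of 0x400, a page span of 0x400 and an image ending at 0x5234, the erase loop would stop at 0x5400 after 20 page erases instead of running all the way to the end of the application region.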
|
|
newguy
Joined: 24 Jun 2004 Posts: 1912
|
|
Posted: Fri Feb 15, 2019 8:23 am |
|
|
Throw in a little extra delay after the WR bit falls. Curious if a bit of extra time makes any difference.
That said, I do think that your plan to only erase and rewrite memory to the end of the new image, instead of to the end of the program space, will probably get rid of the issue. |
|
|
temtronic
Joined: 01 Jul 2010 Posts: 9278 Location: Greensville,Ontario
|
|
Posted: Fri Feb 15, 2019 9:15 am |
|
|
Any chance it's a 'marginal' power supply issue ?
gremlins and gators are NOT fun.. |
|
|
canadidan
Joined: 13 Feb 2019 Posts: 24
|
|
Posted: Fri Feb 15, 2019 10:16 am |
|
|
@newguy
I will explore additional delay! Currently, I'm running a test with the erase routine completely bypassed, to see if it is even the source of the issue.
@temtronic
The 3V3 has been really good in our design - I've done MTBF testing for weeks straight, with noisy stepper drivers running constantly, without a single reset.
We have all decoupling caps, placed close to the pins.
Do you feel we should take additional steps to improve this? |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19601
|
|
Posted: Fri Feb 15, 2019 12:41 pm |
|
|
Are you _sure_ about the ESR of your Vcap capacitor? This is a parameter
that can cause really silly errors. I had a whole 'batch' of similar chips that
destroyed their IDs when writing to the program memory as simulated
EEPROM. It turned out the people assembling the boards had used a
substitute for this capacitor. It appears that the write does impose some
exceptional spikes on this line.... |
|
|
canadidan
Joined: 13 Feb 2019 Posts: 24
|
|
Posted: Fri Feb 15, 2019 1:50 pm |
|
|
@Ttelmah
That's really interesting and helpful! I hadn't thought about that actually. This aspect of the design pre-dates my inheritance of the project.
From the datasheet:
We are using a CL21X106MOQNNNE from Samsung:
* 16V
* 10uF
* ESR below 1 Ohm
* Trace length approx. 2mm
You made me think that maybe the MLCC shortage had forced me to order alternates. After checking my order history, this is not the case. I've always ordered the same part. Our design has gone through 2 generations / 4 revisions, and each time I've ordered new batches of components. Surely QC isn't that bad.. but perhaps?
Since it is worth investigating, I'll order a variety of 4.7uF and 10uF caps from other vendors with low ESR and compare. |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19601
|
|
Posted: Fri Feb 15, 2019 1:57 pm |
|
|
Other thing is if the failing chips are all from one batch, you may
simply have faulty chips. Does happen... |
|
|
canadidan
Joined: 13 Feb 2019 Posts: 24
|
|
Posted: Fri Feb 15, 2019 2:22 pm |
|
|
Occam's razor, right..
Batch Date Codes
For units built since last year, they all have the same two date codes with equal failures from each:
3x 1811BTU / 3x 1823JJ4
Corruption has occurred in both.
I found some very early prototypes from 2014/2015 - I'll subject those chips to the same test and see.
Testing
Here is what has been tested so far:
* Original code: corrupts Device ID (25 cycles)
* Wait for WR: corrupts Device ID (106 cycles)
* Skip the erase: inconclusive (didn't fail after 96 cycles)
Here is what's to come:
* Try different VCap from other vendors
* Try original code on 2014 and 2015 batch ICs
Thanks to all of you for the help; I personally really appreciate it. |
|
|
dexta64
Joined: 19 Feb 2019 Posts: 11
|
|
Posted: Tue Feb 19, 2019 2:26 pm |
|
|
I cannot confirm the GND connections of capacitors C21, C22 and C24. They're not connected. Can you measure them?
If you have a switching regulator, ESR will be a serious problem for you.
Check the PCB of a hard disk: there are two motors running and doing their job at all times. That is why the PCB has to be designed correctly.
Also, see the PIC24HJXXXGPX06/X08/X10 Family Silicon Errata and Data Sheet Clarification, page 15:
"32. Module: Device ID Register
On a few devices, the content of the Device ID register can change from the factory programmed default value immediately after RTSP or ICSP™ Flash programming. As a result, development tools will not recognize these devices and will generate an error message indicating that the device ID and the device part number do not match. Additionally, some peripherals will be reconfigured and will not function as described in the device data sheet" |
|
|
|