Strange clock issue

Posted on
  • I have a number of Pico based units running 24x7 out in the field and have observed a few times in the last few months the RTC suddenly seeming to speed up.

    I finally managed to catch this while logging the serial console on a unit - the sequence was:

    • first sign of issues was the read process (a write verify) from the SPI EEPROM failed (read all zeros or FF's)
    • Because the clock was going too fast, this was generating timing errors in my code, which I was logging (write to EEPROM - which did appear to work!)
    • the Pico then started generating memory errors during SIM800 upload proess and restarted
    • on restart the unit uploaded to my server the logged error events, which had advanced timestamps on them
    • all the timestamps came from the standard espruino Clock module (based on getTime())
    • the restart rectified the problem

    The whole crash/restart process took around 30seconds (this is based on external timestamps coming from a SIM800), however my logged events (based on getClock()) over the 30second period went in a range of 18mins to 15hours into the future !??

    This particular unit is running on LSI, but I am reasonably sure it has happened on a Pico we put on an external crystal (to be confirmed)

    Is this even possible ??? It just seems that the RTC just suddenly started running around 1800x faster.

    Any ideas appreciated - it would be nice to be able to detect and fix this rather than just crash.

  • Wow, that is a strange one - which firmware version are you using?

    Do you ever use setTime, or do you just let the Pico's timer run?

    The Pico (and all the STM32-based Espruino boards - other non-Espruino boards don't) run with a software timer based off the system cycle counter, but synchronised to the RTC (to allow high accuracy timing for). If the RTC changed unexpectedly I guess it's possible that the timer could suddenly speed up massively to try and smoothly update to the new RTC time. It's great that you've been able to log it though.

    Do you know exactly which RTC times it happened between? Were they similar times for each unit? If it turns out the timestamp was at something that might have caused an overflow, it might help to track down what happened.

    Potentially I can do you a build for the Pico without the software timer, and we can see if that makes any difference for you?

    It's a pretty odd one though - I've been running a Pico working as a timer here for well over a year non-stop, and haven't hit any issues - but then maybe I'm not doing anything as time-critical as you are.

  • Yep, pretty strange.

    • No I never use setTime() as I was worried about causing an issue like this :)
    • I have observed on 1.87 and 1.92
    • Unfortunately I don't know what the getTime() value was at the time of the crash as I was not logging that to the console. Working backwards it would have been around 949245868 at restart time after the crash.

    To keep absolute time, my code regularly queries the local time from the SIM800, then each time I instantiate a new Clock object. This then keeps things accurate for Pico we have not yet put on the crystal.

    The clock speeding up would explain why my EEPROM read failed at that time also. In my code I do a EEPROM write and then immediately follow with a read to verify. In the AT25 code there is a loop using getTime() to wait for the EEPROM write period. So if this is essentially skipped then the read would fail.

    We never generally set the clock (using setTime) so in all our units it would be pretty random - unless it gets set by the IDE at programming time ?

  • If you've checked the option in the IDE, whenever you upload the RTC would be set (but it isn't the default).

    How accurate do you think the 949245868 figure would be? Accurate to a few seconds, or could it be several days out? It seems to be Sun Jan 30 2000 15:24:42 which doesn't seem to be near anything very specific.

    Are you using setDeepSleep? And do you sleep for very long periods of time? I guess it's possible there is some problem around the RTC which I haven't picked up, as when I've run boards for very long periods of time it's generally been when they've been network or USB-connected all the time, so power consumption wasn't an issue.

    I'm not sure what to suggest... I can try setting a board up to detect if/when the RTC glitches (doing a setWatch on a PWM output should work) and log the output, so I can see if there are any problems - however that's likely to take a long time to show anything up.

    In the mean time I could try and do you a build without the fancy RTC - in which case setWatch's timestamps would only be accurate to around 32kHz... But if that'll work for you it might be a good solution?

  • Ok, If you try this build:Ā­/espruino_1v93.16_pico_1r3_lores_rtc.binĀ­

    I'd be interested to see if you still have the problems - it's identical to the normal one apart from the software RTC follower, which has been removed.

  • Great thanks for that. I will run this on some units to see if it makes a difference - might take a while to determine if there is a difference compared to standard firmware. I will also try to add some logging to capture the RTC time if I can detect it happens again.

    I do use deepsleep, but only when the power is disconnected, which should not have been the case. When in deepsleep it will be waking up every 4s, so not asleep long.

  • Interestingly I just checked my IDE settings and I do have the 'Set current time' option turned on.

    So that makes the Sun Jan 30 2000 15:24:42 time that was in the unit after crash seem not right ? As I am pretty sure the unit was never powered down after the last software was loaded onto the pico.

    I had loaded the new software 10 days earlier in this case.

  • Wow ok, now that is strange. I wonder whether there's been some kind of overflow somewhere in the RTC - if it had suddenly jumped back in time then that would probably explain the software follower's speed changing drastically.

    I don't suppose you ever logged getTime() anywhere before this happened?

    Having looked into the code, I have a theory as to what might have happened. On the STM32F4 there's a proper hardware RTC which includes day/month/year - which makes getting the actual timestamp out of it kind of painful.

    It looks like I hadn't taken account of the fact that the month is 0-based in Espruino's Date implementation, but 1-based in the RTC itself. It means that 'January' is actually 0, which is out of bounds, so if the date was Jan 30, that'd indicate some kind of RTC error.

    But even without that, if the date were moving between months and Espruino thought the current month had 31 days when the hardware thought it had 30, that'd cause a massive jump.

    I've fixed this now, so the latest travis builds will hopefully be better. I'm amazed this never got spotted before though - I guess the F1 and nRF52-based parts don't have the issue because their RTC is just a counter, without the whole calendar part in it.

  • Unfortunately I have no logs of getTime(). However on my debug unit, where the IDE 'Set current time' option was turned on - it crashed after 10 days on 31/07/2017 at 5:18pm (NZ time)

    In my application, I resync the time to the SIM800 time (new Clock object) every 90s. So it is possible that my app would only crash if the clock step back happened when my code was actually doing something that relies on getTime() before the next SIM800 re-sync.

  • I think that must be it - 31/07/2017 is the last day of the month again, and June has 30 days, July has 31. Probably when the day flips over it starts running as quick as it can to try and catch up, but even so it might take a few hours of fast-running.

    Sorry about that - I don't know how that slipped through, or why it hasn't caused issues with the Picos that I've left running for long periods. The latest travis builds should have the fix for it in though, if you want to try those?

  • Its interesting, now I know it is a monthly issue, I can see strange things happening to our units on a monthly basis - although I cant exactly explain what is going on ie I would expect my software to crash completely, there definitely seems to be a correlation.

    We need to put some more units out into the field in a few days or so time. Which build would you recommend we put on (given we don't have a months testing time) - the custom one you created or a travis build ?

  • I'd go for the latest Travis build - if you can I'd still test for a day or two with it, just to make sure - but it should be significantly better!

  • Post a reply
    • Bold
    • Italics
    • Link
    • Image
    • List
    • Quote
    • code
    • Preview

Strange clock issue

Posted by Avatar for jonreid @jonreid