• Hello,

    I'm calling Bangle.http(...) from the Bangle every 1 second. After the memory leak on android has been resolved, it works much more stable now, however, I regularly run into another issue:

    The Bangle sometimes runs into a strange state, where Bangle.http(...) systematically raises the "Timeout" exception, although the https server definitively never times out and returns immediately. The bangle seems to get into this state especially, when the bangle and the android phone automatically reconnect after a connectivity loss. It is a spurious problem and in most automatic reconnects, all is fine. Whenever the system is in the erroneous state, a manual disconnect and then connect from within gadgetbridge (global android Bluetooth can remain enabled during the reconnect) always fixes the issue for some time.

    My suspicion is that after the automatic reconnect, some scrambling on the Bluetooth UART link occurs occasionally. Maybe there is something, which is reinitialized when manually reconnecting, but not when automatically reconnecting?

    I have looked at:

    adb logcat|grep gadgetbridge|grep UART
    

    Whenever all is fine, I see repeating (personal data removed from the request and the response):

    06-10 01:59:31.378 28485 28503 I nodomain.freeyourgadget.gadgetbridge.ser­vice.devices.banglejs.BangleJSDeviceSupp­ort: UART RX LINE: {"t":"http","url":"https://...","id":"79­638536497"}
    06-10 01:59:31.386 28485 28503 I nodomain.freeyourgadget.gadgetbridge.ser­vice.devices.banglejs.BangleJSDeviceSupp­ort: UART RX JSON parsed successfully
    06-10 01:59:31.508 28485 28485 I nodomain.freeyourgadget.gadgetbridge.ser­vice.devices.banglejs.BangleJSDeviceSupp­ort: UART TX: GB({t:"http",id:"79638536497",resp:"{...­)
    06-10 01:59:32.351 28485 28502 I nodomain.freeyourgadget.gadgetbridge.ser­vice.devices.banglejs.BangleJSDeviceSupp­ort: UART RX LINE: 
    06-10 01:59:32.382 28485 31310 I 
    

    When it is broken, I see this kind of line regularly:

    05-25 22:09:43.726 11962 12934 I nodomain.freeyourgadget.gadgetbridge.ser­vice.devices.banglejs.BangleJSDeviceSupp­ort: UART RX line started with 13 - ignoring
    

    The 13 seems to be random, it changes. The personal data reported in the RX/TX debug lines is still there, but it looks incomplete and random parts are missing.

    Thank you in advance for letting me know what you think.

  • That's odd - char code 13 is carriage return, which could sometimes be sent for a new line and might make some sense, but you say it can be other chars? Do you have any other examples of the ones that are sent?

  • Hello Gordon,

    thanks a lot for your reply!

    Yes, there can be others, here is a list of the ones which I have seen so far: 13,32,61,62,105

  • Thanks - interestingly that's:

    • 13 carriage return
    • 32 space
    • 61 =
    • 62 >
    • 105 i

    So not entirely random - > and = usually happen when a command has been sent to the Bangle without the starting \x10 character which tells it to 'be quiet' - so it's like if you were typing a command on the REPL in the IDE.

    How big is the webpage you're requesting? It could be it's filled the input buffer for Bangle.js and there's been some kind of overflow - usually there should be flow control but maybe something happened for that connection and it wasn't working...

    Also, just to check - are you running every 1 sec to try and stress test this, or is that what you expect to do normally? Doing one HTTP request a second is going to destroy your battery

  • Thanks - that is really interesting and a very important insight that these chars actually have a meaning.

    The webpage is a json string from my internal home automation system and contains <2000 ascii chars.

    For better reproducibility and to check, whether the size plays a role, I'm now making a test with the "Hello World!" test server https://pur3.co.uk/hello.txt and I'll report whether the issue also occurs with this very short string.

    I have already realized that it's a battery issue, my battery life is around 2 days with 1 request/sec. I'll leave it this way for stress testing, but you are right and I should decrease the rate in favor of battery life at some point...

  • Just a short update:

    Within the error handler, I had been calling Console.log(... and very likely, the ignored characters were coming from that and your suspicion that they are IDE characters makes a lot of sense. The characters actually have no direct relation to the issue of running into the timeout error with no https server timeout. I have removed the Console.log now.

    After removing that, it's not only that the IDE characters are gone, I surprisingly also observe that the overall stability improved a lot and the timeout issue happens more rarely now. It still does happen though, but it is much harder to reproduce now.

    Does it make sense that printing debug output may somehow confuse the communication with gadgetbridge and one should just avoid printing any human readable stuff there?

  • Ahh - something like this just came up in https://forum.espruino.com/conversations­/387533/#comment17018579

    But basically if you're using the normal Android Integration for HTTP then you should be ok, since that sends a newline before anything it sends back.

    The other issue is if you're sending a lot of data to the console then that could cause issues too, because Bangle.js will pause while waiting for that data to be sent (if you send enough the output buffer gets full) and then it won't be able to process the data coming in

  • Thanks for the link! That is interesting information.

    However, I have indeed always been using the official Android Integration, which should handle this automatically. Here are some new insights, which have a learned in the meantime:

    • The strange characters are definitively gone after removing Console.log(). The stability improved, but the issue definitively still occurs.
    • The timeout issue can also be reproduced with just the Hello World server, calling it once per second.
    • I have been calling E.getErrorFlags() regularly to see if an error gets set at the time when the issue occurs. This was not the case. Is it correct that in this way I can exclude that I get issues due to too much data?
    • When the timeout issue on the Bangle is present, I can still see in adb that all requests are properly executed by gadgetbridge and the the responses are in the adb logs too.
    • While the timeout issue is present, I have made an interesting reproducible observation: When I press the "Find Bangle" button in gadgetbridge, nothing happens. On the contrary, when I activate "Find phone" on the Bangle, the phone rings. Under normal conditions, both ways work, but with the issue being present, the direction gadgetbridge->Bangle is not working.

    The last observation makes me think, whether it is possible that in my case the UART is sometimes somehow down completely, but only in one direction. Does this make any sense? The bangle goes to timeout because it never gets the response, although gadgetbridge has the response and thinks that it is sending it. But I'm not sure how to track this down further.

  • I have been calling E.getErrorFlags() regularly to see if an error gets set at the time when the issue occurs. This was not the case. Is it correct that in this way I can exclude that I get issues due to too much data?

    Yes, I think that's fine then.

    While the timeout issue is present, I have made an interesting reproducible observation:

    Thanks, that is interesting - so as you say it looks a bit like Gadgetbridge isn't able to send any information to the Bangle. In your Gadgetbridge log, do you see any messages about XON or XOFF?

    It feels like there are two possibilities:

    • When connecting, the 'write' characteristic can't be found, and so Gadgetbridge can't write at all
    • Somehow the flow control system has got stuck and it believes the Bangle doesn't have any space for more data (so it's queuing up all the writes).

    You mentioned this happened after an automatic reconnect. Do you think it's safe to say that after that reconnect it doesn't work at all - or that it was working and then it broke.

  • Hi Gordon,

    thanks for your quick reply!

    I will connect it again to adb and watch out for XON/XOFF or something about the write characteristic. I'll let you know.

    Yes, after about 20% of automatic reconnects it doesn't work at all and can only be restored by a manual reconnect. Sometimes it is necessary to wait for 30s in the disconnected state before reconnecting and sometimes gadgetbridge even needs to be restarted. If the issue appears after a reconnect, the timeouts start to happen immediately after the reconnect, there is no time in between, where it works.

  • Can you confirm that the bug exists on other phone/device?

  • I had checked already with another device and I have seen a similar behaviour. These are the devices which I am using:

    • Blackview A80
    • Samsung Galaxy A22

    What I haven't checked on the Samsung yet, is whether the find bangle is also not working. I'll let you know.

  • I think, I have made an interesting observation:

    I'm greping now for XON, XOFF, UART and 'characteristic write' in adb logcat.

    When everything is working fine, I can see many lines of 'characteristic write' writing to the UART. In between that, I see some flow control messages sending XOFF and then XON quite quickly after that.

    So far, the timeout issue has reproduced itself only once while adb was connected. In that case, I saw XOFF being sent as one of the last messages with no XON following. The lines 'characteristic write' stop after that, the Bangle goes to timeout, but the http requests are still handled and logged by gadgetbridge.

    I'm wondering whether the issue is that in case that a bluetooth connection loss falls exactly into the time between XOFF and XON, the XON gets lost and this is what freezes the communication gadgetbridge -> bangle. The XON is possibly never repeated. The larger the data volume, the more flow control kicks in and the more likely it is to run into this issue. Maybe a possible solution would be to send XON automatically on every fresh bluetooth (re-)connect?

    Another observation: I'm printing the exception message of Bangle.http(...) to the display and when the Bluetooth goes out of range, I always get a "Timeout" exception first, which changes after a few seconds to "Bluetooth not connected". So it seems that the Bangle needs some seconds to understand that the Bluetooth connection has been fully lost, it doesn't realize it instantaneously. I'm wondering whether this could be the moment, during which XON gets lost.

  • a bluetooth connection loss falls exactly into the time between XOFF and XON

    Well, it could be, but Gadgetbridge should automatically turn flow control back on when it disconnects, connects, or attempts a reconnect - for example: https://codeberg.org/Freeyourgadget/Gadg­etbridge/src/branch/master/app/src/main/­java/nodomain/freeyourgadget/gadgetbridg­e/service/btle/BtLEQueue.java#L312-L315

    Do you see a handleDisconnected in your log? And where is it relative to XON/XOFF? It might be handy if I could see your log file.

    The only thing I can think is that maybe there was a Bluetooth packet that was received with XOFF in it, then the Bangle disconnected, and then the packet was processed after the Bangle reconnected?

    Maybe a possible solution would be to send XON automatically on every fresh bluetooth (re-)connect?

    As above, I'd have hoped Gadgetbridge did that automatically, but...

    That would be an easy one to try actually, you could just edit the connect event at https://github.com/espruino/BangleApps/b­lob/master/apps/android/boot.js#L253 to send Bluetooth.print("\x11"); and see if that helps.

    So it seems that the Bangle needs some seconds to understand that the Bluetooth connection has been fully lost

    Yes, that is the case - but I think that's not a problem... The issue isn't with Bangle.js, it's on the Gadgetbridge side by the look of it.

    One potential option is to ensure that Gadgetbridge doesn't even pause transmission forever. In the code I linked mPauseTransaction is a boolean, but we could just make it a counter so when XOFF is received, we pause transmission for maybe 2 seconds.

    It might lead to some data loss when Bangle.js was busy for a very long time (compacting for example), but in the majority of cases that'd probably a lot better than the alternative of it locking up.

  • Post a reply
    • Bold
    • Italics
    • Link
    • Image
    • List
    • Quote
    • code
    • Preview
About

Occasional scrambling of data on Bluetooth UART after automatic reconnects?

Posted by Avatar for user155558 @user155558

Actions