CODE AND STORAGE OVERLAP while building firmware for NRF52

Posted on
Page
of 2
Prev
/ 2
  • Seems to be online, but for example selecting the MDBT42Q just hangs there....
    http://www.jumware.eu:88/efeu

  • tried today and it was slow but produced something with MDBT42Q config, however this board has 2600 variables by default and when clicking in various tabs I saw VARIABLES: 0 somewhere.

  • I tried to mess with location and size of variables and currently I have at least small improvement that they are automatically sized in free space between end of heap and beginning of stack. Linker provides following symbols: __HeapStart,__HeapLimit,__StackLimit so if stack is sized properly variables can be between __HeapLimit and __StackLimit (heap is in fact zero so __HeapStart =__HeapLimit but if we'd like to have some heap in future HeapLimit works better).

    I have added stack painting with DEADBEEF to nrf startup and initially set stack size to 0x2600 which gives similar size of variables than before (2650) but it looks that when running my typical code in DS-D6 the stack could be much less.
    @Gordon would you be interested to go in this direction? sizing just stack and autosizing variables according to that? And BTW I had to move variable initialization sooner in main as they are probably used without proper jsvInit call but with static allocation it probably doesn't break.

    here is small patch to see the changes https://gist.github.com/fanoush/957202cd­f42d866c046e5ee243fb72e3 for SDK11 based build
    I am building with -DJSVAR_MALLOC -DJSVAR_MALLOC_AUTO

    Stack sizing and those symbols are used here https://github.com/espruino/Espruino/blo­b/master/targetlibs/nrf5x_12/components/­toolchain/gcc/gcc_startup_nrf52.S#L39 and I have put stack painting here after clearing BSS https://github.com/espruino/Espruino/blo­b/master/targetlibs/nrf5x_12/components/­toolchain/gcc/gcc_startup_nrf52.S#L397 with loop taken from this https://devzone.nordicsemi.com/f/nordic-­q-a/3650/nrf51822---exception-on-stack-o­verflow/13253#13253

    and BTW when unzipping Nordic sdk11 zip the file gcc_startup_nrf52.s was named with lower case .s suffix and looks like when it does not end with .S the preprocessor run is skipped so at first I had a lot of head scratching why -D__STACK_SIZE=x had no effect.

    Also it looks like default stack size is 4096 in SDK11 and was raised to 8192 in SDK12 so maybe 8KB 0x2000 should be enough? How is C stack used in JS interpreter with relation to nested js functions, does it use just JS variables or C stack too?

    BTW, Just checked stack in DS-D6 over bluetooth with my SDK11 build

    >for (i=0x20010000-0x2600;i<0x20010000;i+=4) { if (peek32(i)!=0xdeadbeef) break; } ; print(i.toString(16))
    2000f298
    

    so basically only 0x0D68=3432bytes of stack was ever used. Is there some specific espruino code I can try that uses lot of stack? (eg. regexes?)

  • That's interesting - it does look pretty tidy, although it doesn't solve the MTU issue unless you also swap the order of heap and stack? IIRC there's something (_bss?) that you can actually modify at runtime such that if someone tries to use malloc (which was done in a recent bit of code you posted I think) you don't end up allocating over the top of it.

    Honestly right now it's not a big priority to add something like that though....

    Did you find where variables are used without initialisation? That sounds like a pretty worrying bug.

    How is C stack used in JS interpreter with relation to nested js functions, does it use just JS variables or C stack too?

    It's a bit of both. Because it's recursive descent it actually uses the stack pretty heavily so you need quite a bit.

  • Thats just first step to move it out of static array. And worth it even alone since the variable size is just a guesswork now (?) and also depends on size of globals, i.e you add some globals over time (or only with some specific build options - like e.g. some display driver with some buffer) and need to think to shrink variables. With this they are sized automatically to fill all available space.

    The only guesswork that remains is stack, which can be checked/tuned by stack painting and monitoring its size. Is there some stack check now at runtime or it will silently overwrite variables now (and other random globals a bit earlier since in lst file jsVars array is somewhere between globals)? If you don't set stack size explicitly now at build time and compensate for it by shrinking variables by trial and error, that's not ideal.

    As for resizing I agree this is not priority because of MTU allocation and is definitely separate step. In fact it would make more sense for stability and safety first - on stack side. It could try to e.g. shrink variables when stack grows over some safe threshold instead of crashing or corrupting data.

    Overall agree changing size at runtime is not trivial however this first step (= optimal static allocation outside of static variable array declaration) is easy and removes some uncertainty and guesswork.

    As for dynamic heap, that's between __HeapStart,__HeapLimit now and is zero (so default malloc should fail?). As for that long writes code using malloc I linked, I don't know why or if that worked, maybe the guy increased heap at build time with -D__HEAP_SIZE=, that would work for him and would work also with this patch (and shrink variables automatically).

    Was just asking to know if it makes sense to clean it up a bit and send PR for this. However it needs some first default stack size guess (possibly for each SDK,platform) - fortunately this feature can be gradually enabled per board.

    BTW, already found one bug/feature related to this - process.memory().stackEndAddress returns LINKER_END_VAR instead of __StackLimit which is the real end of stack. And with this patch without fix it currently points to beginning of variables, not end. So the fix is to change it to __StackLimit.

    However from the comments https://github.com/espruino/Espruino/blo­b/master/src/jswrap_process.c#L139 it looks like it is on purpose even if it really means end of globals and not end of stack. Not sure what was the reason to provide this feature exactly - some form of small and dangerous dynamic heap?

  • Did you find where variables are used without initialisation? That sounds like a pretty worrying bug.

    Perhaps yes - main runs jshInit before jsvInit https://github.com/espruino/Espruino/blo­b/master/targets/nrf5x/main.c#L21
    near end of jshInit is jsble_init https://github.com/espruino/Espruino/blo­b/master/targets/nrf5x/jshardware.c#L641­
    so I am guessing this
    https://github.com/espruino/Espruino/blo­b/master/targets/nrf5x/bluetooth.c#L2408­
    at least it hangs in HardFault_Handler after stepping over jsble_init()

  • Is there some stack check now at runtime

    There is, yes - basically whenever it recurses it checks, and tries to leave 256 bytes free.

    process.memory().stackEndAddress returns LINKER_END_VAR

    LINKER_END_VAR is end, which is realistically the end of the stack. Unless I'm wrong (which is very possible), since we have no malloc the stack runs from the top of memory, all the way down to the last allocated variable.

    __StackLimit is something specific to Nordic's SDK (and no other platforms) so I don't tend to use it. If I did, we'd just end up wasting RAM though.

    Not sure what was the reason to provide this feature exactly

    Someone wanted to store a bunch of data in 'free' RAM. IMO it should be removed - there are definitely more helpful things that could be added instead.

    variables are used without initialisation

    Thanks! I've just filed an issue for this: https://github.com/espruino/Espruino/iss­ues/1696

    That's a bug in the BLE code really - it's expecting to run after code has been loaded into the variables (so there's something there to read). Just moving the jsvar init earlier for that platform will fix the crash but the BLE code won't work because it's going to be looking at a bunch of empty JsVars.

    JsVar sizing

    ... I do like the idea of changing the amount of variables based maybe on a define for the amount of stack we want, but it'd be nice to have something that worked nicely multiplatform (at least on ARM). The original Espruino is based on STM32F103RCT6 (256k/48k), but in literally every board I've tested the chip was actually a STM32F103RDT6 (384k/64k). I never took advantage of that just in case, but it'd be great to have something that could take advantage of it at runtime.

    I'm a bit concered that Linux has a way of altering memory size, as does ESP32, and now this. And they're all IFDEF'd at the top of jsvInit. IMO it'd be nice to just pass in a pointer as well as size and handle it in the platform-specific main (although the Linux stuff would have to stay as that does some weird 'magic').

  • Back to the MTU topic, I was thinking how DFU update could be made faster and when reading about larger MTU and data length extensions and long writes and more packets per connection interval I am still confused how it works together and what could be done. Few questions - maybe someone has answers?

    If I keep DFU PACKET characteristics length in bootloader at 20, can longer MTU help me? Can BLE put more 20 byte write operations into one longer packet? Or it is strictly one ATT packet per BLE packet.

    If I increase the length of the characteristics (to e.g 256) will this be backward compatible when MTU is negotiated to be standard 23? Or am I supposed to change length of characteristics according to negotioated MTU, but that may not be even possible?

    In theory more packets per conenction interval can make it faster too according to Figure 6 in https://punchthrough.com/maximizing-ble-­throughput-part-3-data-length-extension-­dle-2/ however I don't know how to enable it in Bluez linux stack (or DFU bootloader) or how supported it is in general. I even don't know how to make connection interval shorter from linux side, that could help with current short packets too. Anyone knows?

    Asking because I got python DFU flasher working in linux on the Raspberry Pi via builtin BLE but the speed is 1.5KB/s so wondering how this can be improved.

    BTW the flasher is fork of https://github.com/dingari/ota-dfu-pytho­n - there is both legacy and secure dfu procedure implemented in pure python via scripting gatttool command in interactive mode :-) So no python BLE library is used at all. Pretty cool hack IMO. I have added bootloader/softdevice dfu modes,resuming when some notifications get occasionally lost in the middle of update and support for Desay bootloader used in DS-D6 and HX03W. It is surprisingly useful and stable for me now (especially when considering how it is done). Only the speed could be better.

  • just a followup, I was trying to improve DFU speed and at least with adafruit bootloader (where the MTU can be already negotiated at 247 without my changes) it was really matter of just increasing length of characteristics here from 20 to 244 and recompile. Then flashing 150KB zip with my python dfu flashing code took 32 seconds instead of 2 minutes and 6 seconds (when writing 244 bytes instead of 20). And it is even backward compatible with nrfconnect - at the slower speed, it still writes just 20 bytes to the characteristics.

    BTW is there any way to discover characteristics length remotely? I can't find it anywhere in nrfconnect, maybe that is not part of the information that can be discovered about services/characteristics?

  • If I keep DFU PACKET characteristics length in bootloader at 20, can longer MTU help me?

    don't think so :( As I understand it you can send/receive >1 packet per connection interval, but that can be done already without the increased MTU.

    If I increase the length of the characteristics (to e.g 256) will this be backward compatible when MTU is negotiated to be standard 23?

    I think in the bootloader it'll be fine. AFAIK In the firmware updater you'll have to check the MTU somehow and then sent the right size - maybe try it and see if it fails?

    In theory more packets per conenction interval can make it faster too

    Yeah, I'm not sure. I sort of assumed if you chucked a load of writes at it, it'd do what it could. Definitely on Espruino's side with the UART is does seem you can queue up more than one write per connection interval.

    I even don't know how to make connection interval shorter from linux side

    I don't know if you can, but it's trivial to do from the device side just by changing some connection parameters - see the NRF.setConnectionInterval implementation

    I got python DFU flasher working in linux on the Raspberry Pi via builtin BLE but the speed is 1.5KB/s so wondering how this can be improved.

    Nice! I didn't even realise that existed. I have a super hacky implementation based on the Node.js secure DFU implementation and I get 3kB/sec on linux now. I do that by issuing 8 write without responses at a time and then a CRC check (which I ignore the response of). The CRC check (which sends a notification) helps to 'sync' the writes - I guess you might be able to do something similar.

    I like the gatttool hack - makes me wonder if that couldn't be used to get a nice simple BLE implementation for Espruino linux builds.

    I was trying to improve DFU speed and at least with adafruit bootloader

    That's awesome! It'd be great to get that into the standard Espruino bootloader.

    So you still had to change your firmware uploader to use the larger packet sizes to take advantage of it, but otherwise that was it?

    BTW is there any way to discover characteristics length remotely?

    I don't know of anything, short of trying a write (with response?) and seeing if it fails?

    It seems that the connection starts off at 23 bytes and then there's a negotiation. At least on Android there's an event that is created when the MTU changes - but no idea how you'd see that with gatttool :)

  • So you still had to change your firmware uploader to use the larger packet sizes to take advantage of it, but otherwise that was it?

    Yes, exactly. However when I simply used MTU-3 for packet size it failed with unmodifed adafruit bootloader as it only accepts 20 bytes even when it negotiates MTU247. And I figure out that it doesn't work only after first notification that reports 0 bytes written so then I scale packet size down to 20 and restart from zero.

    So if it really cannot be determined remotely in advance I guess I need to flag it somewhere else.

    I will put up latest flasher code in the evening but here is that point in older version without this feature when I miss notification and get how many bytes are written and resume. So if I am still at zero offset there after sending first batch of long packets I now scale packet size back. But maybe I can call REPORT_RECEIVED_IMAGE_SIZE already after sending first packet.

    EDIT: I took just legacy DFU procedure into my desay/adafruit flasher but I hope same could be done for secure one too and eventually plan to try with SDK12 espruino bootloader too. BTW I now have pretty good understanding of legacy DFU procedure so next thing I'll try is adding legacy DFU to existing web bluetooth/javascript/node secure procedure you have there or the one which is here https://github.com/thegecko/web-bluetoot­h-dfu/

  • As for simple BLE implementation over gatttool, yes, everything is there, DFU procedure needs all basic features I guess and it works. pexpect python module makes it very easy, not sure what would be best in C for spawning process and pattern matching its output. So far gatttool just works when scripted like that however one peculiarity is that commands (like 'characteristics' that lists all of them) run asynchronously over prompt so it is a bit harder to match where the output ends as you get the prompt result [xxx][LE]> immediately and also each output line just overwrites the prompt so basically each line of output data is prefixed by prompt with some control and \r characters. So you cannot get all command output simply by waiting for next prompt and taking everything above it. So e.g. with 'characteristics' I simply have 3 seconds timeout to get the list and hope I got all of it. Other commands have specific string output pattern to match (receiving notification, MTU negotiation) so that is not an issue. Also you can miss something when waiting for two things - e.g. when pattern matching current command output and notification comes first instead. Anyway if there is easy pexpect like library for C it should be doable and not that hard.

  • I will put up latest flasher code in the evening

    changes here

    But maybe I can call REPORT_RECEIVED_IMAGE_SIZE already after sending first packet.

    That idea worked fine and is much faster than recovering later after sending 15 long packets and not receiving notification :-)

    I guess next time I can try to put secure DFU there too and try with SDK12 based espruino. As for bootloader changes hopefully it will be just changing this to 244 and possibly also NRF_BLE_MAX_MTU_SIZE definition few lines below to 247.

  • Nice - thanks! I'm not sure if there is a pexpect thing for C, but I imagine it wouldn't be too painful to code something up.

    However (this is a bit off-topic) I just looked at Bluetooth HCI in a bit more depth and it seems there's a standard protocol for implementing Bluetooth LE USB devices, and it's really not that painful. For example this for scanning: https://github.com/noble/node-bluetooth-­hci-socket/blob/master/examples/le-scan-­test.js

    It gives you super low level control, and would be multiplatform too (once the platform specific USB stuff was handled). I can imagine that actually the effort required to hack a BLE implementation with gatttool may not be that different to just talking direct to the BLE device.

    With legacy DFU on https://github.com/thegecko/web-bluetoot­h-dfu/, it was in there and got taken out in an earlier commit (you can find it in the commit history). I've got a super-hacky version of it that I'm using - and it really is super hacky, because again I'm doing the 'send 8 and wait for response' thing to speed up transfers.

  • Just a followup, I have recompiled espruino bootloader with larger MTU and after doing it almost right I found you already did it in your increased_mtu branch too :-)

    BTW here you increased ORIGIN but kept the length. However when fixing it it did not build as the data+heap overflowed so I had to build bootloader with -D__HEAP_SIZE=0, then it would link fine (by default it reserves 0x2000 for heap).

    Anyway, the bootloader still works after changes, so I modified the secure dfu method of python flasher to send larger data packets and it somehow works. The secure DFU design is more complicated than legacy one and is slower - with legacy you send all data and in the end there is one crc check. With secure there is additional level - instead of whole firmware you do the same but in 4096 (page size) chunks called objects which is crc checked and commited/executed separately. Anyway even with this overhead and 62 byte data packets there is speedup. I can do 2.8KB/s which is slower than legacy but still faster than before. Interesting is that when I write larger packets than 62 the bootloader completely freezes and device needs reset. So maybe there is some limit around 64 bytes but did not find it. The only limits I founds is code here and it should be enough. I also tried to increase chunk count from 4 to 8 there but it does not help. The interesting piece handling the packet is here and I don't see why p_req->req_len being over 62 would be a problem an any code below it. Maybe stack overflow or some BLE issue? I don't have debug output from bootloader, how NRF debug log work, can it do serial or only some segger specific tracing stuff? Anyway, this is just FYI. No need for you to spend extra time on debugging this.

    BTW, thanks for mentioning the removed legacy dfu web bluetooth implementation, found it, that saves some time :-)

    And as for "send 8 and wait for response" this is not hacky but normal(?) You initially send to bootloader how often you want notifications, with legacy you may even disable it completely. numbers over 10 are normal there. It it this and this but maybe you mean something different?

  • That's great, thanks! Yeah, I thought usually ORIGIN+LENGTH should have contained all of RAM - didn't realise that they added heap on top. It's surprising the build didn't fail for me too.

    62 byte data packets

    So the default would be 20 byte packets? Thanks for finding the limit - that must have taken a while :) Sounds like there's not really a reason not to increase the MTU for the bootloader.

    I never really 'got' the NRF debug stuff either, but I'm pretty sure it's expecting to use the UART.

    send 8 and wait for response

    Yeah, so in the old bootloader it was configurable, so setting to 8 was relatively sane.

    What I do in secure DFU is I ask for a CRC after every 8 packets in order to basically fake what the old bootloader did. I'm completely ignoring the result of the CRC - it's literally just used to allow me to push 8 writes via Noble as fast as possible and then use the request/response to sync it all back up :)

  • Yes default here is 20. My current numbers with getting exactly one notification per each 4096 based object is 1 minute and 18 seconds when sending 263228 bytes of application. When I remove explicit CRC check command before committing/executing each block I am down to 1 minute and 12 seconds (=3655 bytes/s). So 6 seconds for extra crc checks would not be so bad, however the notification also contains crc check so if I set it to receive one notification right after sending each whole 4096 block I get crc check for for free, without needing to send any explicit CALC_CHECKSUM command.

    my changes in python flasher for secure dfu procedure are here
    the output looks like this

    pi@raspberrypi:~/dsd6-ota-dfu-python $ ./dfu.py --secure -a CF:1E:35:40:14:92  -z espruino_2v04.187_dsd6_sdk12.zip 
    
        ================================
        ==                            ==
        ==         DFU Server         ==
        ==                            ==
        ================================
        
    Sending file espruino_2v04.187_dsd6_sdk12_app.bin to CF:1E:35:40:14:92
    Binary imge size: 263228
    Binary CRC32: 1958051001
    Connecting to CF:1E:35:40:14:92
    Checking DFU State...
    Init packet successfully transfered
    MTU: 158, packet size: 62
    Max object size: 4096, num objects: 65, offset: 0, total size: 263228
    Progress: |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx­xxxxxxxxxxx| 100.0% Complete (263228 of 263228 bytes)
    
    Upload complete in 1 minutes and 12 seconds
    DFU Server done
    
    

    And BTW, I finally understand relation between larger MTU and DLE - data length extension. They are not directly related. One can have larger MTU but no DLE. That is why iphone has MTU 158 or 185 so that it would fit in 6 or 7 classic 27 byte packets. So it is nice if there are larger packets via DLE and it fits in just one, but if DLE is not enabled, it is sent as set of packets hopefully in one connection interval.

    That's great, thanks! Yeah, I thought usually ORIGIN+LENGTH should have contained all of RAM - didn't realise that they added heap on top. It's surprising the build didn't fail for me too.

    It contains all of RAM, heap and stack is allocated inside it. You made RAM to be 0x9A48. But now when thinking about it, it worked fine because it was originally too small! It was below 0x8000=32KB before, but nrf52832 has 64KB which is 0x10000. In fact in normal espruino linker file here it is correct, 0x2c40+0xd3c0=0x10000. For bootloader it was previously 0x2C00+0x5380=0x7F80. So perhaps it was 32KB minus 0x80 space for bonding or some other info passed from application(?) Or is it intentional to have only 32KB for bootloader and leave the other 32KB for appliciation?

  • That's great!

    So perhaps it was 32KB minus 0x80 space for bonding

    Yeah, I think that's likely. I'm sure I heard something about that - it's for when you want buttonless DFU.

    It totally wasn't intentional on my part - but I guess it didn't really matter as the bootloader would never use the extra memory anyway.

  • Post a reply
    • Bold
    • Italics
    • Link
    • Image
    • List
    • Quote
    • code
    • Preview
About

CODE AND STORAGE OVERLAP while building firmware for NRF52

Posted by Avatar for binux @binux

Actions