You are reading a single comment by @Gordon and its replies.
  • Hi,

    Honestly, I think the only way to really be sure with stuff like this is to benchmark, and also to look at the disassembly (... make lst). I just looked at the disassembly of the existing spiFlashWrite:

    00044f88 <spiFlashWrite>:
    static void spiFlashWrite(unsigned char *tx, unsigned int len) {
    44f88: b5f0 push {r4, r5, r6, r7, lr}
    44f8a: 4401 add r1, r0
    44f8c: f04f 43a0 mov.w r3, #1342177280 ; 0x50000000
    44f90: f04f 6400 mov.w r4, #134217728 ; 0x8000000
    44f94: f44f 2500 mov.w r5, #524288 ; 0x80000
    int data = tx[i];
    44f98: f810 6b01 ldrb.w r6, [r0], #1
    for (int bit=7;bit>=0;bit--) {
    44f9c: 2207 movs r2, #7
    nrf_gpio_pin_write((uint32_t)pinInfo[SPIFLASH_PIN_MOSI].pin, (data>>bit)&1 );
    44f9e: fa46 f702 asr.w r7, r6, r2
    if (value == 0)
    44fa2: 07ff lsls r7, r7, #31
    p_reg->OUTCLR = clr_mask;
    44fa4: bf54 ite pl
    44fa6: f8c3 450c strpl.w r4, [r3, #1292] ; 0x50c
    p_reg->OUTSET = set_mask;
    44faa: f8c3 4508 strmi.w r4, [r3, #1288] ; 0x508
    for (int bit=7;bit>=0;bit--) {
    44fae: f112 32ff adds.w r2, r2, #4294967295 ; 0xffffffff
    44fb2: f8c3 5508 str.w r5, [r3, #1288] ; 0x508
    p_reg->OUTCLR = clr_mask;
    44fb6: f8c3 550c str.w r5, [r3, #1292] ; 0x50c
    44fba: d2f0 bcs.n 44f9e <spiFlashWrite+0x16>
    for (unsigned int i=0;i<len;i++) {
    44fbc: 4281 cmp r1, r0
    44fbe: d1eb bne.n 44f98 <spiFlashWrite+0x10>
    }
    44fc0: bdf0 pop {r4, r5, r6, r7, pc}

    And it seems reasonably tight - everything is already inlined. I'm not sure you're really going to see a noticeable difference in speed, especially as I believe IO is stuck at 16MHz. If it did turn out that unrolling the loop was significantly faster then I'd be interested, but if at all possible I'd like to just get the compiler to do it so it isn't such a maintenance headache: https://stackoverflow.com/questions/4071690/tell-gcc-to-specifically-unroll-a-loop
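As a rough sketch of what asking the compiler to unroll could look like: GCC 8 and later accept `#pragma GCC unroll N` directly before a loop. This is not the Espruino code. `spiFlashWriteUnrolled`, `gpio_set`, `gpio_clr` and the `sim_*` variables are hypothetical host-side stand-ins for the OUTSET/OUTCLR stores so the example can run off-target, and the pin masks are simply the two constants visible in the disassembly above (0x8000000 for MOSI, 0x80000 for the clock).

```c
#include <stdint.h>
#include <stddef.h>

/* Pin masks as seen in the disassembly: MOSI = bit 27, SCK = bit 19. */
#define MOSI_MASK 0x08000000u
#define SCK_MASK  0x00080000u

/* Hypothetical host-side model of the GPIO port (assumption, not Espruino code).
   On the nRF52 the two helpers below would be single stores to OUTSET / OUTCLR. */
static uint32_t sim_out;     /* simulated pin state */
static uint8_t  sim_rx[16];  /* bytes a flash chip would have sampled */
static size_t   sim_bits;    /* number of rising clock edges seen */

static inline void gpio_set(uint32_t mask) {
  if ((mask & SCK_MASK) && !(sim_out & SCK_MASK)) {
    /* Rising clock edge: sample MOSI, MSB first, as the flash chip would. */
    size_t byte = sim_bits / 8;
    if (byte < sizeof(sim_rx))
      sim_rx[byte] = (uint8_t)((sim_rx[byte] << 1) | ((sim_out & MOSI_MASK) ? 1 : 0));
    sim_bits++;
  }
  sim_out |= mask;
}

static inline void gpio_clr(uint32_t mask) { sim_out &= ~mask; }

static void spiFlashWriteUnrolled(const unsigned char *tx, unsigned int len) {
  for (unsigned int i = 0; i < len; i++) {
    int data = tx[i];
    /* Ask GCC to fully unroll the 8-iteration bit loop (needs optimisation enabled;
       other compilers may ignore the pragma with a warning). */
#pragma GCC unroll 8
    for (int bit = 7; bit >= 0; bit--) {
      if ((data >> bit) & 1) gpio_set(MOSI_MASK); else gpio_clr(MOSI_MASK);
      gpio_set(SCK_MASK);
      gpio_clr(SCK_MASK);
    }
  }
}
```

The source stays a plain loop, so there is nothing extra to maintain. Whether the unrolled version is actually faster still has to be checked by benchmarking on the device and looking at the new disassembly.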

    I don't know for sure what the flash chip used in the Bangle is, I'm afraid - as far as I can tell it's marked A1Y200.

    Honestly, I developed the whole thing from scratch on one Bangle.js device which is still working fine, so I don't think you'll brick your Bangle if you mess with the flash code. The bootloader doesn't use flash at all, so you should still be able to get to it regardless of what you change.

    I'd strongly suggest you open your Bangle and attach an nRF52DK debugger to it though - it'll make uploading firmware so much faster.

    Another option is to use something like an MDBT42 breakout, attach it to an external flash chip, and try using 'Inline C' in the Web IDE to see how fast you can get the IO going.

    Some things that I think really might increase speed though:

    • Create a spiFlashRead so that when you're reading data from flash you're not also writing - that would be a really easy win
    • Keep CS asserted on the flash chip after a read, and if the next byte(s) requested come right after the previous bytes requested then you don't need to send the 4 bytes of address again - you just clock out more data. That should be pretty easy as well.
    • Merging the writes - don't use OUTSET and OUTCLR but write directly to OUT - you may be able to change the clock and data at the same time on one of the clock edges, which would help to overcome the 16MHz limitation.
    • The original watch firmware actually used QSPI (so 4 data wires) but it did it in software so it was basically slower than just bit-banging with 1 data bit! It's a good example of why you need to actually benchmark before optimising :) With some thought you may be able to do it properly though - using multiply + shift (or just a lookup table) to get the bits in the correct places for their pins with only a few clock cycles (taking some inspiration from https://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith64Bits).
    • Using hardware SPI - you may be able to get this faster than bit-banged SPI, but supposedly the max bitrate is still 8Mbps
    • Preloading with hardware SPI and DMA - a lot of the time when executing from flash, Espruino will load data in chunks of 16 bytes. You could potentially pre-load the next 16 bytes using DMA while Espruino runs through the first 16 bytes
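To make the "merging the writes" idea above concrete, here is a minimal sketch under stated assumptions. It is not Espruino code: `spiFlashWriteMerged`, `out_write` and the `sim_*` variables are hypothetical, and `out_write` is a host-side model of a store to the port's OUT register so the example can run off-target. The pin masks are the two constants from the disassembly. The existing loop does three stores per bit (MOSI, clock high, clock low), while this one does two by setting the new data bit in the same store that takes the clock low.

```c
#include <stdint.h>
#include <stddef.h>

#define MOSI_MASK 0x08000000u  /* pin 27, as seen in the disassembly */
#define SCK_MASK  0x00080000u  /* pin 19, as seen in the disassembly */

/* Hypothetical host-side model of the OUT register (assumption). */
static uint32_t sim_out_reg;      /* current value of OUT */
static unsigned sim_out_writes;   /* number of stores to OUT */
static uint8_t  sim_sampled[16];  /* bytes a flash chip would have sampled */
static size_t   sim_edges;        /* number of rising clock edges seen */

static inline void out_write(uint32_t value) {
  if ((value & SCK_MASK) && !(sim_out_reg & SCK_MASK)) {
    /* Rising clock edge: sample the MOSI level that was already on the pin
       before this store, so data changed together with the rising edge
       would not be counted as valid. */
    size_t byte = sim_edges / 8;
    if (byte < sizeof(sim_sampled))
      sim_sampled[byte] = (uint8_t)((sim_sampled[byte] << 1) |
                                    ((sim_out_reg & MOSI_MASK) ? 1u : 0u));
    sim_edges++;
  }
  sim_out_reg = value;
  sim_out_writes++;
}

static void spiFlashWriteMerged(const unsigned char *tx, unsigned int len) {
  /* Snapshot the other pins once. This assumes nothing else (for example an
     interrupt) changes pins on this port while the transfer is running. */
  uint32_t base = sim_out_reg & ~(MOSI_MASK | SCK_MASK);
  for (unsigned int i = 0; i < len; i++) {
    unsigned int data = tx[i];
    for (int bit = 7; bit >= 0; bit--) {
      uint32_t low = base | (((data >> bit) & 1u) ? MOSI_MASK : 0u);
      out_write(low);             /* clock low and new MOSI in one store */
      out_write(low | SCK_MASK);  /* clock high, MOSI unchanged */
    }
  }
  out_write(base);                /* leave clock and MOSI low */
}
```

The catch with writing OUT directly is that it overwrites every pin on the port, which is why the other pins are snapshotted first. If anything else can toggle a pin mid-transfer, OUTSET/OUTCLR remain the safer choice, and as with the other ideas it needs benchmarking on real hardware.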