faster spiFlashWriteByte

Posted on
  • Hi,

    I authored a version of spiFlashWriteByte that is a bit more efficient (at the expense of using a bit more rom due to loop unrolling. Appreciate feedback on this.
    I'm also interested on how best to test this (I want to avoid bricking my bangle.js)

    Code is at:
    https://github.com/FransM/Espruino/tree/spiFlashWriteByte

    Something similar can be done for spiFlashReadWriteByte

    (oh and the rationale for this, is that these two functions are very low level functions to access the spi flash so every clock cycle saved here pays off).

    Edit: what flash chip is actually used in bangje.js?

  • Hi,

    Honestly, I think the only way to really be sure with stuff like this is to benchmark (and also look at the disassembly (... make lst). I just looked at the disassembly of the existing spiFlashWrite:

    00044f88 <spiFlashWrite>:
    static void spiFlashWrite(unsigned char *tx, unsigned int len) {
       44f88:	b5f0      	push	{r4, r5, r6, r7, lr}
       44f8a:	4401      	add	r1, r0
       44f8c:	f04f 43a0 	mov.w	r3, #1342177280	; 0x50000000
       44f90:	f04f 6400 	mov.w	r4, #134217728	; 0x8000000
       44f94:	f44f 2500 	mov.w	r5, #524288	; 0x80000
        int data = tx[i];
       44f98:	f810 6b01 	ldrb.w	r6, [r0], #1
        for (int bit=7;bit>=0;bit--) {
       44f9c:	2207      	movs	r2, #7
          nrf_gpio_pin_write((uint32_t)pinInfo[SPIFLASH_PIN_MOSI].pin, (data>>bit)&1 );
       44f9e:	fa46 f702 	asr.w	r7, r6, r2
        if (value == 0)
       44fa2:	07ff      	lsls	r7, r7, #31
        p_reg->OUTCLR = clr_mask;
       44fa4:	bf54      	ite	pl
       44fa6:	f8c3 450c 	strpl.w	r4, [r3, #1292]	; 0x50c
        p_reg->OUTSET = set_mask;
       44faa:	f8c3 4508 	strmi.w	r4, [r3, #1288]	; 0x508
        for (int bit=7;bit>=0;bit--) {
       44fae:	f112 32ff 	adds.w	r2, r2, #4294967295	; 0xffffffff
       44fb2:	f8c3 5508 	str.w	r5, [r3, #1288]	; 0x508
        p_reg->OUTCLR = clr_mask;
       44fb6:	f8c3 550c 	str.w	r5, [r3, #1292]	; 0x50c
       44fba:	d2f0      	bcs.n	44f9e <spiFlashWrite+0x16>
      for (unsigned int i=0;i<len;i++) {
       44fbc:	4281      	cmp	r1, r0
       44fbe:	d1eb      	bne.n	44f98 <spiFlashWrite+0x10>
    }
       44fc0:	bdf0      	pop	{r4, r5, r6, r7, pc}
    

    And it seems reasonably tight - everything is already inlined. I'm not sure you're really going to see a noticeable difference in speed especially as I believe IO is stuck at 16MHz. If it did turn out that unrolling the loop was significantly faster then I'd be interested, but if at all possible I'd like to just get the compiler to do it so it wasn't such a maintenance headache: https://stackoverflow.com/questions/4071690/tell-gcc-to-specifically-unroll-a-loop

    I don't know for sure what the flash chip used in the Bangle is I'm afraid - as far as I can tell it's marked A1Y200

    Honestly I developed the whole thing from scratch on one Bangle.js device which is still working fine, so I don't think you'll brick your Bangle if you mess with the flash code. The bootloader doesn't use flash at all so you should still get able to get to it regardless of what you change.

    I'd strongly suggest you open your Bangle and attach a nRF52DK debugger to it though - it'll make uploading firmware so much faster.

    Another option is to use something like an MDBT42 breakout, attach it to an external flash chip, and try using 'Inline C' in the Web IDE to see how fast you can get the IO going.

    Some things that I think really might increase speed though:

    • Create a spiFlashRead so that when you're reading data from flash you're not also writing - that would be a really easy win
    • Keep CS asserted on the flash chip after a read, and if the next byte(s) requested come right after the previous bytes requested then you don't need to send the 4 bytes of address again - you just clock out more data. That should be pretty easy as well.
    • Merging the writes - don't use OUTSET and OUTCLR but write directly to OUT - you may be able to change the clock and data at the same time on one of the clock edges which would help to overcome the 16Mhz limitation.
    • The original watch firmware actually used QSPI (so 4 data wires) but it did it in software so it was basically slower than just bit-banging with 1 data bit! It's a good example of why you need to actually benchmark before optimising :) With some thought you may be able to do it properly though - using multiply + shift (or just a lookup table) to get the bits in the correct places for their pins with only a few clock cycles (taking some inspiration from https://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith64Bits).
    • Using hardware SPI - you may be able to get this faster than bit-banged SPI, but supposedly the max bitrate is still 8Mbps
    • Preloading with hardware SPI and DMA - a lot of the time when executing from flash, Espruino will load data in chunks of 16 bytes. You could potentially pre-load the next 16 bytes using DMA while Espruino runs through the first 16 bytes
  • Hi Gordon,

    Thanks for the extensive reply. I was unaware of make lst, I used cc commands that I found by running make under sh -x

    Looking at the assembly unrolling the for loop would eliminate the branch at line 26. M4 has no instruction cache and no speculative execution or so, so that would help a bit.
    Doing this with a gcc pragma is definitely better. I was unaware of that possibility.

    DMA might not be faster. I've seen situations where the setup time exceeded the time needed for bitbanging.

    And unfortunately I have no debugger. I think I didn't see that option when ordering my bangle (or should I have gotten it from somewhere else?)

    Note also that I am still learning about the ecosystem and software structure.
    (and the reason I asked about the flash chip I hoped that its datasheet would also give info on how to drive it).

  • Yes, it would eliminate the branch but I think in real terms that may not make a noticeable difference.

    The debugger isn't something I provide - you'd just have to wire up an nRF52832DK board. 3 wires basically, but you do have to pull back the LCD so it's not ideal if you plan on putting the watch back together again :)

    https://www.espruino.com/Bangle.js+Technical - there's a link to Macronix MX25 which seems to be a similar chip. I believe pretty much all SPI flash are driven the same way though

  • Eliminating the branch will save one clock cycle and up to 3 cycles to refill the pipeline.
    The inner loop that is executed 8 times, consists of 9 instructions including the branch

      44f9e:   fa46 f702   asr.w   r7, r6, r2
       44fa2:   07ff        lsls    r7, r7, #31
       44fa4:   bf54        ite pl
       44fa6:   f8c3 450c   strpl.w r4, [r3, #1292] ; 0x50c
       44faa:   f8c3 4508   strmi.w r4, [r3, #1288] ; 0x508
       44fae:   f112 32ff   adds.w  r2, r2, #4294967295 ; 0xffffffff
       44fb2:   f8c3 5508   str.w   r5, [r3, #1288] ; 0x508
       44fb6:   f8c3 550c   str.w   r5, [r3, #1292] ; 0x50c
       44fba:   d2f0        bcs.n   44f9e <spiFlashWrite+0x16>
    

    I did not count all the instruction cycles, but the bcs instruction might well count for 10% of the time in the loop.
    (the str.w instructions are one cycle so it is probably even more; but of course the only real way to find this is by benchmarking; I haven't figured out what the easiest way to do this).

    The other suggestions also will help but gains in the inner loop are 8 times as effective as an optimisation in the outer loop.

    Afterthought: if we unroll the loop then the loading of r2 and the adds.w are also not needed.
    That might well increase the gain to 20-25%.
    It is too late now, but I'll try to do the counting later this week (and provide a better patch)

    Edit: when unrolling it might also be that the asr.w can be simplified (and if not the decrement of r7 also will be kept.
    I'll try to make a version with an unrolled loop with #pragma and see how that disassembles.

    Wrt measuring:
    What I am missing is that I do not know yet how to realise the JS to C bindings.

    For measuring maybe the DWT CYCCNT register can be used.

  • The other suggestions also will help but

    Optimisations that end up not calling the function at all are probably going to be very significant :)

    For measuring maybe the DWT CYCCNT register can be used.

    Maybe... IMO the best way to benchmark this is to write enough data that it takes a noticeable amount of time (you can do it with CS not asserted so it's ignored) and then measure that.

    But just in general, based on your other comments: I'm all for optimising flash memory access, graphics, and Espruino in general - but if this is specifically for a single-purpose function for blitting a small portion of screen it may not be something I'd want to pull into Bangle.js, given we're currently using pretty much all the available flash memory.

  • Optimisations that end up not calling the function at all are probably
    going to be very significant :)
    Lol yes, but that does generally not deliver the desired functionality.

    I tried unrolling with gcc 9.3.1 and with adding

    [#pragma](https://forum.espruino.com/search/?q=%23pragma) GCC unroll 8
    

    before

    for (int bit=7;bit>=0;bit--) {
    

    The generated code:

    for (unsigned int i=0;i<len;i++) {
       44260:       f04f 43a0       mov.w   r3, #1342177280 ; 0x50000000
    static void spiFlashWrite(unsigned char *tx, unsigned int len) {
       44264:       b570            push    {r4, r5, r6, lr}
       44266:       4401            add     r1, r0
       44268:       f04f 6400       mov.w   r4, #134217728  ; 0x8000000
       4426c:       f44f 2200       mov.w   r2, #524288     ; 0x80000
       44270:       461d            mov     r5, r3
        int data = tx[i];
       44272:       f810 6b01       ldrb.w  r6, [r0], #1
        if (value == 0)
       44276:       ea5f 1cd6       movs.w  ip, r6, lsr #7
        p_reg->OUTSET = set_mask;
       4427a:       bf14            ite     ne
       4427c:       f8c3 4508       strne.w r4, [r3, #1288] ; 0x508
        p_reg->OUTCLR = clr_mask;
       44280:       f8c3 450c       streq.w r4, [r3, #1292] ; 0x50c
        if (value == 0)
       44284:       f016 0f40       tst.w   r6, #64 ; 0x40
        p_reg->OUTSET = set_mask;
       44288:       f8c3 2508       str.w   r2, [r3, #1288] ; 0x508
        p_reg->OUTCLR = clr_mask;
       4428c:       f8c3 250c       str.w   r2, [r3, #1292] ; 0x50c
        p_reg->OUTSET = set_mask;
       44290:       bf14            ite     ne
       44292:       f8c3 4508       strne.w r4, [r3, #1288] ; 0x508
        p_reg->OUTCLR = clr_mask;
       44296:       f8c3 450c       streq.w r4, [r3, #1292] ; 0x50c
    repeat last 6 lines 6 more times (with different constant in the tst)
    ...
    

    As can be seen now each iteration saves 3 instructions.
    If I counted correctly the unrolled function (from entry to exit) takes 202 bytes. The non-unrolled version takes 84 bytes, so the cost is 118 bytes.
    Benefit is that it saves about 5 clock cycles per bit (depends also a bit on the cycles needed for the branch). So for a byte this is 40 clock cycles or 2.5 uS. So if you write a 10k image it is 25 ms. Nice but not more than that.

    And of course this is not specific for blitting. Actually I did prototype this on flash write which is not relevant for blitting at all.
    I assume that this applies to some other places as well (e.g. reading from flash, writing to display, ...)

    (Oh and I can understand fully if you do not want to take this but it was fun researching this)

  • Thanks - that's really interesting. Did you manage to get any real world performance stats though? It's fine in theory but I'm pretty sure the speed ends up limited by the IO bus speed.

    Since JS code saved to flash is executed from flash, you should see some improvements in actual JS execution speed. I tend to use this mandelbrot code as a speed test:

    function mandelbrot() {
     var tm = getTime();
     var sizex=32,sizey=32;
     for (var y=0;y<sizey;y++) {
      var line="";
      for (var x=0;x<sizex;x++) {
       var Xr=0;
       var Xi=0;
       var Cr=(4.0*x/sizex)-2.0;
       var Ci=(4.0*y/sizey)-2.0;
       var i=0;
       while ((i<16) && ((Xr*Xr+Xi*Xi)<4)) {
        var t=Xr*Xr - Xi*Xi + Cr;
        Xi=2*Xr*Xi+Ci;
        Xr=t;
        i++;
       }
       line += (i&1) ? "*" : " ";
      }
      print(line);
     }
     print(getTime()-tm);
    }
    
    // from RAM
    mandelbrot(); // 7.44 sec
    // save to flash and then read&eval, so it'll end up in flash
    require("Storage").write("mandeltest","("+mandelbrot.toString()+")");
    mandelbrot = eval(require("Storage").read("mandeltest"));
    // execute from flash
    mandelbrot(); // 8.79 sec
    

    If you copy & paste this in the left hand size hopefully you'll see an improvement.

  • Just to add, I had a quick look at this, and I replaced the spiFlashReadWrite with a spiFlashRead which also avoided a memset as well - it reduces the exceution time from 8.79 to 8.58. Might not seem like much but there's a lot of other overhead for those calculations!

    • You can turn on GCC's -O3 for single functions, so I gave that a try, and it actually unrolls the loop anyway. It actually stopped Bangle.js being able to read from flash - so it turns out you can go too fast!
    • Using GCC's -O2 for the functions instead of the standard (-Os for size) actually made things slower!
  • there is also unused SPI2 interface, if you are really going for speed, that could be even faster. spi is already compiled in so it should not enlarge code size as the driver code is already there. or it can be even written without using nordic drivers, without using interrupts it is just writing RXD,TXD registers and polling ready flag, this would be similar style to current bitbanging way but could be even shorter (and faster)

  • I don't think I/O will be the bottleneck. This is bit-banged SPI so the clock is generated by software. Of course it could be that the chip introduces additional latency when writing to a GPIO pin, but generally that is pretty fast, just sending a bit to an output pin.

    Wrt the test:
    I'm using chromium (version 83) on ubuntu 18.04.04.
    I cannot paste in the left pane of the IDE, only in the right pane.
    Should I use a different browser?

    For the test: as I only updated spiFlashWrite I don't expect any gains as executing from flash is only read, but let me try.

    Copying all but the last three lines to the right pane and selecting execute from ram gives me 7.4944 seconds.
    Running the last thee lines give me 8.9745 so no gain.

    Let's try to also unroll the loop for spiFlashReadWrite
    text size in the lst file before is 0x0005d1e4 bytes
    after unrolling it is 0x0005d2d4
    difference is F0 == 220 decimal

    rerunning the exe now gives me 8.7609, 2nd run gives 8.7582

    Observations/tentative conclusions

    1. My execution time when running from RAM is slightly longer (about 0.05 sec so < 1%
    2. When running from flash is 8.97; substantially longer than your 8.79 (unless you accidently swapped digits when typing the comment of course). A reason for this could be that I compiled with gcc 9 (as mentioned in another post).
    3. With the spiFlashReadWrite loop unrolled I get a gain of 0.21 seconds (8.97-8.76). That would mean 2.3%. However the RAM run needed 7.49 seconds, so the original code for me used 8.97-7.49 = 1.48 seconds more when executing from flash compared when executing from RAM whereas the new code takes 8.76-7.49 = 1.27 seconds. So the I/O based speedup is about 14% ((0.21/1.48) * 100%)

    While looking at the code I noticed that some more optimization opportunities. I'll look into those as well and report back.

  • Of course it could be that the chip introduces additional latency when writing to a GPIO pin

    Yes, that's what I mean. I believe the IO bus on-chip is 16MHz

    I cannot paste in the left pane of the IDE, only in the right pane.

    Works for me. Ubuntu + Chrome too. Ctrl+V ?

    Actually I found maybe a better set of tests for you:

    // write an empty (0xff) file of 115kB
    require("Storage").write("test","",0,240*240*2);
    
    // draw it to screen (all white)
    t=getTime();g.drawImage({width:240,height:240,bpp:16,buffer:require("Storage").read("test")});print(getTime()-t);
    // 0.47 on a recent-ish build
    // 0.36 with most recent changes
    
    // sum all bytes
    t=getTime();E.sum(require("Storage").read("test"));print(getTime()-t);
    // 0.80 on a recent-ish build
    // 0.67 with most recent changes
    
    // crc
    t=getTime();E.CRC32(require("Storage").read("test"));print(getTime()-t);
    // 0.63 on a recent-ish build
    // 0.50 with most recent changes
    

    More info at https://github.com/espruino/Espruino/issues/1849

  • Yes @fanoush - that was one of my initial suggestions - it could definitely be worth a try... On Bangle.js I don't even use SPI1 for anything :)

    The latest changes are no committed to Git so you should see some decent improvements.

    The speed difference could well be the new GCC - probably the optimise for size uses some slightly different heuristics.

    But yes, if you see other optimisation improvements I'm definitely interested!

    Especially in drawImage(s) we have some fasts paths for drawing, but if you don't hit them then the unpacking of bitmap data isn't that fast and could probably be improved using a pointer to a decode function.

  • Works for me. Ubuntu + Chrome too. Ctrl+V ?
    Aaargh. Yes. I right-clicked and saw no paste option.

    Triggered by the comments in your code snippet I decided to sync with the latest version and measure again

    Indeed now unrolling also fails for me. Looking at the assembly the whole function spiFlashRead is inlined in jshFlashRead.
    In the non-unrolled version the loop reading the bits looks like:

    45782:       2108            movs    r1, #8
       45784:       2200            movs    r2, #0
       45786:       f8c3 c508       str.w   ip, [r3, #1288] ; 0x508
       4578a:       f8d3 4510       ldr.w   r4, [r3, #1296] ; 0x510
       4578e:       f8c3 c50c       str.w   ip, [r3, #1292] ; 0x50c
       45792:       f3c4 5400       ubfx    r4, r4, #20, #1
       45796:       3901            subs    r1, #1
       45798:       ea44 0242       orr.w   r2, r4, r2, lsl #1
       4579c:       d1f3            bne.n   45786 <jshFlashRead+0x72>
    

    I stripped off the source as they were misleading.
    The first two lines are not part of the loop, r1 is the loop counter, r2 is where the results go to.
    7 instructions in the loop (including the branch)

    After unrolling the following 5 instructions are executed to read a bit

    p_reg->OUTSET = set_mask;
       457c4:       f8c3 1508       str.w   r1, [r3, #1288] ; 0x508
       457c8:       ea42 0244       orr.w   r2, r2, r4, lsl #1
        return p_reg->IN;
       457cc:       f8d3 4510       ldr.w   r4, [r3, #1296] ; 0x510
        p_reg->OUTCLR = clr_mask;
       457d0:       f8c3 150c       str.w   r1, [r3, #1292] ; 0x50c
        return ((nrf_gpio_port_in_read(reg) >> pin_number) & 1UL);
       457d4:       f3c4 5400       ubfx    r4, r4, #20, #1
    

    After that there is another write to set the clk high (write to reg OUTSET (0x508)

    I suspect that things fail because the clock down time is too short.
    The data sheet of the NRF specifies a value of 25 ns for a 50 pF load. The minimum clock down time needed by the flash chip we do not know.
    Guess there is not much more to be gained here.

  • One last observation to report.
    I noticed that after

    commit 49b4b523accc22539eb4156085cde748a208170a Author: Gordon
    Williams gw@pur3.co.uk Date: Fri Jun 5 09:52:56 2020 +0100

    Bangle.js: Improve SPI flash speed by with specific function for reading and keeping CS asserted ( #1849)
    

    there was only one caller left to spiFlashReadWrite, namely in spiFlashStatus.
    I figured I could equally well eliminate that function and replace it with a call to spiFlashWrite followed by a call to spiFlashRead

    This works like a charm but degrades the performace a bit as jshFlashRead does not inline spiFlashRead any more. Without my change jshFlashRead was the only caller of spiFlashRead and the compiler or the post linker optimizer decided the function could be inlined.
    Not any more after changing spiFlashStatus to use separate Read and Write calls.

    It might we worthwhile to mark spiFlashRead and spiFlashWrite as inline functions. Thatwill (marginally) speed up jshFlashRead as the call to spiFlashWrite will be gone.

  • but degrades the performace a bit as jshFlashRead does not inline spiFlashRead any more

    That's odd - actually when I tried it, spiFlashRead was never inlined (before or after changes). And it is still only called once afaik.

    However you're talking about pretty minimal clock cycle impacts since it's only called once per read (which'll be 16 bytes normally) - again, if it speeds things up noticeably I'm all for it - but I'm not convinced you will see a noticeable increase in read speeds

  • Probably the function was inlined because I was using gcc 9
    Performance impact is indeed minimal.

    If you want to get rid of spiFlashReadWrite you can merge this pull request
    https://github.com/espruino/Espruino/pull/1850

    Edit: just force-pushed an update to the branch as I forgot to add an entry in Changelog

  • Thanks! Don't worry about the changelog - the commit comes after all the other stuff to change flash memory access so I don't think it's a big deal.

  • Post a reply
    • Bold
    • Italics
    • Link
    • Image
    • List
    • Quote
    • code
    • Preview
About

faster spiFlashWriteByte

Posted by Avatar for FransM @FransM

Actions