You are reading a single comment by @FransM and its replies.
  • Eliminating the branch will save one clock cycle, plus up to 3 cycles for refilling the pipeline whenever the branch is taken.
    The inner loop, which is executed 8 times, consists of 9 instructions including the branch:

       44f9e:   fa46 f702   asr.w   r7, r6, r2
       44fa2:   07ff        lsls    r7, r7, #31
       44fa4:   bf54        ite     pl
       44fa6:   f8c3 450c   strpl.w r4, [r3, #1292] ; 0x50c
       44faa:   f8c3 4508   strmi.w r4, [r3, #1288] ; 0x508
       44fae:   f112 32ff   adds.w  r2, r2, #4294967295 ; 0xffffffff
       44fb2:   f8c3 5508   str.w   r5, [r3, #1288] ; 0x508
       44fb6:   f8c3 550c   str.w   r5, [r3, #1292] ; 0x50c
       44fba:   d2f0        bcs.n   44f9e <spiFlashWrite+0x16>
    

    I did not count all the instruction cycles, but the bcs instruction might well account for 10% of the time in the loop.
    (The str.w instructions take one cycle each, so it is probably even more. Of course the only real way to find out is by benchmarking; I haven't figured out yet what the easiest way to do this is.)
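
    As a rough back-of-the-envelope check of that guess, the sketch below computes which share of one loop iteration goes to the taken branch. It assumes one cycle for each of the other 8 instructions and a taken branch of 1 cycle plus 0 to 3 refill cycles; branch_share_percent is a name made up for this illustration, and the result is no substitute for a benchmark.

```c
#include <assert.h>
#include <stdio.h>

/* Share (in whole percent, rounded down) of one loop iteration that is
 * spent in the taken branch.
 * other_cycles: cycles for everything except the branch. ASSUMPTION: 8,
 *               i.e. one cycle for each of the 8 non-branch instructions.
 * refill:       pipeline refill cycles after the taken branch (0..3). */
static unsigned branch_share_percent(unsigned other_cycles, unsigned refill)
{
    assert(refill <= 3u);
    unsigned branch_cycles = 1u + refill;
    return (100u * branch_cycles) / (other_cycles + branch_cycles);
}

/* Print the estimate for every refill penalty from 0 to 3 cycles. */
static void print_branch_share_table(void)
{
    for (unsigned refill = 0; refill <= 3u; refill++)
        printf("refill=%u: branch ~%u%% of one iteration\n",
               refill, branch_share_percent(8u, refill));
}
```

    Under these assumptions the share ranges from about 11% (no refill penalty) to about 33% (3 refill cycles), which fits the feeling that 10% is on the low side.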

    The other suggestions will also help, but since the inner loop runs 8 times per outer iteration, gains in the inner loop are 8 times as effective as an optimisation in the outer loop.

    Afterthought: if we unroll the loop, then the loading of r2 and the adds.w are not needed either.
    That might well increase the gain to 20-25%.
    It is too late now, but I'll try to do the counting later this week (and provide a better patch).

    Edit: when unrolling, it might also be that the asr.w can be simplified (and if not, the decrement of r2 will also have to be kept).
    I'll try to make a version with an unrolled loop with #pragma and see how that disassembles.
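
    A minimal host-side sketch of what such an experiment could look like is given below. It is not the actual spiFlashWrite source: the pin masks, the meaning of the two register offsets and all function names are assumptions made for illustration, and the register stores are recorded in an array so that the rolled and the unrolled variant can be compared on a PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL pin masks; the real values are not given in the comment. */
#define MOSI_MASK (1u << 3)
#define SCK_MASK  (1u << 4)

/* Register offsets as they appear in the listing (0x508 and 0x50c).
 * ASSUMPTION: a store to 0x508 sets pins, a store to 0x50c clears them. */
enum { REG_SET = 0x508, REG_CLR = 0x50c };

/* Host-side stand-in for the str.w instructions: record every store.
 * On the target this would be a volatile register write instead. */
typedef struct { uint16_t reg; uint32_t mask; } gpio_store_t;
#define TRACE_MAX 24 /* 8 bits x 3 stores per bit */
static gpio_store_t trace[TRACE_MAX];
static size_t trace_len;

static inline void gpio_store(uint16_t reg, uint32_t mask)
{
    assert(trace_len < TRACE_MAX);
    trace[trace_len].reg = reg;
    trace[trace_len].mask = mask;
    trace_len++;
}

/* One bit: drive the data pin, then pulse the clock high and low. */
#define SEND_BIT(data, bit)                                                   \
    do {                                                                      \
        gpio_store(((data) & (1u << (bit))) ? REG_SET : REG_CLR, MOSI_MASK);  \
        gpio_store(REG_SET, SCK_MASK);                                        \
        gpio_store(REG_CLR, SCK_MASK);                                        \
    } while (0)

/* Loop version, MSB first, with a run-time bit index (like r2 in the
 * listing). The pragma asks GCC 8 or later to unroll the loop. */
static void send_byte_loop(uint8_t data)
{
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8
#pragma GCC unroll 8
#endif
    for (int bit = 7; bit >= 0; bit--)
        SEND_BIT(data, bit);
}

/* Manually unrolled version: every bit index is a compile-time constant,
 * so neither a counter nor a variable shift is needed. */
static void send_byte_unrolled(uint8_t data)
{
    SEND_BIT(data, 7); SEND_BIT(data, 6); SEND_BIT(data, 5); SEND_BIT(data, 4);
    SEND_BIT(data, 3); SEND_BIT(data, 2); SEND_BIT(data, 1); SEND_BIT(data, 0);
}

/* Returns 1 if both variants issue the same sequence of stores for data. */
static int same_store_sequence(uint8_t data)
{
    gpio_store_t expected[TRACE_MAX];

    trace_len = 0;
    send_byte_loop(data);
    if (trace_len != TRACE_MAX) return 0;
    for (size_t i = 0; i < TRACE_MAX; i++) expected[i] = trace[i];

    trace_len = 0;
    send_byte_unrolled(data);
    if (trace_len != TRACE_MAX) return 0;
    for (size_t i = 0; i < TRACE_MAX; i++)
        if (trace[i].reg != expected[i].reg || trace[i].mask != expected[i].mask)
            return 0;
    return 1;
}
```

    Whether the compiler really drops the asr.w and the adds.w for either variant can only be seen by disassembling the result for the target, for example with objdump -d.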

    Wrt measuring:
    What I am still missing is how to realise the JS-to-C bindings.

    For measuring maybe the DWT CYCCNT register can be used.
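
    A sketch of how the cycle counter could be used is given below, assuming a Cortex-M3/M4 class core that implements the DWT cycle counter. The register addresses and bit positions are the architectural ones (DEMCR, DWT_CTRL, DWT_CYCCNT); the struct and the function names are invented for this example. The registers are passed in through pointers so that the logic can also be exercised on a PC with fake registers.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural addresses on ARMv7-M cores. */
#define DEMCR_ADDR         0xE000EDFCu /* Debug Exception and Monitor Control */
#define DWT_CTRL_ADDR      0xE0001000u /* DWT control register */
#define DWT_CYCCNT_ADDR    0xE0001004u /* DWT cycle counter */
#define DEMCR_TRCENA       (1u << 24)  /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA (1u << 0)   /* starts the cycle counter */

/* Hypothetical helper type: the three registers needed for measuring. */
typedef struct {
    volatile uint32_t *demcr;
    volatile uint32_t *dwt_ctrl;
    volatile uint32_t *dwt_cyccnt;
} cyc_regs_t;

/* Register set for a real target (do not dereference these on a host PC). */
static cyc_regs_t cyc_regs_target(void)
{
    cyc_regs_t r = {
        (volatile uint32_t *)(uintptr_t)DEMCR_ADDR,
        (volatile uint32_t *)(uintptr_t)DWT_CTRL_ADDR,
        (volatile uint32_t *)(uintptr_t)DWT_CYCCNT_ADDR,
    };
    return r;
}

/* Enable tracing, reset the counter and start it. */
static void cyc_enable(const cyc_regs_t *r)
{
    assert(r && r->demcr && r->dwt_ctrl && r->dwt_cyccnt);
    *r->demcr |= DEMCR_TRCENA;
    *r->dwt_cyccnt = 0;
    *r->dwt_ctrl |= DWT_CTRL_CYCCNTENA;
}

/* Read the current cycle count. */
static uint32_t cyc_now(const cyc_regs_t *r)
{
    return *r->dwt_cyccnt;
}

/* Elapsed cycles; unsigned subtraction stays correct across one wrap-around. */
static uint32_t cyc_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

    On the target one would call cyc_regs_target() and cyc_enable() once, read cyc_now() directly before and after the code under test, and pass both values to cyc_elapsed(). The reads themselves cost a few cycles, so measuring an empty section first gives the overhead to subtract.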
