Eliminating the branch will save one clock cycle for the instruction itself, plus up to 3 cycles for refilling the pipeline when the branch is taken.
The inner loop, which is executed 8 times, consists of 9 instructions including the branch.
I did not count all the instruction cycles, but the bcs instruction might well account for 10% of the time in the loop.
(The str.w instructions are one cycle, so it is probably even more; but of course the only real way to find out is by benchmarking, and I haven't figured out the easiest way to do that yet.)
The other suggestions will also help, but a gain in the inner loop is 8 times as effective as the same optimisation in the outer loop.
Afterthought: if we unroll the loop then the loading of r2 and the adds.w are also not needed.
That might well increase the gain to 20-25%.
It is too late now, but I'll try to do the counting later this week (and provide a better patch)
Edit: when unrolling, it might also be that the asr.w can be simplified (and if not, the decrement of r7 will have to be kept as well).
I'll try to make a version with an unrolled loop with #pragma and see how that disassembles.
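To make the idea concrete, here is a minimal sketch of what such an unrolled, branch-free version could look like in C. It is not the actual Espruino code: the function name `write_byte_bits`, the set/clear register pointers and `pin_mask` are all hypothetical stand-ins for a loop that outputs 8 bits one at a time. It assumes GCC 8 or later for `#pragma GCC unroll`, and whether the compiler really drops the loop counter and the branch has to be checked in the disassembly.

```c
#include <stdint.h>

/* Hypothetical stand-in for the real inner loop: output the 8 bits of
   `byte`, MSB first, by storing `pin_mask` to either a "set" or a "clear"
   register. All names here are assumptions, not Espruino identifiers. */
static inline void write_byte_bits(uint8_t byte,
                                   volatile uint32_t *set_reg,
                                   volatile uint32_t *clr_reg,
                                   uint32_t pin_mask)
{
    /* Index 0 selects the clear register, index 1 the set register, so the
       bit value picks the store target without a conditional branch. */
    volatile uint32_t *regs[2] = { clr_reg, set_reg };

    /* Ask GCC to unroll all 8 iterations; with a constant trip count the
       loop counter and the backward branch can then disappear. */
#pragma GCC unroll 8
    for (int bit = 7; bit >= 0; bit--) {
        *regs[(byte >> bit) & 1u] = pin_mask;
    }
}
```

Compiling this with something like `arm-none-eabi-gcc -O2 -S` and comparing the output against the current loop would show whether the bcs, the r2 load and the adds.w really go away.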
Wrt measuring:
What I am still missing is how to realise the JS-to-C bindings.
For measuring, the DWT CYCCNT register could perhaps be used.
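A minimal sketch of how cycle counting with DWT CYCCNT could work, assuming a Cortex-M3/M4 core where the DWT unit is implemented. The register addresses and bit positions are the architectural ARMv7-M ones; the helper names (`cyccnt_enable`, `cyccnt_read`, `cyccnt_elapsed`) are made up for this example and are not part of Espruino.

```c
#include <stdint.h>

/* ARMv7-M debug registers (architectural addresses on Cortex-M3/M4). */
#define DEMCR      (*(volatile uint32_t *)(uintptr_t)0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *)(uintptr_t)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)(uintptr_t)0xE0001004u)

#define DEMCR_TRCENA       (1u << 24)  /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA (1u << 0)   /* starts the cycle counter */

/* Hypothetical helper: switch on the cycle counter and reset it to zero. */
static inline void cyccnt_enable(void)
{
    DEMCR |= DEMCR_TRCENA;
    DWT_CYCCNT = 0;
    DWT_CTRL |= DWT_CTRL_CYCCNTENA;
}

/* Hypothetical helper: read the free-running 32-bit cycle counter. */
static inline uint32_t cyccnt_read(void)
{
    return DWT_CYCCNT;
}

/* Unsigned subtraction gives the right answer even if the counter wrapped
   once between the two reads. */
static inline uint32_t cyccnt_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Usage on the target:
     cyccnt_enable();
     uint32_t t0 = cyccnt_read();
     ... code under test ...
     uint32_t cycles = cyccnt_elapsed(t0, cyccnt_read());
*/
```

The two reads themselves cost a few cycles, so timing an empty section first and subtracting that overhead should give a fairer number for a loop this short.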