Optimisations that end up not calling the function at all are probably
going to be very significant :)
Lol yes, but that generally does not deliver the desired functionality.
I tried unrolling with gcc 9.3.1 and with adding
before
The generated code:
As can be seen, each iteration now saves 3 instructions.
If I counted correctly, the unrolled function (from entry to exit) takes 202 bytes. The non-unrolled version takes 84 bytes, so the cost is 118 bytes.
The benefit is that it saves about 5 clock cycles per bit (this also depends a little on the cycles needed for the branch). For a byte this is 40 clock cycles, or 2.5 µs (which works out to a 16 MHz clock). So if you write a 10k image, it saves about 25 ms. Nice, but not more than that.
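To make that arithmetic easy to redo for other clock speeds, here is a minimal sketch in C. The function names are made up for illustration, and the 16 MHz clock is only an assumption inferred from 40 cycles taking 2.5 µs, so plug in the real clock of your board.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles saved for one byte, given the cycles saved per bit.
   The "5 cycles per bit" figure is the estimate from the post. */
uint32_t cycles_saved_per_byte(uint32_t cycles_per_bit) {
    return cycles_per_bit * 8u;
}

/* Time saved in nanoseconds when writing `bytes` bytes at `clock_hz`.
   clock_hz is an assumption of the caller (16 MHz matches the 2.5 us figure). */
uint64_t ns_saved(uint32_t cycles_per_bit, uint32_t clock_hz, uint32_t bytes) {
    assert(clock_hz > 0);
    uint64_t cycles = (uint64_t)cycles_saved_per_byte(cycles_per_bit) * bytes;
    return cycles * 1000000000ull / clock_hz;
}
```

With 5 cycles per bit and a 16 MHz clock, `ns_saved(5, 16000000u, 1)` gives 2500 ns per byte and `ns_saved(5, 16000000u, 10000)` gives 25000000 ns, i.e. the 25 ms mentioned above (taking "10k" as 10000 bytes).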
And of course this is not specific to blitting. I actually prototyped this on the flash write, which has nothing to do with blitting.
I assume that this applies to some other places as well (e.g. reading from flash, writing to display, ...).
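For anyone who wants to try this on another bit loop, here is a rough sketch of the idea, not the actual Espruino code. It assumes GCC 8 or later, where `#pragma GCC unroll N` can be placed directly before a loop; `bit_writer`, `send_byte_rolled`, `send_byte_unrolled` and `collect_bit` are hypothetical names invented for this example.

```c
#include <stdint.h>

/* Hypothetical hook: stands in for whatever drives the data line. */
typedef void (*bit_writer)(int bit, void *ctx);

/* Rolled version: one counter update and branch per bit. */
void send_byte_rolled(uint8_t data, bit_writer write_bit, void *ctx) {
    for (int bit = 7; bit >= 0; bit--) {
        write_bit((data >> bit) & 1, ctx);
    }
}

/* Unrolled version: asks GCC to replicate the loop body 8 times,
   trading code size for fewer loop-control instructions. */
void send_byte_unrolled(uint8_t data, bit_writer write_bit, void *ctx) {
#pragma GCC unroll 8
    for (int bit = 7; bit >= 0; bit--) {
        write_bit((data >> bit) & 1, ctx);
    }
}

/* Host-side test writer: shifts each bit (MSB first) into *ctx. */
void collect_bit(int bit, void *ctx) {
    uint8_t *acc = (uint8_t *)ctx;
    *acc = (uint8_t)((*acc << 1) | (bit & 1));
}
```

Both versions must send the same bits, so passing `collect_bit` with a `uint8_t` accumulator to each is a quick sanity check on a PC. The function pointer is only there to keep the sketch testable; in real firmware the pin write would be inlined, otherwise the call overhead would swamp the few cycles saved.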
(Oh, and I fully understand if you do not want to take this, but it was fun researching it.)