faster spiFlashWriteByte

FransM

Optimisations that end up not calling the function at all are probably
going to be very significant :)
Lol yes, but that generally does not deliver the desired functionality.

I tried unrolling with gcc 9.3.1 by adding

#pragma GCC unroll 8

before

for (int bit=7;bit>=0;bit--) {
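
To show where the pragma sits, here is a minimal host-runnable sketch of such a bit-bang loop. It is not the Espruino source: fake_gpio_t, gpio_set, gpio_clear and spi_write_byte_unrolled are hypothetical stand-ins for the real OUTSET/OUTCLR register writes, and the pin masks are simply the two constants visible in the disassembly (0x8000000 for data, 0x80000 for clock).

```c
#include <stdint.h>

/* Hypothetical stand-in for the GPIO block so the sketch runs on a host.
   On target these would be writes to the real OUTSET/OUTCLR registers. */
typedef struct {
    uint32_t out;     /* simulated pin levels */
    uint8_t  shifted; /* data bits sampled on each rising clock edge */
} fake_gpio_t;

#define MOSI_MASK (1u << 27) /* 0x8000000, the data constant in the listing */
#define SCK_MASK  (1u << 19) /* 0x80000, the clock constant in the listing */

static void gpio_set(fake_gpio_t *g, uint32_t mask) {
    /* Model the receiver: sample the data pin when the clock goes high. */
    if ((mask & SCK_MASK) && !(g->out & SCK_MASK)) {
        g->shifted = (uint8_t)((g->shifted << 1) |
                               ((g->out & MOSI_MASK) ? 1u : 0u));
    }
    g->out |= mask;
}

static void gpio_clear(fake_gpio_t *g, uint32_t mask) {
    g->out &= ~mask;
}

static void spi_write_byte_unrolled(fake_gpio_t *g, uint8_t data) {
    /* The pragma must come directly before the loop it applies to. */
#pragma GCC unroll 8
    for (int bit = 7; bit >= 0; bit--) {
        if ((data >> bit) & 1) {
            gpio_set(g, MOSI_MASK);
        } else {
            gpio_clear(g, MOSI_MASK);
        }
        gpio_set(g, SCK_MASK);   /* clock high */
        gpio_clear(g, SCK_MASK); /* clock low */
    }
}
```

With the loop unrolled, the compiler can turn each `(data >> bit) & 1` into a test against a constant, which is what the tst instructions with different constants in the listing reflect.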

The generated code for spiFlashWrite with the pragma applied:

for (unsigned int i=0;i<len;i++) {
   44260:       f04f 43a0       mov.w   r3, #1342177280 ; 0x50000000
static void spiFlashWrite(unsigned char *tx, unsigned int len) {
   44264:       b570            push    {r4, r5, r6, lr}
   44266:       4401            add     r1, r0
   44268:       f04f 6400       mov.w   r4, #134217728  ; 0x8000000
   4426c:       f44f 2200       mov.w   r2, #524288     ; 0x80000
   44270:       461d            mov     r5, r3
    int data = tx[i];
   44272:       f810 6b01       ldrb.w  r6, [r0], #1
    if (value == 0)
   44276:       ea5f 1cd6       movs.w  ip, r6, lsr #7
    p_reg->OUTSET = set_mask;
   4427a:       bf14            ite     ne
   4427c:       f8c3 4508       strne.w r4, [r3, #1288] ; 0x508
    p_reg->OUTCLR = clr_mask;
   44280:       f8c3 450c       streq.w r4, [r3, #1292] ; 0x50c
    if (value == 0)
   44284:       f016 0f40       tst.w   r6, #64 ; 0x40
    p_reg->OUTSET = set_mask;
   44288:       f8c3 2508       str.w   r2, [r3, #1288] ; 0x508
    p_reg->OUTCLR = clr_mask;
   4428c:       f8c3 250c       str.w   r2, [r3, #1292] ; 0x50c
    p_reg->OUTSET = set_mask;
   44290:       bf14            ite     ne
   44292:       f8c3 4508       strne.w r4, [r3, #1288] ; 0x508
    p_reg->OUTCLR = clr_mask;
   44296:       f8c3 450c       streq.w r4, [r3, #1292] ; 0x50c
repeat last 6 lines 6 more times (with different constant in the tst)
...

As can be seen, each iteration now saves 3 instructions.
If I counted correctly, the unrolled function (from entry to exit) takes 202 bytes. The non-unrolled version takes 84 bytes, so the cost is 118 bytes.
The benefit is about 5 clock cycles saved per bit (the exact figure also depends on the cycles needed for the branch). For a byte that is 40 clock cycles, or 2.5 µs, so writing a 10k image saves about 25 ms. Nice, but not more than that.
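
The estimate above can be reproduced with a few lines of C. This is only a sketch: unroll_saving_ms is a hypothetical helper, and the 16 MHz rate is an assumption inferred from the post's own "40 clock cycles or 2.5 µs" figure, not something stated outright.

```c
#include <assert.h>
#include <stdint.h>

/* Time saved, in milliseconds, when each bit costs `cycles_per_bit` fewer
   cycles. `clock_hz` is the rate at which those cycles are counted. */
static double unroll_saving_ms(uint32_t bytes, uint32_t cycles_per_bit,
                               double clock_hz) {
    assert(clock_hz > 0.0);
    double cycles_saved = (double)bytes * 8.0 * (double)cycles_per_bit;
    return cycles_saved / clock_hz * 1e3;
}
```

Calling `unroll_saving_ms(10000, 5, 16e6)` gives roughly 25 ms, matching the figure for a 10k image; a faster effective clock would shrink the saving proportionally.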

And of course this is not specific to blitting. I actually prototyped this on flash write, which is not relevant to blitting at all.
I assume it applies in some other places as well (e.g. reading from flash, writing to the display, ...)

(Oh and I can understand fully if you do not want to take this but it was fun researching this)
