You are reading a single comment by @FransM and its replies.
  • Optimisations that end up not calling the function at all are probably
    going to be very significant :)
    Lol yes, but that generally does not deliver the desired functionality.

    I tried unrolling with gcc 9.3.1 by adding

    #pragma GCC unroll 8
    

    before

    for (int bit=7;bit>=0;bit--) {
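
    To show where the pragma sits, here is a minimal sketch of a bit-banged write loop. It is an illustration only, not the firmware source: the `fake_gpio_t` struct, `gpio_set`/`gpio_clr` and `spi_write_unrolled` are hypothetical names, the GPIO port is simulated so it runs on a PC, and the pin roles are only inferred from the constants 0x8000000 and 0x80000 in the listing in this post. That listing is the compiler output for the real function, not for this sketch.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical pin masks, inferred from the constants in the listing. */
    #define MOSI_MASK (1u << 27) /* 0x8000000, assumed to be the data pin */
    #define SCK_MASK  (1u << 19) /* 0x80000, assumed to be the clock pin */

    /* Host-side stand-in for the GPIO port (an assumption, so this runs on a PC). */
    typedef struct {
        uint32_t out;        /* simulated output latch */
        unsigned clk_edges;  /* rising clock edges seen so far */
        uint8_t  last_byte;  /* last 8 data bits sampled on rising clock edges */
    } fake_gpio_t;

    static inline void gpio_set(fake_gpio_t *g, uint32_t mask) {
        if ((mask & SCK_MASK) && !(g->out & SCK_MASK)) { /* rising clock edge */
            g->clk_edges++;
            g->last_byte = (uint8_t)((g->last_byte << 1) |
                                     ((g->out & MOSI_MASK) ? 1u : 0u));
        }
        g->out |= mask;
    }

    static inline void gpio_clr(fake_gpio_t *g, uint32_t mask) {
        g->out &= ~mask;
    }

    /* Bit-banged write, MSB first. The pragma asks GCC to unroll the
       8-iteration inner loop; it must sit directly before the loop. */
    static void spi_write_unrolled(fake_gpio_t *g, const unsigned char *tx, size_t len) {
        assert(g != NULL && (tx != NULL || len == 0));
        for (size_t i = 0; i < len; i++) {
            int data = tx[i];
            #pragma GCC unroll 8
            for (int bit = 7; bit >= 0; bit--) {
                if ((data >> bit) & 1) gpio_set(g, MOSI_MASK);
                else                   gpio_clr(g, MOSI_MASK);
                gpio_set(g, SCK_MASK);  /* clock high */
                gpio_clr(g, SCK_MASK);  /* clock low */
            }
        }
    }
    ```

    Calling `spi_write_unrolled(&g, buf, n)` on a zero-initialised `fake_gpio_t` produces 8 clock pulses per byte. The pragma does not change behaviour, only the code the compiler emits for the inner loop.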
    

    The generated code:

    for (unsigned int i=0;i<len;i++) {
       44260:       f04f 43a0       mov.w   r3, #1342177280 ; 0x50000000
    static void spiFlashWrite(unsigned char *tx, unsigned int len) {
       44264:       b570            push    {r4, r5, r6, lr}
       44266:       4401            add     r1, r0
       44268:       f04f 6400       mov.w   r4, #134217728  ; 0x8000000
       4426c:       f44f 2200       mov.w   r2, #524288     ; 0x80000
       44270:       461d            mov     r5, r3
        int data = tx[i];
       44272:       f810 6b01       ldrb.w  r6, [r0], #1
        if (value == 0)
       44276:       ea5f 1cd6       movs.w  ip, r6, lsr #7
        p_reg->OUTSET = set_mask;
       4427a:       bf14            ite     ne
       4427c:       f8c3 4508       strne.w r4, [r3, #1288] ; 0x508
        p_reg->OUTCLR = clr_mask;
       44280:       f8c3 450c       streq.w r4, [r3, #1292] ; 0x50c
        if (value == 0)
       44284:       f016 0f40       tst.w   r6, #64 ; 0x40
        p_reg->OUTSET = set_mask;
       44288:       f8c3 2508       str.w   r2, [r3, #1288] ; 0x508
        p_reg->OUTCLR = clr_mask;
       4428c:       f8c3 250c       str.w   r2, [r3, #1292] ; 0x50c
        p_reg->OUTSET = set_mask;
       44290:       bf14            ite     ne
       44292:       f8c3 4508       strne.w r4, [r3, #1288] ; 0x508
        p_reg->OUTCLR = clr_mask;
       44296:       f8c3 450c       streq.w r4, [r3, #1292] ; 0x50c
    repeat last 6 lines 6 more times (with different constant in the tst)
    ...
    

    As can be seen, each iteration now saves 3 instructions.
    If I counted correctly, the unrolled function (from entry to exit) takes 202 bytes. The non-unrolled version takes 84 bytes, so the cost is 118 bytes.
    The benefit is that it saves about 5 clock cycles per bit (depending somewhat on the cycles needed for the branch). For a byte that is 40 clock cycles, or 2.5 µs, so writing a 10k image saves about 25 ms. Nice, but not more than that.
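
    The arithmetic can be checked with a small sketch. The function name `saved_ns` is hypothetical, and the 16 MHz clock is an assumption back-derived from 40 cycles taking 2.5 µs; at a different core clock the saving scales accordingly. "10k" is taken as 10000 bytes here.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Time saved in nanoseconds when writing `bytes` bytes, given the
       cycles saved per bit and the CPU clock in Hz (assumed values). */
    static uint64_t saved_ns(uint64_t bytes, unsigned cycles_per_bit, uint64_t clock_hz) {
        assert(clock_hz > 0);
        uint64_t cycles = bytes * 8u * cycles_per_bit; /* 8 bits per byte */
        return cycles * 1000000000ull / clock_hz;
    }
    ```

    With 5 cycles per bit and a 16 MHz clock, `saved_ns(1, 5, 16000000)` gives 2500 ns per byte and `saved_ns(10000, 5, 16000000)` gives 25000000 ns, i.e. 25 ms for the whole image.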

    And of course this is not specific to blitting. I actually prototyped this on the flash write, which is not relevant to blitting at all.
    I assume that this applies to some other places as well (e.g. reading from flash, writing to display, ...)

    (Oh, and I fully understand if you do not want to take this, but it was fun researching it.)
