Avatar for FransM


Member since Apr 2020 • Last active Jun 2020
  • 7 conversations

Most recent activity

  • in Bangle.js

    Optimisations that end up not calling the function at all are probably
    going to be very significant :)
    Lol yes, but that generally does not deliver the desired functionality.

    I tried unrolling with gcc 9.3.1 by adding

    #pragma GCC unroll 8
    for (int bit=7;bit>=0;bit--) {
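
    For context, a minimal sketch of where the pragma sits in a bit-banged write loop. This is not the actual spiFlashWrite source: fake_pins_t, pin_write_mosi and pin_pulse_clk are hypothetical host-side stand-ins for the GPIO writes, so the loop can be compiled and checked on a PC.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-in for the data and clock pins (not Espruino code). */
    typedef struct {
      int mosi;        /* current level of the data pin */
      uint8_t rx[16];  /* bits sampled on each clock pulse, MSB first */
      size_t nbits;    /* number of clock pulses recorded so far */
    } fake_pins_t;

    static inline void pin_write_mosi(fake_pins_t *p, int level) { p->mosi = level; }

    /* One clock pulse: the fake receiver samples the data pin. */
    static inline void pin_pulse_clk(fake_pins_t *p) {
      if (p->nbits < 8 * sizeof p->rx) {
        if (p->mosi) p->rx[p->nbits / 8] |= (uint8_t)(0x80u >> (p->nbits % 8));
        p->nbits++;
      }
    }

    static void soft_spi_write(fake_pins_t *p, const uint8_t *tx, size_t len) {
      for (size_t i = 0; i < len; i++) {
        int data = tx[i];
        /* The pragma applies to the loop that directly follows it (GCC 8 or later). */
    #pragma GCC unroll 8
        for (int bit = 7; bit >= 0; bit--) {
          pin_write_mosi(p, (data >> bit) & 1);
          pin_pulse_clk(p);
        }
      }
    }
    ```

    With the loop fully unrolled, the shift count is a compile-time constant for each bit, which is what lets the compiler test each bit with a constant mask.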

    The generated code:

    for (unsigned int i=0;i<len;i++) {
       44260:       f04f 43a0       mov.w   r3, #1342177280 ; 0x50000000
    static void spiFlashWrite(unsigned char *tx, unsigned int len) {
       44264:       b570            push    {r4, r5, r6, lr}
       44266:       4401            add     r1, r0
       44268:       f04f 6400       mov.w   r4, #134217728  ; 0x8000000
       4426c:       f44f 2200       mov.w   r2, #524288     ; 0x80000
       44270:       461d            mov     r5, r3
        int data = tx[i];
       44272:       f810 6b01       ldrb.w  r6, [r0], #1
        if (value == 0)
       44276:       ea5f 1cd6       movs.w  ip, r6, lsr #7
        p_reg->OUTSET = set_mask;
       4427a:       bf14            ite     ne
       4427c:       f8c3 4508       strne.w r4, [r3, #1288] ; 0x508
        p_reg->OUTCLR = clr_mask;
       44280:       f8c3 450c       streq.w r4, [r3, #1292] ; 0x50c
        if (value == 0)
       44284:       f016 0f40       tst.w   r6, #64 ; 0x40
        p_reg->OUTSET = set_mask;
       44288:       f8c3 2508       str.w   r2, [r3, #1288] ; 0x508
        p_reg->OUTCLR = clr_mask;
       4428c:       f8c3 250c       str.w   r2, [r3, #1292] ; 0x50c
        p_reg->OUTSET = set_mask;
       44290:       bf14            ite     ne
       44292:       f8c3 4508       strne.w r4, [r3, #1288] ; 0x508
        p_reg->OUTCLR = clr_mask;
       44296:       f8c3 450c       streq.w r4, [r3, #1292] ; 0x50c
    (the last 6 instructions repeat 6 more times, each with a different constant in the tst)

    As can be seen, each bit now takes 6 instructions instead of the 9 in the loop version, so each iteration saves 3 instructions.
    If I counted correctly the unrolled function (from entry to exit) takes 202 bytes. The non-unrolled version takes 84 bytes, so the cost is 118 bytes.
    The benefit is that it saves about 5 clock cycles per bit (depending a bit on the cycles needed for the branch). For a byte this is 40 clock cycles or 2.5 µs, so writing a 10k image saves 25 ms. Nice, but not more than that.
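
    The arithmetic above can be written out as a small sketch. The byte and cycle counts are the ones quoted in this post. The 16 MHz clock is an assumption: it is the rate at which 40 cycles come out at 2.5 µs, and it is kept as a parameter so the figures can be redone for another clock.

    ```c
    #include <stdint.h>

    /* Figures quoted in the post. */
    enum { UNROLLED_BYTES = 202, ROLLED_BYTES = 84, CYCLES_SAVED_PER_BIT = 5 };

    /* Assumption: 16 MHz, the rate implied by "40 cycles = 2.5 us". */
    #define ASSUMED_CLOCK_HZ 16000000u

    /* Extra code size paid for unrolling, in bytes. */
    static inline uint32_t code_size_cost(void) {
      return UNROLLED_BYTES - ROLLED_BYTES;
    }

    /* Cycles saved for one byte (8 bits). */
    static inline uint32_t cycles_saved_per_byte(void) {
      return 8 * CYCLES_SAVED_PER_BIT;
    }

    /* Time saved in nanoseconds when writing nbytes at the given clock rate. */
    static inline uint64_t ns_saved(uint64_t nbytes, uint32_t clock_hz) {
      return nbytes * cycles_saved_per_byte() * 1000000000ull / clock_hz;
    }
    ```

    Under that assumption ns_saved(10000, ASSUMED_CLOCK_HZ) gives 25000000 ns, which is the 25 ms mentioned above; a faster clock shrinks the saving proportionally.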

    And of course this is not specific to blitting. Actually I prototyped this on flash write, which is not relevant for blitting at all.
    I assume that this applies to some other places as well (e.g. reading from flash, writing to display, ...)

    (Oh and I can understand fully if you do not want to take this but it was fun researching this)

  • in General

    I'm using Ubuntu 18.04.4 LTS (64 bit).
    Web upgrade works.

    After transferring my self-built firmware to the phone I managed to flash the image to the phone. As far as I can judge it works nicely, but I did not try every available feature.

  • in General

    I want to report my findings on the cross compiler.
    The Espruino README_Building.md suggests using gcc-arm-none-eabi-5_4-2016q3.
    As I was interested in the pragma for loop unrolling I decided to download gcc-arm-none-eabi-9-2020-q2-update-x86_64-linux.tar.bz2 from https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads

    The good news: I managed to compile (for bangle.js) and the .text section decreases from 0x4e6dc to 0x4e154 bytes, so a code size reduction of 1416 bytes (0x588).

    The bad news: I did not manage to upload my firmware.
    I use Chromium, connect to my bangle, go to Settings/Flasher, select Flash from File, and select the zip file. I get the firmware update dialog for bangle.js and select Next. The dialog disappears but nothing happens.
    What is going wrong here?

    (and the other bad news, much less important of course: according to the .lst file the loop unrolling pragma that I added did not have the desired effect; no unrolling was done)

  • in Bangle.js

    Eliminating the branch will save one clock cycle and up to 3 cycles to refill the pipeline.
    The inner loop that is executed 8 times, consists of 9 instructions including the branch

       44f9e:   fa46 f702   asr.w   r7, r6, r2
       44fa2:   07ff        lsls    r7, r7, #31
       44fa4:   bf54        ite pl
       44fa6:   f8c3 450c   strpl.w r4, [r3, #1292] ; 0x50c
       44faa:   f8c3 4508   strmi.w r4, [r3, #1288] ; 0x508
       44fae:   f112 32ff   adds.w  r2, r2, #4294967295 ; 0xffffffff
       44fb2:   f8c3 5508   str.w   r5, [r3, #1288] ; 0x508
       44fb6:   f8c3 550c   str.w   r5, [r3, #1292] ; 0x50c
       44fba:   d2f0        bcs.n   44f9e <spiFlashWrite+0x16>

    I did not count all the instruction cycles, but the bcs instruction might well account for 10% of the time in the loop.
    (the str.w instructions are one cycle so it is probably even more; but of course the only real way to find this out is by benchmarking; I haven't figured out what the easiest way to do this is).

    The other suggestions also will help but gains in the inner loop are 8 times as effective as an optimisation in the outer loop.

    Afterthought: if we unroll the loop then the loading of r2 and the adds.w are also not needed.
    That might well increase the gain to 20-25%.
    It is too late now, but I'll try to do the counting later this week (and provide a better patch)

    Edit: when unrolling it might also be that the asr.w can be simplified (and if not, the decrement of r2 will also be kept).
    I'll try to make a version with an unrolled loop with #pragma and see how that disassembles.

    Wrt measuring:
    What I am missing is that I do not know yet how to realise the JS to C bindings.

    For measuring maybe the DWT CYCCNT register can be used.
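
    A minimal sketch of how the cycle counter could be used for this, assuming the standard Cortex-M4 debug registers (DEMCR at 0xE000EDFC, DWT_CTRL at 0xE0001000, DWT_CYCCNT at 0xE0001004). The helper names are hypothetical, and the registers are passed in as pointers so the logic can also be tried on a PC with plain variables. I have not run this on the watch.

    ```c
    #include <stdint.h>

    /* Standard ARMv7-M register addresses, for use on the target. */
    #define DEMCR_ADDR      0xE000EDFCu
    #define DWT_CTRL_ADDR   0xE0001000u
    #define DWT_CYCCNT_ADDR 0xE0001004u

    #define DEMCR_TRCENA       (1u << 24) /* enables the DWT unit */
    #define DWT_CTRL_CYCCNTENA (1u << 0)  /* starts the cycle counter */

    /* Enable the DWT unit, reset the counter and start counting. */
    static inline void cyccnt_enable(volatile uint32_t *demcr,
                                     volatile uint32_t *dwt_ctrl,
                                     volatile uint32_t *dwt_cyccnt) {
      *demcr |= DEMCR_TRCENA;
      *dwt_cyccnt = 0;
      *dwt_ctrl |= DWT_CTRL_CYCCNTENA;
    }

    /* Cycles between two counter reads; unsigned subtraction
       stays correct across a single 32-bit wrap-around. */
    static inline uint32_t cyccnt_elapsed(uint32_t start, uint32_t end) {
      return end - start;
    }
    ```

    On the target the pointers would be the addresses above cast to volatile uint32_t *, with one read of DWT_CYCCNT before and one after the call being measured.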

  • in Bangle.js

    Hi Gordon,

    Thanks for the extensive reply. I was unaware of make lst; I used cc commands that I found by running make under sh -x.

    Looking at the assembly, unrolling the for loop would eliminate the branch at line 26. M4 has no instruction cache and no speculative execution or so, so that would help a bit.
    Doing this with a gcc pragma is definitely better. I was unaware of that possibility.

    DMA might not be faster. I've seen situations where the setup time exceeded the time needed for bitbanging.

    And unfortunately I have no debugger. I think I didn't see that option when ordering my bangle (or should I have gotten it from somewhere else?)

    Note also that I am still learning about the ecosystem and software structure.
    (and the reason I asked about the flash chip is that I hoped its datasheet would also give info on how to drive it).

  • in Bangle.js

    Let me try to clarify.
    What I want to achieve is a 16 bit full-screen clockface.
    My understanding is that if the hand moves I need to restore (at minimum) the area that was covered by the hand at the old position and that will not be overwritten by drawing the new hand.
    So this is not about drawing the hand, it is about restoring the background.

    Of course one can rewrite the old background in full: expensive.
    Or you can rewrite only sections (as suggested by Abhigkar earlier in this thread) (bounding box based, but you would need images to draw)
    Or you can store the background of every hand position and redraw that one (that is the 60 background images I was suggesting before; note that the background might be different when the hand is at a different position, as I just want to have an image as watchface).
    Or you can restore the background based upon the "old hand": that is instead of rendering the hand, restore/re-render the background pixels that were covered when the hand was drawn at the previous position.

    Or, rephrasing the problem:
    Suppose as watchface I want to have a high-res image of my kid. When the minute hand moves from say 0 to 1, I am looking for an efficient way (both CPU and RAM efficient) to restore the area that was overwritten by the hand while at minute 0, after which I can write the hand for minute 1. To do this as efficiently as possible, I would like to redraw only the pixels that were actually overwritten.
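
    To illustrate the last option (restoring based on the "old hand"), here is a rough sketch. It simplifies the hand to a one-pixel line walked with Bresenham's algorithm, and it uses plain 16 bit arrays for the screen and the background. On the watch the background would come from flash and the screen writes would go to the LCD, so plot_line, draw_hand and restore_hand are hypothetical names for illustration only.

    ```c
    #include <stdint.h>
    #include <stdlib.h>

    /* Walk the line (x0,y0)-(x1,y1) with Bresenham's algorithm. For every pixel
       on it, copy the background pixel when `background` is non-NULL,
       otherwise write `color`. */
    static void plot_line(uint16_t *screen, int width,
                          int x0, int y0, int x1, int y1,
                          const uint16_t *background, uint16_t color) {
      int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
      int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
      int err = dx + dy;
      for (;;) {
        int idx = y0 * width + x0;
        screen[idx] = background ? background[idx] : color;
        if (x0 == x1 && y0 == y1) break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
      }
    }

    /* Draw the hand in a single color. */
    static inline void draw_hand(uint16_t *screen, int width,
                                 int x0, int y0, int x1, int y1, uint16_t color) {
      plot_line(screen, width, x0, y0, x1, y1, NULL, color);
    }

    /* Undo a previously drawn hand by re-rendering only the pixels it covered. */
    static inline void restore_hand(uint16_t *screen, const uint16_t *background,
                                    int width, int x0, int y0, int x1, int y1) {
      plot_line(screen, width, x0, y0, x1, y1, background, 0);
    }
    ```

    A move from minute 0 to minute 1 would then be restore_hand with the old end point followed by draw_hand with the new one. Only the pixels under the old hand are read from the background, so no bounding box and no set of 60 stored images is needed.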

    Did I now express my problem better?

  • in Bangle.js

    Rotate is quite ok for drawing the hand.
    What I wanted to say is that you might need 60 images each containing the background for each minute.

    I still feel it might be doable to restore background images based on the clock hand, if you have reasonably fast flash access and know where the picture data is (it does help if the data is stored contiguously, so not scattered over the flash in different sectors; I haven't studied the file system to see how the FS works).

    How much time do you think is needed to read a byte from flash at a given position?
    How much time to write a byte to LCD?
    And of course this does require low level access to make it performant.

    I'll see if I can come up with some pseudocode or C code to illustrate what I want, but it may be Friday or Saturday before I get to that.

  • in Bangle.js


    I authored a version of spiFlashWriteByte that is a bit more efficient (at the expense of using a bit more ROM due to loop unrolling). I'd appreciate feedback on this.
    I'm also interested in how best to test this (I want to avoid bricking my bangle.js)

    Code is at:

    Something similar can be done for spiFlashReadWriteByte

    (oh and the rationale for this is that these two functions are very low level functions to access the SPI flash, so every clock cycle saved here pays off).

    Edit: what flash chip is actually used in bangle.js?