• Just a thought - You could write some assembler code that did basically what the ILI9341pal driver does, but with DMA:

    1. Take 4bpp (?) data a chunk at a time and unpack it into another buffer with 16 bits.
    2. Kick off DMA from that buffer
    3. Take another 4bpp (?) data chunk and unpack it into another buffer with 16 bits.
    4. Wait for DMA to finish
    5. Goto 2

    Obviously you've got your current solution with the nice fonts so it's not a big deal, but this could end up being really interesting.
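A minimal sketch of that ping-pong loop in plain JavaScript, with the DMA transfer simulated by an ordinary callback (`send` is a hypothetical stand-in that returns a "wait for completion" function, and the high-nibble-first 4bpp pixel order is an assumption):

```javascript
// Double-buffered unpack-and-send loop (steps 1-5 above, simulated).
// src: Uint8Array of 4bpp pixel data, pal: 16-entry palette of 16-bit colours.
// send(view) stands in for "kick off DMA" and returns a wait function.
function sendPaletted(src, pal, chunkBytes, send) {
  var bufs = [new Uint16Array(chunkBytes * 2), new Uint16Array(chunkBytes * 2)];
  var active = 0, waitPrev = null;
  for (var off = 0; off < src.length; off += chunkBytes) {
    var n = Math.min(chunkBytes, src.length - off);
    var buf = bufs[active];
    for (var i = 0; i < n; i++) {           // 1./3. unpack one chunk to 16 bit
      var b = src[off + i];
      buf[2 * i]     = pal[b >> 4];         // first pixel: high nibble (assumed)
      buf[2 * i + 1] = pal[b & 0x0F];       // second pixel: low nibble
    }
    if (waitPrev) waitPrev();               // 4. wait for previous DMA to finish
    waitPrev = send(buf.subarray(0, 2 * n)); // 2. kick off DMA from this buffer
    active ^= 1;                            // 5. next chunk goes to the other buffer
  }
  if (waitPrev) waitPrev();                 // drain the last transfer
}
```

While one buffer is "on the wire", the loop fills the other, which is exactly what makes the overlap between CPU and DMA possible.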

• I implemented something similar, but with larger chunks (1..10 kBytes); sending just 16 bits per DMA transfer is very slow. It might be even better to write directly to the SPI TX register? Maybe DMA double buffering (DBM) helps; I have not tried it yet. But I am not sure whether 16 bits are enough to get rid of the inter-byte gap (caused by the CPU load of polling for DMA completion). I think I will give it a try.

    I have identified another performance bottleneck: E.mapInPlace; I think its use of JsVars slows down the lookup. Using asm-coded functions specialized for 1/2/4bpp is about 20x faster.
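For comparison, here is what a specialized 1bpp lookup looks like in plain JavaScript (a sketch only; the actual speedup comes from doing the same fixed-shape loop in asm, and the MSB-first bit order is an assumption):

```javascript
// Specialized 1bpp -> 16bpp palette expansion: a fixed 8-pixels-per-byte
// loop with constant shifts/masks. A generic mapper like E.mapInPlace must
// handle arbitrary bit widths (and JsVar access), which is what costs time.
function map1to16(src, pal) {
  var dst = new Uint16Array(src.length * 8);
  for (var i = 0, o = 0; i < src.length; i++) {
    var b = src[i];
    for (var bit = 7; bit >= 0; bit--)   // MSB first (assumed pixel order)
      dst[o++] = pal[(b >> bit) & 1];
  }
  return dst;
}
```

The 2bpp and 4bpp variants follow the same pattern with 4 and 2 pixels per byte respectively.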

    1. Take 4bpp (?) data a chunk at a time and unpack it into another buffer with 16 bits.
    2. Kick off DMA from that buffer
    3. Take another 4bpp (?) data chunk and unpack it into another buffer with 16 bits.
    4. Wait for DMA to finish
    5. Goto 2

    As already mentioned, a full DMA setup per 16 bits is much too slow (because of the procedure needed to set up, start, and then stop DMA/SPI). So I tried the DMA double-buffer (DBM) feature. Basically this works nicely, but it has some drawbacks:

    1. With a 16-bit payload at the top speed of 12.5 Mbaud we have to feed new data to the DMA buffers every 1.28 µs; on the EspruinoWIFI (100 MHz, 1-2 cycles per instruction) that amounts to ~85 instructions. It seems that some IRQ out there (timer?) blocks my feeder for longer than that.
    2. Increasing the buffer 10x (e.g. 10x16 bits) works perfectly (6x16 does not), but then we have to waste a lot of (blocking) CPU cycles waiting for a buffer to become free for the next data.
    3. In fact, a fast palette lookup (I am not talking about E.mapInPlace) saves a lot of CPU load because it is much faster than the SPI. It is not my intention to waste those savings on blocking waits for background DMA ;)
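The timing budget in point 1 can be sanity-checked with a little arithmetic (assuming one SPI bit per baud; the 1-2 cycles/instruction figure comes from the post above, split at 1.5 here):

```javascript
// Time to shift out one 16-bit pixel at 12.5 Mbaud, and the resulting
// instruction budget for refilling a DMA buffer on a 100 MHz core.
var baud = 12.5e6, bits = 16, clk = 100e6;
var usPerPixel = bits / baud * 1e6;   // 16 / 12.5 Mbaud = 1.28 us per pixel
var cycles = clk * bits / baud;       // 128 CPU cycles in that window
var instructions = cycles / 1.5;      // ~85 instructions at 1.5 cycles each
```

So a single delayed interrupt handler can easily eat the whole refill window, which matches the observed gaps.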

    The best way of pushing out paletted image data seems to me:

    • unpalette the graphics data (typically 1/2/4bpp into 16bpp) chunk by chunk with a really fast lookup method
    • make each chunk as large as possible (typically 2..20 kBytes)
    • this allows JS code to run in parallel while the last chunk goes over the line

    Some simple measurements of the optimized asm 1/2/4-to-16-bit lookup functions show that unpaletting is typically 3..10 times faster than the net SPI transmission time. E.g. for 10k pixels @ 1bpp we hand >11 ms (12.80 - 1.44) of CPU time back to JS compared to any blocking method.

    function       pixels  lookup (ms)  SPI (ms)  speedup
    test_map1to16    1000         0.43      1.28     2.9x
    test_map1to16   10000         1.44     12.80     8.9x
    test_map2to16    1000         0.44      1.28     2.9x
    test_map2to16   10000         1.56     12.80     8.2x
    test_map4to16    1000         0.48      1.28     2.7x
    test_map4to16   10000         1.87     12.80     6.8x
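As a cross-check, the table's columns are internally consistent with 16 bits per pixel on the wire at 12.5 Mbaud:

```javascript
// Verify the 10000-pixel @ 1bpp row of the table.
var lookupMs = 1.44, spiMs = 12.80;   // measured columns from the table
var ratio = spiMs / lookupMs;         // speedup of lookup vs SPI, ~8.9x
var savedMs = spiMs - lookupMs;       // CPU time handed back to JS, 11.36 ms
// The SPI column itself matches 16 bits/pixel at 12.5 Mbaud:
var predictedSpiMs = 10000 * 16 / 12.5e6 * 1e3;   // 12.8 ms
```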
    
@Gordon started