-
I implemented something similar, but with larger chunks (1..10 kBytes); sending just 16 bits per DMA transfer is very slow. It might be even better to write directly to SPI TX? Maybe DMA double buffering (DBM) helps; I have not tried it yet. But I am not sure 16 bits are enough to get rid of the inter-byte gap (caused by the CPU load of polling for DMA-ready). I think I will give it a try.
I have identified another performance bottleneck: E.mapInPlace; I think its use of JsVars slows down the lookup. Using asm-coded specialized functions for 1/2/4bpp is about 20x faster.
-
1. Take 4bpp (?) data a chunk at a time and unpack it into another buffer with 16 bits.
2. Kick off DMA from that buffer.
3. Take another 4bpp (?) data chunk and unpack it into the other buffer with 16 bits.
4. Wait for DMA to finish.
5. Go to 2.
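The ping-pong pattern in the steps above can be sketched like this. The DMA/SPI calls are stubbed out with hypothetical placeholders (startDMA/waitDMA/unpack are not a real Espruino API); only the interleaving of unpacking and transfers is the point:

```javascript
// Sketch of the double-buffered chunk pipeline: while one buffer is on
// the wire, the CPU unpacks the next chunk into the other buffer.
var log = [];                      // records the interleaving for inspection

function startDMA(buf) { log.push("tx:" + buf.id); }   // stub: begin transfer
function waitDMA()     { log.push("wait"); }           // stub: block until done
function unpack(chunk, buf) {                          // stub: 4bpp -> 16bpp
  log.push("unpack:" + chunk + "->" + buf.id);
}

function sendImage(chunks) {
  var bufs = [{ id: "A" }, { id: "B" }], cur = 0, busy = false;
  for (var i = 0; i < chunks.length; i++) {
    unpack(chunks[i], bufs[cur]);  // steps 1/3: unpack while DMA runs
    if (busy) waitDMA();           // step 4: wait for previous transfer
    startDMA(bufs[cur]);           // step 2: kick off DMA from this buffer
    busy = true;
    cur ^= 1;                      // swap buffers, loop back to step 2
  }
  if (busy) waitDMA();             // drain the final transfer
}

sendImage([0, 1, 2]);
```

Each chunk is unpacked before the previous transfer is waited on, so the CPU work overlaps with the wire time.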
As already mentioned, a full DMA setup per 16-bit word is much too slow (because of the procedure needed to set up, start, and then stop DMA/SPI). So I tried it with the DMA double-buffer feature. Basically this works nicely, but it has some drawbacks:
- With a 16-bit payload at the top speed of 12.5 Mbaud, we have to feed new data to the DMA buffers every 1.28 µs; on the EspruinoWIFI (100 MHz, 1-2 cycles per instruction) that amounts to ~85 instructions. It seems that some IRQ out there (a timer?) blocks my feeder for longer than that.
- Increasing the buffer size 10x (i.e. 10x16 bits) works perfectly (6x16 does not), but then we waste a lot of (blocking) CPU cycles waiting for a buffer to become free for the next data.
- In fact, a fast palette lookup (I am not talking about E.mapInPlace) saves a lot of CPU load because it is much faster than the SPI. It is not my intention to waste those savings again on blocking waits for background DMA ;)
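A quick back-of-the-envelope check of the budget above (the 12.5 Mbaud SPI clock and 100 MHz CPU clock are from the post; the rest is derived, with ~1.5 cycles/instruction assumed as the midpoint of 1-2):

```javascript
// How much time (and how many instructions) one 16-bit word buys us.
var baud = 12.5e6;                 // SPI clock, bits per second
var bitsPerWord = 16;
var wordTime = bitsPerWord / baud; // seconds per 16-bit payload -> 1.28 us

var cpuHz = 100e6;
var cycles = wordTime * cpuHz;     // CPU cycles available per word -> 128
var instrs = cycles / 1.5;         // at ~1.5 cycles/instruction -> ~85
```

Any IRQ that runs longer than that window makes the feeder miss a refill, which matches the observed gaps.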
The best way of pushing out paletted image data seems to me to be:
- Unpalette the graphics data (typically 1/2/4bpp into 16bpp) chunk by chunk with a really fast lookup method
- Make each chunk as large as possible (typically 2..20 kBytes)
- This allows JS code to run in parallel while the last chunk goes over the line
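For illustration, here is a plain-JS sketch of the 4bpp case of such a lookup (on the device this inner loop would be the hand-written asm; function and variable names are made up, not an Espruino API):

```javascript
// Unpack 4bpp indexed pixels into 16bpp (e.g. RGB565) words via a
// 16-entry palette table. Each source byte holds two pixels.
function unpack4to16(src, palette, dst) {
  // src: Uint8Array of 4bpp data, dst: Uint16Array (2 pixels per src byte)
  for (var i = 0; i < src.length; i++) {
    var b = src[i];
    dst[2 * i]     = palette[b >> 4];   // high nibble = first pixel
    dst[2 * i + 1] = palette[b & 0x0F]; // low nibble  = second pixel
  }
  return dst;
}

// Usage: convert one small chunk
var palette = new Uint16Array(16);
palette[0] = 0x0000;  // black in RGB565
palette[1] = 0xF800;  // red
palette[2] = 0xFFFF;  // white
var chunk = new Uint8Array([0x12, 0x20]);  // pixel indices 1,2,2,0
var out = unpack4to16(chunk, palette, new Uint16Array(4));
```

The 1bpp and 2bpp variants differ only in the shift/mask per pixel and the number of pixels per source byte.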
Some simple measurements of optimized asm 1/2/4-to-16-bit lookup functions show that unpaletting is typically 3..10 times faster than the net SPI transmission time. E.g. for 10k pixels @ 1bpp we give >11 ms (12.80 - 1.44) of CPU time back to JS compared to any blocking method.
function         pixels       lookup    SPI tx     ratio
test_map1to16     1000 pxls   0.43 ms    1.28 ms   2.9x
test_map1to16    10000 pxls   1.44 ms   12.80 ms   8.9x
test_map2to16     1000 pxls   0.44 ms    1.28 ms   2.9x
test_map2to16    10000 pxls   1.56 ms   12.80 ms   8.2x
test_map4to16     1000 pxls   0.48 ms    1.28 ms   2.7x
test_map4to16    10000 pxls   1.87 ms   12.80 ms   6.8x
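The SPI column of these measurements is consistent with the raw wire time at 12.5 Mbaud; a quick check:

```javascript
// Raw SPI transmission time for N 16-bit pixels at 12.5 Mbaud, and the
// resulting speedup ratio against a measured lookup time from the table.
function spiMs(pixels) { return pixels * 16 / 12.5e6 * 1000; }

var ratio = spiMs(10000) / 1.44;  // test_map1to16 @ 10k pixels -> ~8.9x
```

So the ratio column is simply (wire time) / (lookup time), i.e. how much of the transmission window the CPU gets back.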
Just a thought - you could write some assembler code that does basically what the ILI9341pal driver does, but with DMA, along the lines of the steps above.
Obviously you've got your current solution with the nice fonts, so it's not a big deal, but that could end up being really interesting.