SPI performance on mono SH1106 128x64 screen - bulk SPI transfer

  • Hi, sorry if this has been covered before. I have an SH1106 OLED display attached to Espruino and it's doing buffered graphics just fine, but the performance is not stellar.

    I haven't timed it precisely, but it looks to take around 50ms to update the screen, which gives 20fps.

    I know a little bit about SPI, and I can see some discussion in issues and elsewhere noting that Espruino sends a byte at a time when it could be bulk-sending data to avoid the handshake overhead. I also noted the comments at https://github.com/espruino/Espruino/issues/695 that memory is a constraint.

    Is it the case that the Espruino can't spare 1k of data for the SPI bulk transfer (on a mono 64x128 screen)? Or wouldn't it help if the transfer were chunked into pieces, even if they were only e.g. 128 bytes?

    Thanks for a great project.
    Alfie.

  • Hi - are you sure it's the transfer speed, or is it actually updating the graphics itself that takes the time?

    You could make sure you're on version 1v96 or later as I did some stuff to improve graphics speed on that.

    SPI transfer speed itself is still pretty reasonable. The bug you point to was really about ESP8266, because the API they provided for SPI was slow for single bytes. On STM32 especially the SPI is reasonably fast: it'll push out 4Mbps if I recall, so it's unlikely to be the issue with your display.

    You can increase the SPI clock rate really easily when you set SPI up, so that's a really easy boost: http://www.espruino.com/Reference#l_SPI_setup

    The default is 100k, but you could push that up to 1M pretty easily, and the display may even take 4M.
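
To put those clock rates in context, here is a rough sketch of the best-case time a full frame spends on the wire. It assumes a 1-bit-per-pixel 128x64 buffer (1024 bytes) and ignores command bytes and any gaps between writes, so real updates will be slower. The function name minTransferMs is made up for this illustration.

```javascript
// Lower bound on transfer time: every byte needs 8 clock cycles on the wire.
// Ignores command bytes, chip-select handling and gaps between writes.
function minTransferMs(bytes, baud) {
  return (bytes * 8 / baud) * 1000;
}

var FRAME_BYTES = 128 * 64 / 8; // 1024 bytes for a 1bpp 128x64 buffer

[100000, 1000000, 4000000].forEach(function (baud) {
  console.log(baud + " baud: " + minTransferMs(FRAME_BYTES, baud).toFixed(2) + " ms");
});
```

At the 100k default a frame cannot go out in much under 82ms, while at 1M and 4M the wire time drops to roughly 8ms and 2ms, so beyond that point any remaining delay comes from somewhere other than the clock rate.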

    This is the driver for the display: http://www.espruino.com/modules/SH1106.js

    It does send data in chunks, but IMO it could be faster. You could try adding something like this after initialising the graphics and see if that makes a difference?

    g.flip = function() {
      var page = 0xB0;
      if (cs) digitalWrite(cs,0);
      var l = (C.OLED_WIDTH * initCmds[4] / 8);
      for (var p=0; p<l; p+=C.OLED_WIDTH) {
        spi.write([page++, 0x02, 0x10],dc);// display is centred in RAM
        spi.write(new Uint8Array(this.buffer,p,C.OLED_WIDTH));
      }
      if (cs) digitalWrite(cs,1);
    };
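
The structure of that loop can be sketched on its own, without any hardware. The version below only builds the list of writes the loop would perform, assuming a 128x64 buffer stored as 8 pages of 128 bytes and the same 2-column offset as the snippet above. The names pageWrites, WIDTH, HEIGHT and COL_OFFSET are invented for this sketch and are not part of the driver.

```javascript
// Assumptions: 128x64 1bpp buffer laid out as 8 pages of 128 bytes, and a
// 2-column offset because the panel is centred in the controller's RAM.
var WIDTH = 128, HEIGHT = 64, COL_OFFSET = 2;

// Return one {cmd, data} pair per display page without copying the buffer.
function pageWrites(buffer) {
  var writes = [];
  var pages = HEIGHT / 8;
  for (var p = 0; p < pages; p++) {
    writes.push({
      // page address, column low nibble, column high nibble
      cmd: [0xB0 + p, 0x00 | (COL_OFFSET & 0x0F), 0x10 | (COL_OFFSET >> 4)],
      // a view onto this page's 128 bytes (no copy is made)
      data: new Uint8Array(buffer, p * WIDTH, WIDTH)
    });
  }
  return writes;
}

var fb = new ArrayBuffer(WIDTH * HEIGHT / 8);
var w = pageWrites(fb);
console.log(w.length, w[0].cmd, w[7].cmd, w[7].data.length);
```

Because each data entry is a view onto the existing buffer rather than a copy, a full update costs 16 writes (8 short command writes and 8 data writes of 128 bytes) and no extra frame-sized allocation.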
    
  • Hi Gordon, thanks for the reply. I'll give those things a try and report back. Perhaps the ESP8266 issue is still present for ESP32, because I ran up the same code on my ESP32 dev board and it was noticeably slower than the Pico, but I haven't looked into whether this is a simple SPI clock frequency problem...

  • Ahh, it could well be an ESP32 SPI issue in your case - the folks working on it are doing a great job (especially as they're just doing it for fun), but I think the focus is still on getting things supported and bug-free, and not on outright speed.

  • Am focused on the Pico, trying to see what I can achieve with this little OLED screen.

    I updated the firmware - no noticeable difference. I already had the baud option set very high.

    I forked and updated the driver and made your suggested code fix and... it's definitely better! Maybe 30% faster.

    Can it go faster, or is this the most we can expect? I assume the display is keeping up, otherwise the signal would be corrupted?

    Is there a simple way to tell if graphics is the bottleneck?

    Thanks!
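
One simple way to separate the two costs is to time the drawing calls and the flip independently. The helper below is a minimal sketch: meanMs is a made-up name, and it assumes the clock function returns seconds, which is what Espruino's getTime() does. Off the board it falls back to Date.now().

```javascript
// Time a function over several runs and return the mean in milliseconds.
// `now` must return seconds; on Espruino that is getTime(), elsewhere we
// fall back to Date.now().
function meanMs(fn, runs, now) {
  if (!now) {
    now = (typeof getTime === "function")
      ? getTime
      : function () { return Date.now() / 1000; };
  }
  var start = now();
  for (var i = 0; i < runs; i++) fn();
  return (now() - start) * 1000 / runs;
}

// Hypothetical usage on the board, where g is the Graphics instance:
//   print("draw", meanMs(function () { g.clear(); g.drawString("Hi", 0, 0); }, 20));
//   print("flip", meanMs(function () { g.flip(); }, 20));
```

If the flip figure dominates, the time is going into the transfer; if the draw figure dominates, the graphics calls are the bottleneck.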

  • "Ahh, it could well be an ESP32 SPI issue in your case"

    The ESP32 esp-idf provides a function to write x bytes, but the Espruino code calls it in a loop a single byte at a time, so this would have to be rewritten to support multi-byte sends.

    I'm sure it would be faster then!

  • I had a very very brief look today with a logic analyser and there do appear to be gaps in the traffic, but I need to look a bit closer and provide some proper evidence back.

  • Spent some time this evening and there's no real issue with the Pico SPI performance. I can get screen repaint (at least the SPI portion) down to 20ms at 800k baud. Limit seems to be around 10ms. Attached is a quick write-up. Would be good to understand if/why Graphics class is holding things up.

    I will look at reproducing with the ESP32!


    1 Attachment

  • Nice - thanks! So you're finding that at 4Mbps it's not able to saturate the link?

    I think to top that we'd probably have to look at DMA - but a 20ms update isn't bad, especially since the LCDs usually blur when you update them much above 20fps.

    You might be able to squeeze a bit more out of the 'flip' routine by unrolling the loop and doing some things like that though - I imagine there are still gaps from the JS execution speed.

  • 20ms is definitely ok provided the Javascript isn't adding too much. I'll do some more playing around.

    At 4Mbps it seemed to me there were two issues:

    • At a macro level, the time taken between each scan line of data, ie. each spi.write() becomes significant compared to the time taken to send a Uint8Array (single scan line)
    • At a micro level, the time taken between each byte becomes noticeable
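
Those two effects can be put into a rough model: total frame time is the time on the wire, plus a fixed gap per spi.write() call, plus a small gap per byte. The sketch below is only an illustration; frameTimeMs is a made-up name and the gap values are placeholders, not measurements from the captures.

```javascript
// Model: total = wire time + per-write gap + per-byte gap.
// All gap values passed in below are placeholders, not measurements.
function frameTimeMs(opts) {
  var wire = opts.bytes * 8 / opts.baud * 1000;
  var macro = opts.writes * opts.gapPerWriteMs; // gaps between spi.write() calls
  var micro = opts.bytes * opts.gapPerByteMs;   // gaps between individual bytes
  return { wire: wire, macro: macro, micro: micro, total: wire + macro + micro };
}

var est = frameTimeMs({
  bytes: 1024,
  baud: 4000000,
  writes: 16,           // 8 pages x (command write + data write)
  gapPerWriteMs: 0.2,   // placeholder
  gapPerByteMs: 0.002   // placeholder
});
console.log(est);
```

The wire term shrinks as the baud rate goes up while the two gap terms stay fixed, which is why they only become noticeable at the higher clock rates.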
  • I played around with the ESP32 for interest. One really obvious thing is that the baud setting doesn't seem to have any effect. It seems pinned at 100kHz.

    The other is that there is a big gap between each byte even at this low frequency, which I guess relates to @Wilberforce's comment above.

    See pic.


    1 Attachment

    • esp32-capture.png
  • Yes I looked at that code too so not sure what is wrong. I might try writing a low level esp-idf app and check I can get higher speeds out of it.

  • I'm wondering if once the speed is set, it stays the same. Perhaps try setting to 400000 after a fresh boot?

  • It'd be interesting to try software SPI on ESP32. It'd almost certainly beat 100kHz and the time between bytes will probably be shorter too.

  • Hi @Wilberforce, I tried your suggestion. I set it to 1MHz after a clean boot and the clock frequency is now correct, but there are HUGE gaps between bytes. See pics -- the first shows 1MHz for clock pulses and the second shows some bytes sent with the big gaps between.


    2 Attachments

    • ESP32-1mhz.png
    • ESP32-1mhz-gaps.png
  • Hi @Gordon, I tried your suggestion too and can get the clock up to around 700kHz. The bytes sent for each scan line look good, but there are big gaps in between while the control bytes are sent... see pic (a single complete page redraw, all 0xFF).

    The good news is that a total screen redraw takes around 40ms, so the frame rate is around 25fps. Not so bad.


    1 Attachment

    • ESP32-software-spi.png
  • @Wilberforce perhaps this gives a clue about the HW SPI performance issues on ESP32...?

    https://github.com/espressif/esp-idf/issues/368

  • I hacked in spi_device_queue_trans instead of spi_device_transmit. I got some unpredictable results at slower speeds, but at 3MHz and 4MHz it was stable and improved speed by about 2x. There are still big gaps though.

    Attached are pics of two captures. Both are with a 4MHz clock requested. The slower one is the current code and the faster one is using spi_device_queue_trans. Like I said, I hacked it in, so I don't think you could use this to send and receive data -- only send.

    The faster capture shows roughly the same transfer time (40ms) as software SPI, so it should be possible to improve on it.

    I think ultimately the Espruino calling code should be changed to support multi-byte transfer. Faster updates should then be possible, thus providing more time between updates for Espruino/game code. For some scenarios the fact that spi_device_queue_trans is non-blocking might be useful - not sure it is for regular apps though as you generally want to wait for flip() to complete.

    Anyway hope this is useful / of interest!


    2 Attachments

    • ESP32-4mhz-spi_device_queue_trans.png
    • ESP32-4mhz-spi_device_transmit.png
  • Thanks - yes! Multi-byte transfer isn't as easy as just refactoring to use the call, as memory isn't guaranteed to be in one flat area. If you have to allocate a flat area of memory and then copy data into it then it'll actually take longer on most platforms - it's just that the time will all be concentrated before the transmission starts rather than in between bytes.
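
The copy step described here can be pictured with plain typed arrays. This is only an illustration of "allocate a flat area, then copy into it"; the separate chunks are a stand-in and do not reflect how the interpreter actually stores data, and flatten is a made-up name.

```javascript
// Copy a list of separate chunks into one contiguous Uint8Array.
function flatten(chunks) {
  var total = chunks.reduce(function (n, c) { return n + c.length; }, 0);
  var flat = new Uint8Array(total);  // the extra flat allocation
  var offset = 0;
  chunks.forEach(function (c) {
    flat.set(c, offset);             // the up-front copy cost
    offset += c.length;
  });
  return flat;
}

var flat = flatten([
  new Uint8Array([1, 2]),
  new Uint8Array([3]),
  new Uint8Array([4, 5, 6])
]);
console.log(flat.length);
```

Every byte is touched once before a single bit reaches the bus, which is the up-front cost being traded against the gaps between bytes.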

    Realistically we'd be best off going straight for a function that sent SPI via DMA. We could then boost the standard SPI.send where possible, keep neopixel.write the same for all platforms, and add a new SPI.sendAsync or something that'd expose the fast functionality.

    If graphics drivers could do that then the next chunk of data could be prepared while the current one was sending.

  • Hi @Gordon, a DMA option sounds good longer term. I read a little about it in the ESP32 docs. You're recommended to allocate the memory for the SPI transfers with pvPortMallocCaps(size, MALLOC_CAP_DMA). It should automatically use DMA if it can.

    Where would I find equivalent info for the STM32 / Espruino? I'm keen to do a little hacking and would prefer to start with the Pico.

    It occurs to me it would be dangerous to return from the flip method while the transfer is still happening, because then the app might start modifying the memory that's still being written.

    Thanks
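
The hazard with returning early can be shown without any hardware. In the sketch below, queueSend is a hypothetical stand-in for a background sender that has accepted data but not yet put it on the wire; it simply holds on to whatever it was given.

```javascript
// Hypothetical stand-in for a background sender: it only records what it was
// given, as if the bytes had been queued but not yet transmitted.
var pending = [];
function queueSend(bytes) { pending.push(bytes); }

var frame = new Uint8Array([1, 2, 3, 4]);

queueSend(frame);                  // by reference: later edits leak into the transfer
queueSend(new Uint8Array(frame));  // snapshot: costs a copy, but is safe to reuse

frame[0] = 99;                     // the app starts drawing the next frame

console.log(pending[0][0], pending[1][0]); // prints 99 1
```

The by-reference entry now carries the modified byte, so a non-blocking flip needs either a snapshot like the second call or some way of waiting until the transfer has finished before drawing again.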

  • For the esp32 case it looks like this flag should be set:

    https://esp-idf.readthedocs.io/en/v2.0/api/peripherals/spi_master.html

    SPI_USE_TXDATA
    It says:
    Sometimes, the amount of data is very small making it less than optimal allocating a separate buffer for it. If the data to be transferred is 32 bits or less, it can be stored in the transaction struct itself. For transmitted data, use the tx_data member for this and set the SPI_USE_TXDATA flag on the transmission. For received data, use rx_data and set SPI_USE_RXDATA. In both cases, do not touch the tx_buffer or rx_buffer members, because they use the same memory locations as tx_data and rx_data.

  • In the case of flip it should be reasonably easy - make it run in the background, but block if another SPI.send was in progress. That way you could just put a SPI.send([]) at the end to force it to wait for all the active sends to finish before returning.

    In terms of STM32 support, I'd look at the tv library - that uses DMA SPI for the TV output. However it's potentially a bit tricky choosing the correct DMA device in a way that works across all versions of STM32.

    Then there's nRF52 (and also STM32LL) as well - I believe there are some drivers that handle DMA'd SPI. Last time I looked there was a bug that meant that 1 byte sends failed, but in the newer SDKs I'm using now that's probably fixed.

  • I've experimented a little with the ESP32 using a custom Espruino class. I picked the ESP32 mainly because the DMA support is so much simpler. You just alloc some memory with the right flags and it handles the rest. I did look at doing this on STM32 but it's quite a bit more involved.

    The main challenge was understanding the spi_device_queue_trans call and its limitations. For the SH1106 we need to send 1024 bytes in 16-byte pages, i.e. 64 sends. The ESP32 won't allow you to queue 64 transactions, so I split these into 4 sets of 16 transactions and, between each set of 16, we wait for the transactions to complete before proceeding with the next 16.
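
The batching logic on its own looks roughly like the sketch below. It is written in JavaScript with mock callbacks rather than the real ESP-IDF calls: queue and waitAll are hypothetical stand-ins for queueing one transaction and waiting for the queued ones to finish, and sendInBatches is a made-up name.

```javascript
// Split `total` transactions into batches of at most `limit`, calling
// queue(i) for each one and waitAll() after every batch.
// queue and waitAll are hypothetical callbacks standing in for driver calls.
function sendInBatches(total, limit, queue, waitAll) {
  for (var start = 0; start < total; start += limit) {
    var end = Math.min(start + limit, total);
    for (var i = start; i < end; i++) queue(i);
    waitAll(); // drain the queue before filling it again
  }
}

// Mock driver that tracks how many transactions are in flight at once.
var inFlight = 0, maxInFlight = 0, sent = 0, waits = 0;
sendInBatches(64, 16,
  function () { inFlight++; sent++; maxInFlight = Math.max(maxInFlight, inFlight); },
  function () { inFlight = 0; waits++; });
console.log(sent, waits, maxInFlight); // prints 64 4 16
```

With 64 transactions and a limit of 16, the mock sees 4 waits and never more than 16 transactions in flight, matching the 4 sets of 16 described above.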

    To cut a long story short, I managed to send 1024 bytes in just over 1ms... see pic. This is with the clock running at 20MHz.

    I'm not sending any of the command data at the moment so that will add a bit more delay -- it will double the number of transactions.

    Some of the remaining overhead might be due to the mutex locks in ESP-IDF mentioned in the linked post above. The only way to remove this completely would be to bypass this layer and go direct.

    I notice that the SD1780 display (and possibly the SH1106 -- haven't checked) supports both horizontal and vertical addressing where the pages will automatically wrap, which should mean no control data between data transactions -- and possibly being able to send all the data in a single transaction, which would be super-quick. Will try that next.

    Not quite sure why I'm so obsessed with maximum speed but it's a fun journey...

    I have no idea how the existing layers in Espruino could be refactored to take advantage of bulk send/DMA. It's clear to me that on ESP32 at least the graphics buffer should be allocated in co-operation with the SPI sending code (ie. the display driver). I got a bit lost following the code under jswrap_arraybuffer_constructor (found via lcdInit_ArrayBuffer). This code seems to re-use allocation of strings in blocks which is clearly not very DMA-friendly. But I may have read it wrong.

    Thanks


    1 Attachment

    • ESP32-dma-1.png
  • Yes, in Espruino common memory management is done with chained blocks, to avoid some of the GC challenges that would otherwise arise and to support interpreting directly from source code, as Espruino does. Most recently, though, @Gordon has introduced the option of allocating memory as a contiguous string of bytes. I'm not familiar, though, with what would be required to make that available to DMA for the bulk sending you are looking for.

Posted by @jugglingcats