faster spiFlashWriteByte


5 years ago •

Gordon
Hi,

Honestly, I think the only way to really be sure with stuff like this is to benchmark, and also to look at the disassembly (... make lst). I just looked at the disassembly of the existing spiFlashWrite:
```
00044f88 <spiFlashWrite>:
static void spiFlashWrite(unsigned char *tx, unsigned int len) {
   44f88:	b5f0      	push	{r4, r5, r6, r7, lr}
   44f8a:	4401      	add	r1, r0
   44f8c:	f04f 43a0 	mov.w	r3, #1342177280	; 0x50000000
   44f90:	f04f 6400 	mov.w	r4, #134217728	; 0x8000000
   44f94:	f44f 2500 	mov.w	r5, #524288	; 0x80000
    int data = tx[i];
   44f98:	f810 6b01 	ldrb.w	r6, [r0], #1
    for (int bit=7;bit>=0;bit--) {
   44f9c:	2207      	movs	r2, #7
      nrf_gpio_pin_write((uint32_t)pinInfo[SPIFLASH_PIN_MOSI].pin, (data>>bit)&1 );
   44f9e:	fa46 f702 	asr.w	r7, r6, r2
    if (value == 0)
   44fa2:	07ff      	lsls	r7, r7, #31
    p_reg->OUTCLR = clr_mask;
   44fa4:	bf54      	ite	pl
   44fa6:	f8c3 450c 	strpl.w	r4, [r3, #1292]	; 0x50c
    p_reg->OUTSET = set_mask;
   44faa:	f8c3 4508 	strmi.w	r4, [r3, #1288]	; 0x508
    for (int bit=7;bit>=0;bit--) {
   44fae:	f112 32ff 	adds.w	r2, r2, #4294967295	; 0xffffffff
   44fb2:	f8c3 5508 	str.w	r5, [r3, #1288]	; 0x508
    p_reg->OUTCLR = clr_mask;
   44fb6:	f8c3 550c 	str.w	r5, [r3, #1292]	; 0x50c
   44fba:	d2f0      	bcs.n	44f9e <spiFlashWrite+0x16>
  for (unsigned int i=0;i<len;i++) {
   44fbc:	4281      	cmp	r1, r0
   44fbe:	d1eb      	bne.n	44f98 <spiFlashWrite+0x10>
}
   44fc0:	bdf0      	pop	{r4, r5, r6, r7, pc}
```
And it seems reasonably tight - everything is already inlined. I'm not sure you're really going to see a noticeable difference in speed, especially as I believe IO is stuck at 16MHz. If it did turn out that unrolling the loop was significantly faster then I'd be interested, but if at all possible I'd like to just get the compiler to do it so it isn't such a maintenance headache: https://stackoverflow.com/questions/4071690/tell-gcc-to-specifically-unroll-a-loop
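
As a rough sketch of what compiler-driven unrolling could look like: the loop below has the same shape as spiFlashWrite, with `#pragma GCC unroll` (available in GCC 8 and later) asking the compiler to unroll the 8-bit inner loop. This is not the Espruino source. The pin masks are the two constants visible in the disassembly above, and `pin_set`, `pin_clr`, `fake_out` and `rx_log` are hypothetical stand-ins that simulate the GPIO port so the code builds on a PC.

```c
#include <assert.h>
#include <stdint.h>

/* Masks taken from the constants in the disassembly; treat them as assumptions. */
#define MOSI_MASK 0x8000000u
#define SCK_MASK  0x80000u

static uint32_t fake_out;    /* simulated GPIO output latch (hypothetical) */
static uint8_t  rx_log[16];  /* bytes the simulated flash chip has received */
static unsigned rx_bits;     /* number of clock pulses seen so far */

/* Stand-in for a write to OUTSET. The simulated chip samples MOSI
 * on the rising edge of SCK, MSB first. */
static inline void pin_set(uint32_t mask) {
  if ((mask & SCK_MASK) && !(fake_out & SCK_MASK)) {
    unsigned idx = rx_bits / 8;
    assert(idx < sizeof(rx_log));
    rx_log[idx] = (uint8_t)((rx_log[idx] << 1) | ((fake_out & MOSI_MASK) ? 1 : 0));
    rx_bits++;
  }
  fake_out |= mask;
}

/* Stand-in for a write to OUTCLR. */
static inline void pin_clr(uint32_t mask) { fake_out &= ~mask; }

/* Same structure as spiFlashWrite, but the inner loop carries an unroll hint. */
static void spi_write_unrolled(const unsigned char *tx, unsigned int len) {
  for (unsigned int i = 0; i < len; i++) {
    int data = tx[i];
#pragma GCC unroll 8
    for (int bit = 7; bit >= 0; bit--) {
      if ((data >> bit) & 1) pin_set(MOSI_MASK);
      else                   pin_clr(MOSI_MASK);
      pin_set(SCK_MASK);   /* clock high: chip samples MOSI */
      pin_clr(SCK_MASK);   /* clock low again */
    }
  }
}
```

The pragma is only a hint, so whether the loop really ends up unrolled (and whether that is any faster) still has to be checked in the disassembly and with a benchmark.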

I don't know for sure what the flash chip used in the Bangle is, I'm afraid - as far as I can tell it's marked A1Y200.

Honestly, I developed the whole thing from scratch on one Bangle.js device which is still working fine, so I don't think you'll brick your Bangle if you mess with the flash code. The bootloader doesn't use flash at all, so you should still be able to get to it regardless of what you change.

I'd strongly suggest you open your Bangle and attach an nRF52DK debugger to it though - it'll make uploading firmware so much faster.

Another option is to use something like an MDBT42 breakout, attach it to an external flash chip, and try using 'Inline C' in the Web IDE to see how fast you can get the IO going.

Some things that I think really might increase speed though:
- Create a spiFlashRead so that when you're reading data from flash you're not also writing - that would be a really easy win
- Keep CS asserted on the flash chip after a read, and if the next byte(s) requested come right after the previous bytes requested then you don't need to send the 4 bytes of address again - you just clock out more data. That should be pretty easy as well.
- Merging the writes - don't use OUTSET and OUTCLR but write directly to OUT - you may be able to change the clock and data at the same time on one of the clock edges, which would help to overcome the 16MHz limitation.
- The original watch firmware actually used QSPI (so 4 data wires) but it did it in software so it was basically slower than just bit-banging with 1 data bit! It's a good example of why you need to actually benchmark before optimising :) With some thought you may be able to do it properly though - using multiply + shift (or just a lookup table) to get the bits in the correct places for their pins with only a few clock cycles (taking some inspiration from https://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith64Bits).
- Using hardware SPI - you may be able to get this faster than bit-banged SPI, but supposedly the max bitrate is still 8Mbps
- Preloading with hardware SPI and DMA - a lot of the time when executing from flash, Espruino will load data in chunks of 16 bytes. You could potentially pre-load the next 16 bytes using DMA while Espruino runs through the first 16 bytes
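
To make the "keep CS asserted" idea from the list concrete, here is a minimal sketch run against a simulated flash chip. Everything in it is hypothetical (`flash_read`, `flash_end_stream`, `send_read_cmd` and the other helpers are not Espruino functions), and it assumes a read command of 4 bytes in total including the address, as described above. The only real logic is remembering where the last read ended and skipping the command when the next read starts there.

```c
#include <assert.h>
#include <stdint.h>

/* --- simulated flash chip and bus (all hypothetical) --- */
static uint8_t  flash_mem[256];
static unsigned bus_bytes;   /* total bytes clocked over SPI, for comparison */
static int      cs_low;      /* 1 while chip select is asserted */
static uint32_t chip_addr;   /* the chip's internal read pointer */

static void cs_assert(void)  { cs_low = 1; }
static void cs_release(void) { cs_low = 0; }

/* Read command plus address: assumed to cost 4 bytes on the bus. */
static void send_read_cmd(uint32_t addr) {
  assert(cs_low);
  bus_bytes += 4;
  chip_addr = addr;
}

/* Clock one data byte out of the chip; it auto-increments its address. */
static uint8_t clock_in_byte(void) {
  assert(cs_low);
  bus_bytes++;
  return flash_mem[chip_addr++ % sizeof(flash_mem)];
}

/* --- the caching logic --- */
static int      stream_open;  /* 1 if CS was left asserted after a read */
static uint32_t stream_next;  /* address the chip would output next */

static void flash_read(uint32_t addr, uint8_t *buf, uint32_t len) {
  if (!stream_open || addr != stream_next) {
    /* Not a continuation: restart the transaction with a fresh address. */
    if (stream_open) cs_release();
    cs_assert();
    send_read_cmd(addr);
    stream_open = 1;
  }
  for (uint32_t i = 0; i < len; i++) buf[i] = clock_in_byte();
  stream_next = addr + len;   /* CS stays asserted for a possible follow-on read */
}

/* Would need calling before any other flash command (write, erase, status). */
static void flash_end_stream(void) {
  if (stream_open) { cs_release(); stream_open = 0; }
}
```

In this simulation two back-to-back 16 byte reads cost 36 bus bytes rather than 40. The catch is that every other flash operation has to close the stream first, so it is worth benchmarking on real hardware before relying on it.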
